Observability & tracing
You can't debug an agent you can't see. Traces capture every LLM call, tool call, token, and millisecond of a run — so you can find the retry loop, the wrong tool, and the cost spike.
What you'll learn
- What a trace captures and why agents need one
- How to read a trace tree to find loops, wrong tools, and cost spikes
- The tooling — LangSmith, Langfuse, Arize — and OpenTelemetry
Before you start
An agent run is a tree of nested calls — the agent calls the model, which picks a tool, which returns a result, which triggers another model call, and so on, sometimes dozens deep. When something goes wrong (a wrong answer, a $5 query, a 30-second hang), “read the final output” tells you nothing. You need to see the whole run. That’s observability, and for agents it’s not optional — it’s how you debug, evaluate, and control cost.
What a trace captures
A trace records the full execution as nested spans: one span per operation, each with timing, inputs, outputs, token counts, and cost.
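As a rough illustration of that structure, the sketch below models a run as nested spans and prints it as an indented tree. The `Span` fields, the span names, and all numbers are hypothetical, chosen for illustration rather than taken from any particular tracing tool.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Span:
    """One operation in a run (hypothetical schema, not a vendor format)."""
    name: str
    kind: str                 # "agent", "llm", or "tool"
    duration_ms: int
    tokens: int = 0
    cost_usd: float = 0.0
    children: list[Span] = field(default_factory=list)


def render(span: Span, depth: int = 0) -> list[str]:
    """Return the trace as indented lines, parents before children."""
    line = (f"{'  ' * depth}{span.kind}:{span.name} "
            f"{span.duration_ms}ms {span.tokens}tok ${span.cost_usd:.4f}")
    lines = [line]
    for child in span.children:
        lines.extend(render(child, depth + 1))
    return lines


def total_tokens(span: Span) -> int:
    """Sum tokens over a span and everything nested under it."""
    return span.tokens + sum(total_tokens(c) for c in span.children)


# Illustrative numbers only.
run = Span("answer_question", "agent", 4200, children=[
    Span("plan", "llm", 900, tokens=650, cost_usd=0.0033),
    Span("search_docs", "tool", 1200),
    Span("search_docs", "tool", 1150),   # same tool again: a possible retry
    Span("final_answer", "llm", 950, tokens=1400, cost_usd=0.0070),
])

print("\n".join(render(run)))
```

Real tracing tools store considerably more per span (inputs, outputs, errors), but the parent/child shape is the part that makes a run readable.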
With that view, the bugs that are invisible in the final output jump out:
- A retry loop burning latency and tokens (the highlighted span above).
- The wrong tool called, or a tool called with bad arguments.
- A cost spike — which span spent the tokens, and on what.
- An agent looping instead of converging, hitting its step budget.
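Several of these checks can be automated once spans are exported. The sketch below assumes a hypothetical flat export (a list of dicts with `kind`, `name`, `tokens`, and `args` keys); real tools use their own schemas, so treat the field names as placeholders.

```python
import json
from collections import Counter

# Hypothetical flat export of one run; values are made up for illustration.
spans = [
    {"id": 1, "kind": "llm", "name": "plan", "tokens": 600, "args": None},
    {"id": 2, "kind": "tool", "name": "search", "tokens": 0, "args": {"q": "refund policy"}},
    {"id": 3, "kind": "llm", "name": "plan", "tokens": 2400, "args": None},
    {"id": 4, "kind": "tool", "name": "search", "tokens": 0, "args": {"q": "refund policy"}},
    {"id": 5, "kind": "tool", "name": "search", "tokens": 0, "args": {"q": "refund policy"}},
]


def repeated_tool_calls(spans, min_repeats=2):
    """Return (tool, args_json, count) for identical tool calls seen repeatedly."""
    counts = Counter(
        (s["name"], json.dumps(s["args"], sort_keys=True))
        for s in spans
        if s["kind"] == "tool"
    )
    return [(name, args, n) for (name, args), n in counts.items() if n >= min_repeats]


def most_expensive(spans):
    """Return the single span that consumed the most tokens."""
    return max(spans, key=lambda s: s["tokens"])


def over_step_budget(spans, max_steps):
    """True if the run made more model calls than the allowed step budget."""
    llm_steps = sum(1 for s in spans if s["kind"] == "llm")
    return llm_steps > max_steps


print(repeated_tool_calls(spans))
print(most_expensive(spans)["id"])
print(over_step_budget(spans, max_steps=1))
```

Identical tool name plus identical arguments is only a heuristic for a retry loop; some repeats are legitimate, so a flagged span is a prompt to look at the trace, not a verdict.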
What to instrument
Log, for the whole run and every span: inputs/outputs, the model and prompt version, token counts (prompt + completion + any reasoning), latency, cost, tool name + arguments + result, and errors/retries. Tie it all to a trace id so a single run is one searchable object.
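One way to collect these fields is a small context manager wrapped around each operation. This is a minimal standard-library sketch, not the API of LangSmith, Langfuse, Arize, or OpenTelemetry; the `Tracer` class, the attribute names, and the model name are all hypothetical.

```python
import time
import uuid
from contextlib import contextmanager


class Tracer:
    """Collects span records for one run under a single trace id."""

    def __init__(self):
        self.trace_id = uuid.uuid4().hex
        self.spans = []    # finished spans, in completion order
        self._stack = []   # ids of currently open spans

    @contextmanager
    def span(self, name, **attrs):
        record = {
            "trace_id": self.trace_id,
            "span_id": uuid.uuid4().hex[:8],
            "parent_id": self._stack[-1] if self._stack else None,
            "name": name,
            "error": None,
            **attrs,
        }
        self._stack.append(record["span_id"])
        start = time.perf_counter()
        try:
            yield record   # caller attaches outputs, token counts, cost
        except Exception as exc:
            record["error"] = repr(exc)   # keep the failure on the span
            raise
        finally:
            record["latency_ms"] = (time.perf_counter() - start) * 1000
            self._stack.pop()
            self.spans.append(record)


tracer = Tracer()
with tracer.span("agent_run", input="What is our refund policy?"):
    with tracer.span("llm_call", model="example-model", prompt_version="v3") as s:
        s["prompt_tokens"], s["completion_tokens"] = 512, 64   # made-up numbers
    with tracer.span("tool_call", tool="search_docs", args={"q": "refund policy"}) as s:
        s["result"] = "3 documents"

for record in tracer.spans:
    print(record["name"], record["parent_id"], round(record["latency_ms"], 3))
```

Every record carries the same `trace_id`, so the whole run can be fetched as one object. The plain list used as a stack is not safe across threads or async tasks; a production setup would more likely rely on an existing tracing SDK than on a hand-rolled class like this.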
Next
Tracing also surfaces where the money goes — the bridge to cost & latency control.
Practice this in an interview
Tool calling extends the LLM's output space to include structured function invocations. The model emits a JSON object naming a tool and its arguments; the runtime executes the tool and feeds the result back as a new message. An agent is a loop that repeats this cycle (observe, think, act) until the task is complete or a stopping condition is met.
Function calling lets an LLM output a structured request to invoke an external function with arguments, which the runtime executes and feeds back, enabling agents to act in the world. Good tool design uses clear names and descriptions, minimal well-typed parameters, narrow single-purpose scope, least privilege, and informative error messages so the model can choose and call them reliably.
An agent is an LLM placed in a loop where it reasons, chooses and calls tools or actions, observes the results, and repeats until a goal is met, rather than producing one response and stopping. The key differences are autonomy, tool use, memory and state, and multi-step control flow driven by the model's own decisions.
Without labels, alerting relies on three proxy signal layers: input distribution tests, output score distribution tests, and business proxy metrics. You define thresholds on each layer pre-deployment and set up automated alerts so that degradation triggers investigation before it compounds.