Evaluating agents
An agent has a loop, side effects, and non-deterministic intermediate states — so a single input/output check isn't enough. Evaluate the trajectory and the outcome, build eval sets, and catch regressions when you upgrade a framework or model.
What you'll learn
- Why agents are harder to evaluate than a single LLM call
- Outcome vs trajectory evaluation, and what to score in each
- Building agent eval sets that catch regressions on upgrades
Before you start
A single LLM call has one input and one output — easy to test. An agent has a loop, takes actions with side effects, and passes through non-deterministic intermediate states. Two runs of the same task can take different paths and both be correct, or take the same path and one quietly does something dangerous. That’s why “it worked when I tried it” is not evaluation, and why eval is the discipline that most separates shipped agents from demos.
Outcome vs trajectory
You evaluate an agent on two axes:
- Outcome (did it succeed?) — task success, final-answer correctness, the side effect that should have happened did. This is the what.
- Trajectory (how did it get there?) — the sequence of steps: did it call the right tools with valid arguments, avoid unnecessary or dangerous actions, stay within a step budget, and not loop? This is the how.
You need both. An agent can reach a correct answer via a wasteful, expensive, or unsafe path — outcome looks fine, trajectory is a problem. And it can take a sensible path but get the final answer wrong.
Build an eval set
The mechanics mirror LLM evals, adapted for trajectories:
- Collect tasks — real scenarios with a known success condition (and, where you can, the expected tool sequence or key actions).
- Run the agent, capturing the full trajectory (every tool call + args + result), not just the final answer.
- Score — deterministic checks where possible (did it call
refund()with the right order id? did the record actually update?) plus an LLM-as-judge for open-ended quality, with the bias controls from the LLM-evals lesson. - Gate in CI — run the set on every prompt/model/framework change. This is how you catch the regression when, say, LangChain or the model version bumps.
Quick check
Quick check
Next
You can’t evaluate what you can’t see — observability & tracing captures the trajectories these evals score, and cost control keeps the agent affordable.
Practice this in an interview
All questionsOutcome evaluation checks whether the agent's final result is correct, while trajectory evaluation inspects the intermediate steps, tool calls, and decisions along the way. You need both because an agent can reach the right answer through a flawed path or fail despite sound reasoning; trajectory metrics catch wrong tool use, redundant steps, and loops that outcome-only metrics miss.
An agent is an LLM placed in a loop where it reasons, chooses and calls tools or actions, observes the results, and repeats until a goal is met, rather than producing one response and stopping. The key differences are autonomy, tool use, memory and state, and multi-step control flow driven by the model's own decisions.
Tool calling extends the LLM's output space to include structured function invocations. The model emits a JSON object naming a tool and its arguments; the runtime executes the tool and feeds the result back as a new message. An agent is a loop that repeats this cycle — observe, think, act — until the task is complete or a stopping condition is met.
ReAct interleaves reasoning traces with actions step by step, deciding the next tool call based on the latest observation. Plan-and-Execute first drafts a full multi-step plan and then executes it, which is more efficient and predictable for complex tasks but less adaptive, while Reflexion adds a self-reflection step where the agent critiques past failures and retries with that feedback.