datarekha

Evaluating agents

An agent has a loop, side effects, and non-deterministic intermediate states — so a single input/output check isn't enough. Evaluate the trajectory and the outcome, build eval sets, and catch regressions when you upgrade a framework or model.


What you'll learn

  • Why agents are harder to evaluate than a single LLM call
  • Outcome vs trajectory evaluation, and what to score in each
  • Building agent eval sets that catch regressions on upgrades

Before you start

A single LLM call has one input and one output, which makes it easy to test. An agent has a loop, takes actions with side effects, and passes through non-deterministic intermediate states. Two runs of the same task can take different paths and both be correct, or follow what looks like the same path while one of them quietly does something dangerous. That’s why “it worked when I tried it” is not evaluation, and why eval is the discipline that most separates shipped agents from demos.

Outcome vs trajectory

You evaluate an agent on two axes:

  • Outcome (did it succeed?): task success, final-answer correctness, and whether the intended side effect actually happened. This is the what.
  • Trajectory (how did it get there?): the sequence of steps. Did it call the right tools with valid arguments, avoid unnecessary or dangerous actions, stay within a step budget, and not loop? This is the how.

You need both. An agent can reach a correct answer via a wasteful, expensive, or unsafe path — outcome looks fine, trajectory is a problem. And it can take a sensible path but get the final answer wrong.

[Figure: step 1 → step 2 → step 3 → final answer. Trajectory eval: right tools? valid args? no loops? Outcome eval: task succeeded? answer correct?]
Score the trajectory (per-step: right tools, valid args, no loops) and the outcome (task success, correctness).
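
To make the two axes concrete, here is a minimal sketch in Python of scoring one recorded run on both. Everything in it is an illustrative assumption rather than a framework API: the `ToolCall` and `Run` shapes, the `TOOL_SCHEMA` table, and the tool names (`lookup_order`, `refund`, `delete_account`) are hypothetical, and the loop check simply flags an identical tool call made twice.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    name: str
    args: dict


@dataclass
class Run:
    calls: list[ToolCall]
    final_answer: str
    state: dict = field(default_factory=dict)  # world state after the run


# Hypothetical schema: tool name -> argument names that tool requires.
TOOL_SCHEMA = {"lookup_order": {"order_id"}, "refund": {"order_id", "amount"}}


def score_trajectory(run: Run, required: list[str], forbidden: set[str],
                     max_steps: int) -> dict:
    """Per-step checks: right tools, valid args, no dangerous calls, budget, no loops."""
    names = [c.name for c in run.calls]
    seen, repeated = set(), False
    for c in run.calls:
        # Assumes argument values are hashable (strings, numbers).
        key = (c.name, tuple(sorted(c.args.items())))
        repeated = repeated or key in seen
        seen.add(key)
    return {
        "required_tools_called": all(t in names for t in required),
        "valid_args": all(
            c.name in TOOL_SCHEMA and TOOL_SCHEMA[c.name].issubset(c.args)
            for c in run.calls
        ),
        "no_forbidden_tools": not (set(names) & forbidden),
        "within_budget": len(run.calls) <= max_steps,
        "no_repeated_calls": not repeated,
    }


def score_outcome(run: Run, expected_state: dict, answer_must_contain: str) -> dict:
    """End-state checks: the side effect happened and the answer is right."""
    return {
        "side_effect_happened": all(
            run.state.get(k) == v for k, v in expected_state.items()
        ),
        "answer_correct": answer_must_contain.lower() in run.final_answer.lower(),
    }


# A run that succeeds but wastes a step by repeating the same lookup.
run = Run(
    calls=[
        ToolCall("lookup_order", {"order_id": "A17"}),
        ToolCall("lookup_order", {"order_id": "A17"}),  # redundant repeat
        ToolCall("refund", {"order_id": "A17", "amount": 20}),
    ],
    final_answer="Refunded order A17.",
    state={"A17": "refunded"},
)

trajectory = score_trajectory(
    run, required=["lookup_order", "refund"],
    forbidden={"delete_account"}, max_steps=5,
)
outcome = score_outcome(
    run, expected_state={"A17": "refunded"}, answer_must_contain="refunded"
)
```

In this example every outcome check passes while `trajectory["no_repeated_calls"]` is false, which is the "outcome looks fine, trajectory is a problem" case. A substring match on the final answer is a deliberately crude stand-in; open-ended answers would need a stronger check.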

Build an eval set

The mechanics mirror LLM evals, adapted for trajectories:

  1. Collect tasks — real scenarios with a known success condition (and, where you can, the expected tool sequence or key actions).
  2. Run the agent, capturing the full trajectory (every tool call + args + result), not just the final answer.
  3. Score — deterministic checks where possible (did it call refund() with the right order id? did the record actually update?) plus an LLM-as-judge for open-ended quality, with the bias controls from the LLM-evals lesson.
  4. Gate in CI — run the set on every prompt/model/framework change. This is how you catch the regression when, say, LangChain or the model version bumps.
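
The four steps above can be sketched as a small harness. This is a sketch under stated assumptions, not a definitive implementation: the run format (a dict with `calls`, `final_answer`, and `state`), the two stand-in agents, and the helper names are all hypothetical, and only deterministic checks are shown (the LLM-as-judge part is left out). `upgraded_agent` simulates a regression in which the answer text still looks right but the tool call never happens.

```python
from typing import Callable

# A recorded run: {"calls": [...], "final_answer": str, "state": dict}
Run = dict
Check = Callable[[Run], bool]


def called_refund_for(order_id: str) -> Check:
    """Deterministic trajectory check: refund was called with this order id."""
    def check(run: Run) -> bool:
        return any(
            c["tool"] == "refund" and c["args"].get("order_id") == order_id
            for c in run["calls"]
        )
    return check


def record_updated(order_id: str) -> Check:
    """Deterministic outcome check: the record actually changed."""
    def check(run: Run) -> bool:
        return run["state"].get(order_id) == "refunded"
    return check


# Step 1: tasks with a known success condition.
EVAL_SET = [
    {"id": "refund-A17", "input": "refund order A17",
     "checks": [called_refund_for("A17"), record_updated("A17")]},
    {"id": "refund-B42", "input": "refund order B42",
     "checks": [called_refund_for("B42"), record_updated("B42")]},
]


def baseline_agent(task_input: str) -> Run:
    """Stand-in for the current agent (hypothetical)."""
    order_id = task_input.split()[-1]
    return {
        "calls": [{"tool": "refund", "args": {"order_id": order_id}}],
        "final_answer": f"Refunded order {order_id}.",
        "state": {order_id: "refunded"},
    }


def upgraded_agent(task_input: str) -> Run:
    """Stand-in for the agent after an upgrade: same answer text, no tool call."""
    order_id = task_input.split()[-1]
    return {"calls": [], "final_answer": f"Refunded order {order_id}.", "state": {}}


def run_eval(agent: Callable[[str], Run], eval_set: list[dict]) -> dict[str, bool]:
    results = {}
    for task in eval_set:
        run = agent(task["input"])  # Step 2: capture the full trajectory
        results[task["id"]] = all(check(run) for check in task["checks"])  # Step 3
    return results


def gate(results: dict[str, bool], min_pass_rate: float = 1.0) -> bool:
    """Step 4: pass only if enough tasks succeeded."""
    return sum(results.values()) / len(results) >= min_pass_rate


baseline_ok = gate(run_eval(baseline_agent, EVAL_SET))
upgraded_ok = gate(run_eval(upgraded_agent, EVAL_SET))
print(f"baseline passes gate: {baseline_ok}, upgraded passes gate: {upgraded_ok}")
```

In a CI job, the script would exit non-zero when the gate fails (for example `sys.exit(0 if upgraded_ok else 1)`), which blocks the change. A check that only looked at the final answer would have let the upgraded agent through, since its answer text is unchanged.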

Quick check

Q1. Why are agents harder to evaluate than a single LLM call?
Q2. What's the difference between outcome and trajectory evaluation?
Q3. When is an agent eval set most valuable?

Next

You can’t evaluate what you can’t see — observability & tracing captures the trajectories these evals score, and cost control keeps the agent affordable.

Practice this in an interview

How do you evaluate an agentic system, and what is the difference between trajectory and outcome evaluation?

Outcome evaluation checks whether the agent's final result is correct, while trajectory evaluation inspects the intermediate steps, tool calls, and decisions along the way. You need both because an agent can reach the right answer through a flawed path or fail despite sound reasoning; trajectory metrics catch wrong tool use, redundant steps, and loops that outcome-only metrics miss.

What is an AI agent, and how does it differ from a single LLM call?

An agent is an LLM placed in a loop where it reasons, chooses and calls tools or actions, observes the results, and repeats until a goal is met, rather than producing one response and stopping. The key differences are autonomy, tool use, memory and state, and multi-step control flow driven by the model's own decisions.

How do function/tool calling and LLM agents work at a high level?

Tool calling extends the LLM's output space to include structured function invocations. The model emits a JSON object naming a tool and its arguments; the runtime executes the tool and feeds the result back as a new message. An agent is a loop that repeats this cycle — observe, think, act — until the task is complete or a stopping condition is met.

Explain the ReAct agent pattern and how it compares to Plan-and-Execute and Reflexion.

ReAct interleaves reasoning traces with actions step by step, deciding the next tool call based on the latest observation. Plan-and-Execute first drafts a full multi-step plan and then executes it, which is more efficient and predictable for complex tasks but less adaptive, while Reflexion adds a self-reflection step where the agent critiques past failures and retries with that feedback.
