How do you evaluate an agentic system, and what is the difference between trajectory and outcome evaluation?

Outcome evaluation checks whether the agent's final result is correct, while trajectory evaluation inspects the intermediate steps, tool calls, and decisions along the way. You need both because an agent can reach the right answer through a flawed path or fail despite sound reasoning; trajectory metrics catch wrong tool use, redundant steps, and loops that outcome-only metrics miss.

What is an AI agent, and how does it differ from a single LLM call?

An agent is an LLM placed in a loop where it reasons, chooses and calls tools or actions, observes the results, and repeats until a goal is met, rather than producing one response and stopping. The key differences are autonomy, tool use, memory and state, and multi-step control flow driven by the model's own decisions.

How do function/tool calling and LLM agents work at a high level?

Tool calling extends the LLM's output space to include structured function invocations. The model emits a JSON object naming a tool and its arguments; the runtime executes the tool and feeds the result back as a new message. An agent is a loop that repeats this cycle — observe, think, act — until the task is complete or a stopping condition is met.

Explain the ReAct agent pattern and how it compares to Plan-and-Execute and Reflexion.

ReAct interleaves reasoning traces with actions step by step, deciding the next tool call based on the latest observation. Plan-and-Execute first drafts a full multi-step plan and then executes it, which is more efficient and predictable for complex tasks but less adaptive, while Reflexion adds a self-reflection step where the agent critiques past failures and retries with that feedback.

Evaluating agents — Agentic AI

Evaluating agents

An agent has a loop, side effects, and non-deterministic intermediate states — so a single input/output check isn't enough. Evaluate the trajectory and the outcome, build eval sets, and catch regressions when you upgrade a framework or model.

8 min read Intermediate Agentic AI Lesson 58 of 71

A single LLM call has one input and one output — easy to test. An agent has a loop, takes actions with side effects, and passes through non-deterministic intermediate states. Two runs of the same task can take different paths and both be correct, or take the same path and one quietly does something dangerous. That’s why “it worked when I tried it” is not evaluation, and why eval is the discipline that most separates shipped agents from demos.

Outcome vs trajectory

You evaluate an agent on two axes:

Outcome (did it succeed?) — task success, final-answer correctness, the side effect that should have happened did. This is the what.
Trajectory (how did it get there?) — the sequence of steps: did it call the right tools with valid arguments, avoid unnecessary or dangerous actions, stay within a step budget, and not loop? This is the how.

You need both. An agent can reach a correct answer via a wasteful, expensive, or unsafe path — outcome looks fine, trajectory is a problem. And it can take a sensible path but get the final answer wrong.

Score the trajectory (per-step: right tools, valid args, no loops) and the outcome (task success, correctness).

Build an eval set

The mechanics mirror LLM evals, adapted for trajectories:

Collect tasks — real scenarios with a known success condition (and, where you can, the expected tool sequence or key actions).
Run the agent, capturing the full trajectory (every tool call + args + result), not just the final answer.
Score — deterministic checks where possible (did it call refund() with the right order id? did the record actually update?) plus an LLM-as-judge for open-ended quality, with the bias controls from the LLM-evals lesson.
Gate in CI — run the set on every prompt/model/framework change. This is how you catch the regression when, say, LangChain or the model version bumps.

The clearest way to feel the split is to score two runs that reach the same successful outcome by different paths — one clean, one wasteful and out-of-scope. Predict each verdict before the output:

allowed_tools = {"lookup_order", "get_policy", "issue_refund", "escalate"}
step_budget = 6

def report(label, trajectory, task_succeeded):
    valid_args    = all(s["args_valid"] for s in trajectory)
    in_scope      = all(s["tool"] in allowed_tools for s in trajectory)
    within_budget = len(trajectory) <= step_budget
    passed = task_succeeded and valid_args and in_scope and within_budget
    verdict = "PASS" if passed else "FAIL"
    print(f"{label}: succeeded={task_succeeded}  in_scope={in_scope}  "
          f"within_budget={within_budget}  steps={len(trajectory)}  -> {verdict}")

# Same outcome (the refund went through), two different trajectories:
clean = [
    {"tool": "lookup_order", "args_valid": True},
    {"tool": "get_policy",   "args_valid": True},
    {"tool": "issue_refund", "args_valid": True},
]
messy = [                                            # succeeds, but wanders:
    {"tool": "lookup_order", "args_valid": True},
    {"tool": "web_search",   "args_valid": True},    # not in the allowlist
    {"tool": "get_policy",   "args_valid": True},
    {"tool": "get_policy",   "args_valid": True},
    {"tool": "lookup_order", "args_valid": True},
    {"tool": "get_policy",   "args_valid": True},
    {"tool": "issue_refund", "args_valid": True},
]
report("clean", clean, task_succeeded=True)
report("messy", messy, task_succeeded=True)

clean: succeeded=True  in_scope=True  within_budget=True  steps=3  -> PASS
messy: succeeded=True  in_scope=False  within_budget=False  steps=7  -> FAIL

Both runs succeeded — the refund went through either way, so an outcome-only eval would pass both. Only the trajectory eval catches that the messy run wandered out of scope (web_search) and blew the step budget. That gap is the whole reason you score both axes.

In one breath

An agent has a loop, side effects, and non-deterministic intermediate states — “it worked when I tried it” is not evaluation.
Score two axes: outcome (did it succeed / is the answer right — the what) and trajectory (right tools, valid args, no loops or unsafe steps, within budget — the how); a correct answer can still come from a bad path, so you need both.
Build an eval set: collect tasks with known success conditions, run capturing the full trajectory, score (deterministic checks + LLM-as-judge for open-ended), then gate it in CI.
The biggest payoff is on upgrades — a framework/model/prompt bump can silently change behavior; an eval set turns the regression into a red CI check instead of a user report.
Tools (LangSmith, Langfuse, AgentOps) supply the dataset + trajectory scoring, and captured traces seed the eval set.

Quick check

0/3

Q1Why are agents harder to evaluate than a single LLM call?

Q2What's the difference between outcome and trajectory evaluation?

Q3When is an agent eval set most valuable?

You can’t evaluate what you can’t see — observability & tracing captures the trajectories these evals score, and cost control keeps the agent affordable.

Evaluating agents

What you'll learn

Before you start

Outcome vs trajectory

Build an eval set

In one breath

Quick check

Quick check

Next

Sign in to track your progress

Practice this in an interview

Related lessons

Explore further