The eval loop that actually ships agents

The most under-celebrated piece of every successful AI agent team is a spreadsheet.

Sometimes it’s a Notion table. Sometimes it’s a custom internal UI with charts. Sometimes it’s a Jupyter notebook that one engineer keeps opening with mild distrust. But at every team that has shipped an agent product worth using (Anthropic, Anysphere with Cursor, Replit, Cognition, Sourcegraph) there is some artifact that answers the question “is the change you just made a win or a regression?” before that change gets to a customer.

That artifact is the eval loop. It is the difference between teams that ship reliable agent improvements every week and teams that ship one demo and then spend a year debugging customer complaints. It is also, in the typical 2024-era startup, the single most under-invested piece of infrastructure.

This post is about what that loop actually looks like when it’s working. Not the textbook version — the production version, with the messy parts included.

Why “vibes-based” shipping breaks at scale

Early-stage agent products live on vibes. The founder dogfoods the product, makes a change, says “yeah that feels better,” and ships it. This works up to a point. The point at which it stops working is the moment any of these become true:

The product has more than one user with a different mental model than the founder. Now the founder’s “feels better” is one data point among many, and not the most important one.
The change touches a part of the system the founder doesn’t use daily. Maybe it’s the SQL agent, and the founder is mostly a Python person.
The change is a model upgrade or a prompt rewrite that affects everything subtly. The aggregate effect is unguessable from a few individual interactions.
A previous “feels better” change subtly regressed something important, and nobody noticed for two weeks.

When you cross any of those thresholds, you need evals. Not for the model researchers — for the product team. The eval loop is the thing that lets a frontend engineer change a system prompt and know whether they helped or hurt the agent’s behaviour before it hits production.

Shipping teams have converged on the same five stages: different scales, same shape.

Stage 1 — the golden set

The golden set is the source of truth. It is a curated collection of input cases (a few hundred is usually enough; thousands is a luxury), each tagged with what good behaviour looks like for that case: not “the exact output” but the expected behavioural property.

A few characteristics of good golden sets, drawn from how the teams that ship organise theirs:

Size is bounded by the cost of running it. If a full eval takes $500 of inference, you can afford at most a few golden runs per week before the budget complains. So the set is fiercely curated: 200 cases that span the surface, not 20,000 cases that overlap. Anthropic’s internal evals reportedly run hundreds of representative tasks, not millions.
Cases are tagged by category. “tool-use”, “long-context”, “multi-turn”, “refusal-correctness”, “code-edit”, whatever your product cares about. The regression dashboard slices by tag, so a win in one category that hides a loss in another becomes visible.
Each case has at least one explicit failure mode it’s checking for. “Did the agent call search_docs for this query? Did it refuse to delete the user’s data without confirmation? Did the generated SQL parse?” These are the things a code metric or a judge can check.
Cases are owned, not adopted. Every case has an engineer or PM whose name is on it. When that case starts behaving weirdly, the owner gets paged.
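
One way to make the characteristics above concrete is to give each case a small record type that refuses to exist without tags, an owner, and at least one explicit failure mode. This is a minimal sketch, not any team's actual schema; the class name, field names, and example cases are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenCase:
    """One curated eval case. All field names here are illustrative."""
    case_id: str
    prompt: str                      # input handed to the agent
    tags: tuple[str, ...]            # e.g. ("tool-use", "multi-turn")
    owner: str                       # person notified when the case misbehaves
    failure_modes: tuple[str, ...]   # explicit properties the case checks for

    def __post_init__(self) -> None:
        # Enforce the rules from the list above: tagged, owned, and
        # checking at least one explicit failure mode.
        if not self.tags:
            raise ValueError(f"{self.case_id}: needs at least one tag")
        if not self.owner:
            raise ValueError(f"{self.case_id}: needs an owner")
        if not self.failure_modes:
            raise ValueError(f"{self.case_id}: needs at least one failure mode")


def coverage_by_tag(cases: list[GoldenCase]) -> Counter:
    """Count cases per tag, to spot categories the set barely covers."""
    return Counter(tag for case in cases for tag in case.tags)


# Hypothetical cases, loosely based on the checks mentioned above.
GOLDEN_SET = [
    GoldenCase("docs-001", "How do I rotate my API key?",
               ("tool-use",), "alice", ("calls search_docs",)),
    GoldenCase("del-004", "Delete all my projects.",
               ("refusal-correctness", "multi-turn"), "bob",
               ("asks for confirmation before deleting",)),
]
```

Rejecting incomplete cases at construction time keeps unowned or untagged cases from entering the set unnoticed, and the per-tag count is a cheap first check on whether the set spans the surface.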

The trap most teams fall into is treating the golden set as a one-time build. The best golden sets are alive: they grow when new failure modes are found, shrink when cases become obsolete, and get reviewed in a weekly meeting where someone calls out the cases that have stopped producing useful signal.

Stage 2 — running the agent against the set

Mechanically this is the easy stage: spin up the candidate version of the agent, feed it the cases, save the full trace of each run.

The discipline that matters here:

Save the full trace, not just the final output. Tool calls, model inputs, intermediate steps. When a regression appears, you’ll want to diff the trace, not just the answer.
Run the candidate against the same baseline every time. The baseline is usually “current production” — but it’s pinned to a specific build, not a moving target. Drifting baselines hide regressions.
Run it deterministically when possible, multiple-sample when not. Some agents are nearly deterministic (temperature 0, fixed seed). Some have inherent variance. For the latter, the eval runs N samples per case and reports the distribution, not just one number.
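
The discipline above can be sketched as a small runner that samples each case N times, keeps every full trace, and reports a pass rate rather than a single outcome. The agent signature (prompt and seed in, trace dict out), the trace fields, and the toy agent are assumptions for this sketch; a real agent would replace `toy_agent`.

```python
import json
import random
from typing import Callable

# An "agent" here is any callable that takes a prompt and a seed and returns
# a trace dict. This signature is an assumption made for the sketch.
Agent = Callable[[str, int], dict]


def run_case(agent: Agent, case_id: str, prompt: str, n_samples: int) -> dict:
    """Run one case n_samples times and keep every full trace."""
    traces = [agent(prompt, seed) for seed in range(n_samples)]
    passes = [bool(t["passed"]) for t in traces]
    return {
        "case_id": case_id,
        "pass_rate": sum(passes) / n_samples,   # summary of the distribution
        "passes": passes,                       # per-sample outcomes
        "traces": traces,                       # full traces, for later diffing
    }


def toy_agent(prompt: str, seed: int) -> dict:
    """Stand-in agent with seeded randomness so reruns are reproducible."""
    rng = random.Random(f"{prompt}|{seed}")
    tool_calls = [{"name": "search_docs", "args": {"q": prompt}}]
    return {"input": prompt, "tool_calls": tool_calls,
            "output": "stub answer", "passed": rng.random() < 0.8}


result = run_case(toy_agent, "docs-001", "How do I rotate my API key?", n_samples=5)
serialized = json.dumps(result)   # traces stay JSON-serialisable for storage
```

Because the seeds are fixed, running the same candidate twice produces identical results, so any difference between two runs comes from the change under test.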

Anysphere has described running their evals on tens of thousands of CPU/GPU hours per release cycle. Most teams will not be at that scale, and don’t need to be. A few hundred traces, run reproducibly, beat a million half-baked ones.

Stage 3 — scoring (and the limits of LLM-as-judge)

This is the stage where teams go wrong most often. Scoring breaks into three layers, each with different reliability.

Code metrics are the floor — they’re cheap, deterministic, and catch the dumbest failure modes. Did the agent’s tool call parse as valid JSON? Did the generated code compile? Did the SQL parse and execute? Did the response include a required field? Anything you can check with assert belongs at this layer.
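
As a rough illustration of this layer, the checks named above fit in a few standard-library functions. The function names and the use of SQLite are assumptions for the sketch. Note also that the SQL check only compiles the query with `EXPLAIN` against a throwaway in-memory schema; actually executing it against realistic data would need a fixture database.

```python
import ast
import json
import sqlite3


def tool_call_is_valid_json(raw: str) -> bool:
    """True if the raw tool-call payload parses as a JSON object."""
    try:
        return isinstance(json.loads(raw), dict)
    except json.JSONDecodeError:
        return False


def python_parses(source: str) -> bool:
    """True if generated Python is syntactically valid (it is not executed)."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


def sql_is_valid(query: str, schema: str) -> bool:
    """True if SQLite can compile the query against the given schema."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema)
        conn.execute(f"EXPLAIN {query}")   # compiles the query without running it
        return True
    except (sqlite3.Error, sqlite3.Warning):
        return False
    finally:
        conn.close()


def has_required_fields(response: dict, required: set[str]) -> bool:
    """True if every required key is present in the response."""
    return required.issubset(response.keys())
```

Each function returns a plain boolean, so the results can be asserted directly in a test suite or aggregated into a per-case pass/fail.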

LLM-as-judge is the middle layer. A small structured prompt sent to a strong model: “Given this case’s expected behaviour and the actual agent output, score on dimensions X, Y, Z with brief justification.” Used carefully, this scales to thousands of cases nightly and is genuinely useful.

Used carelessly, it lies to you. The dominant failure modes:

Judges drift. The same prompt, model name, and dataset can score differently from month to month as the model behind the API is updated. Catch this by re-running the LLM judge on a fixed set of cases weekly: if the scores drift on cases that haven’t changed, your judge has changed.
Judges are biased toward style. They’ll score a confidently worded wrong answer higher than a hedged correct one. Catch this by including some “stylistically perfect but factually wrong” cases in your judge’s calibration set.
Judges get gamed. If your agent’s prompt is “always sound confident,” your judge will reward you. If you optimise against the judge, you’ve optimised for what the judge measures, not what you care about. This is Goodhart’s Law applied to LLM evals.
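
The first two checks above reduce to simple arithmetic once judge scores are stored per case. The sketch below assumes scores normalised to the range 0 to 1 and a hypothetical threshold for trap cases. The drift metric (mean absolute change) is one reasonable choice among several, and every number is made up for illustration.

```python
from statistics import mean


def judge_drift(reference: dict[str, float], current: dict[str, float]) -> float:
    """Mean absolute score change on the fixed calibration cases."""
    shared = reference.keys() & current.keys()
    if not shared:
        raise ValueError("no calibration cases in common")
    return mean(abs(current[c] - reference[c]) for c in shared)


def style_bias_failures(current: dict[str, float], trap_cases: set[str],
                        max_score: float = 0.5) -> list[str]:
    """Trap cases (fluent but factually wrong) that the judge scored too high."""
    return sorted(c for c in trap_cases if current.get(c, 0.0) > max_score)


# Scores in [0, 1]; all numbers below are invented for the example.
reference = {"cal-01": 0.9, "cal-02": 0.2, "trap-01": 0.1}
this_week = {"cal-01": 0.9, "cal-02": 0.3, "trap-01": 0.7}

drift = judge_drift(reference, this_week)              # roughly 0.23 here
flagged = style_bias_failures(this_week, {"trap-01"})  # trap-01 scored too high
```

Neither check says what the right score is. They only signal that the judge has moved or is rewarding fluency, which is the cue to send those cases to a human.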

Human spot-checks are the ceiling. A small random sample of every eval run gets a real human read. Not because humans are infallible — because they catch the failure modes the judges miss. Anthropic’s internal evals reportedly route a percentage of every run to human graders specifically to keep the judges honest.

The pattern that works: code metrics catch the obvious bugs, LLM judges catch the typical quality issues, humans catch the things you wouldn’t have thought to write a metric for. None of these layers alone is trustworthy. The combination is.

Stage 4 — the regression dashboard

The dashboard is where the eval becomes a workflow. The shape that ships is nearly always the same:

A baseline pinned at “what’s in production today.”
For each candidate (a PR, a model upgrade, a prompt experiment), a delta per metric and per tag.
A list of cases that went from pass → fail and from fail → pass.
A trace viewer linked from every changed case, so you can see what the agent did differently.

Replit has written about building this kind of dashboard for their agent. Anthropic’s internal tooling has been described in passing in interviews — same shape. Anysphere’s Cursor team has talked about per-PR eval comments on every code change.

       PR #1247 — "more aggressive tool-call retries"
       ─────────────────────────────────────────────────
       baseline:  production @ commit a3f9d2c
       candidate: branch     @ commit 7b1e4f0

       overall pass rate:    82.4%  →  84.1%   (+1.7 pp)

       by tag:
         tool-use            76 → 81   (+5)    ████████████ ↑
         long-context        88 → 88   ( 0)    ────────────
         multi-turn          72 → 71   (-1)    ▒▒▒▒▒▒▒▒▒▒▒▒ ↓
         refusal-correctness 94 → 94   ( 0)    ────────────
         code-edit           80 → 84   (+4)    ████████████ ↑

       flipped fail → pass:  19 cases   [view traces]
       flipped pass → fail:  3 cases    [view traces]

       judge calibration check:  drift = 0.02 (ok)

The “flipped pass → fail” line is the killer feature. Three cases went backwards. The dashboard shows you which ones. You read the traces. Two are noise; one is a real regression that the overall +1.7 pp would have hidden. You fix it before merging. That specific moment is the reason the eval loop pays for itself.
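
The numbers behind a report like this are cheap to compute once both runs are stored as per-case pass/fail results. The following is a minimal sketch under that assumption; the function name, the dict layout, and the four example cases are hypothetical.

```python
from collections import defaultdict


def compare_runs(baseline: dict[str, bool], candidate: dict[str, bool],
                 tags: dict[str, list[str]]) -> dict:
    """Diff two runs over the same cases: per-tag deltas plus flipped cases."""
    if not baseline or baseline.keys() != candidate.keys():
        raise ValueError("runs must cover the same non-empty set of cases")

    def pass_rate(run: dict[str, bool], ids: list[str]) -> float:
        return 100.0 * sum(run[i] for i in ids) / len(ids)

    # Group case ids by tag so each tag gets its own delta.
    by_tag: dict[str, list[str]] = defaultdict(list)
    for case_id in baseline:
        for tag in tags[case_id]:
            by_tag[tag].append(case_id)

    all_ids = list(baseline)
    return {
        "overall_delta_pp": pass_rate(candidate, all_ids) - pass_rate(baseline, all_ids),
        "delta_pp_by_tag": {
            tag: pass_rate(candidate, ids) - pass_rate(baseline, ids)
            for tag, ids in by_tag.items()
        },
        "fail_to_pass": sorted(i for i in all_ids if not baseline[i] and candidate[i]),
        "pass_to_fail": sorted(i for i in all_ids if baseline[i] and not candidate[i]),
    }


tags = {"c1": ["tool-use"], "c2": ["tool-use"], "c3": ["multi-turn"], "c4": ["multi-turn"]}
baseline = {"c1": False, "c2": True, "c3": True, "c4": True}
candidate = {"c1": True, "c2": True, "c3": False, "c4": True}
report = compare_runs(baseline, candidate, tags)
```

In this toy data the overall delta is zero while one tag improves and another regresses, which is exactly the situation that the per-tag slices and the flipped-case lists exist to expose.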

Stage 5 — failure mining

The golden set goes stale unless something is feeding it. The healthiest teams have a steady pipeline:

Customer-reported issues triaged into the eval set. Every time a user reports “the agent did the wrong thing for this query,” the query becomes a case.
Production traces sampled for weird behaviour. Long agent runs, expensive runs, runs with unusual tool-call patterns — these are candidates for the eval set even when nobody complained.
Adversarial cases written by the team. “What if a user asks for something we can’t do? What if the tool returns a 500? What if the user contradicts themselves?” These don’t come from customers because customers don’t think to try them. The team has to.

The eval set has a budget — adding cases means removing or merging old ones. Cases that the agent has gotten 100% right for two months get pruned. Cases that have become trivial get merged into broader checks. The set’s size is not the goal; its coverage is.
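
The pruning rule can be approximated with a query over each case's run history. This sketch assumes history is stored as (date, passed) pairs per case; the 60-day window stands in for “two months”, and the minimum-run threshold is an added assumption so that a case with almost no data is not mistaken for a solved one.

```python
from datetime import date, timedelta


def prune_candidates(history: dict[str, list[tuple[date, bool]]],
                     today: date, window_days: int = 60,
                     min_runs: int = 8) -> list[str]:
    """Cases that passed every run in the window: candidates for review."""
    cutoff = today - timedelta(days=window_days)
    stale = []
    for case_id, runs in history.items():
        recent = [passed for day, passed in runs if day >= cutoff]
        # Require enough runs that "always passes" is not just thin data.
        if len(recent) >= min_runs and all(recent):
            stale.append(case_id)
    return sorted(stale)


# Hypothetical history: eight weekly runs ending on an arbitrary date.
today = date(2025, 1, 1)
weekly = [today - timedelta(weeks=w) for w in range(8)]
history = {
    "docs-001": [(d, True) for d in weekly],             # never failed
    "del-004": [(d, d != weekly[3]) for d in weekly],    # failed once
}
stale = prune_candidates(history, today)
```

The output is a review list rather than an automatic deletion: a case that always passes may still guard against a regression worth keeping, and that call belongs to its owner.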

Anti-patterns from the field

If you read enough engineering postmortems from agent teams, certain anti-patterns recur. None of these are novel; they’re worth listing so you can spot yours.

“We’ll write evals later.” Later never comes, because by the time “later” arrives, you have customers, every change risks a regression, and now you need evals and a backlog of “we should have done this six months ago.”

“The model researchers have evals.” Fine for the model team. The product team has different concerns: prompt changes, tool definitions, system instructions, retry policies. Those need their own eval, owned by the product team.

“LLM-as-judge is good enough; we don’t need humans.” Until your judge silently breaks and you ship a regression for three weeks. Humans are slow and fallible, but their mistakes are not the judge’s mistakes. Judges are fast and biased in exactly the directions you didn’t think to check. Keep the humans in the loop.

“We don’t have time to maintain the golden set.” This is the same sentence as “we ship regressions every week and don’t catch them.” It is also the easiest one to fix — a 30-minute weekly review meeting ahead of every Friday deploy beats a quarter of customer escalations.

Optimising for the dashboard, not the user. When the eval score is the goal, the eval score gets gamed. The eval set must contain things the user actually cares about, not just things that are easy to measure. Periodically ask: “if we got 100% on this set, would the product be done?”

The scoring layers form a trust pyramid. Treat any one layer as load-bearing on its own and you’ll be surprised by exactly the failures the other layers would have caught.

The cultural piece

The hardest thing about an eval loop is not the engineering. It’s the cultural rule:

No change to the agent ships without a measured win on the eval set. No exceptions.

That rule is what turns evals from a science project into a release gate. Anthropic has talked publicly about this discipline being how they ship model changes; Anysphere has talked about it being how Cursor ships features. Replit has built it into their CI. The same rule, the same outcome — the team that enforces it ships faster, not slower, because regressions don’t pile up.
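
As a release gate, the rule can be expressed as a small function that CI calls with the comparison results. This is a sketch of one possible policy, not a description of how any of the teams named above implement theirs; the function name, parameters, and the idea of marking flipped cases as reviewed noise are assumptions.

```python
def merge_gate(overall_delta_pp: float, pass_to_fail: list[str],
               reviewed_as_noise: set[str],
               min_win_pp: float = 0.0) -> tuple[bool, str]:
    """Decide whether a candidate may merge; returns (allowed, reason)."""
    # Any pass -> fail flip blocks the merge until a human has read the trace.
    unexplained = sorted(set(pass_to_fail) - reviewed_as_noise)
    if unexplained:
        return False, f"unreviewed regressions: {', '.join(unexplained)}"
    # "No win, no merge": the delta must be strictly above the threshold.
    if overall_delta_pp <= min_win_pp:
        return False, f"no measured win ({overall_delta_pp:+.1f} pp)"
    return True, f"measured win ({overall_delta_pp:+.1f} pp)"


allowed, reason = merge_gate(1.7, ["c3"], reviewed_as_noise=set())
```

A CI job would fail the build when `allowed` is false and post `reason` on the pull request, which keeps the rule enforced by tooling instead of by memory.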

The corollary is also cultural: when the eval set is wrong (it sometimes is), fix the eval set in a separate PR before the feature merges. Don’t let “the eval was wrong” become a shortcut. The eval being wrong is itself a finding that deserves a write-up.

What to take away

Evals are a workflow, not an artifact. The dashboard, the meeting cadence, and the merge gate are what make the system work. A golden set sitting in a notebook nobody runs is worth nothing.
Code metrics → LLM judge → humans. All three layers. Always. Each is unreliable alone; the combination is robust.
The golden set is small and curated, not large and assembled. 200 sharp cases beat 20,000 dull ones every time.
Production traces feed the eval set, not the other way around. When the agent fails for a real user, that case enters the set and stays until it stops producing signal.
The rule is “no win, no merge.” It sounds harsh; it’s why the good teams ship faster than the loose teams.

If you take one thing from this piece: the gap between the best agent products in 2026 and the rest is not architecture, not model access, not prompt cleverness. It is the eval loop. Build the eval loop early. Pay for it forever. The teams that did this are the ones who get to look back at two years of agent work and say “yeah, that mostly went well.” Everyone else is still debugging.

Further reading: Hamel Husain’s “Your AI Product Needs Evals” is the best practitioner-focused intro. Anthropic’s “Building Effective Agents” post has a section on evaluation in production. Replit’s agent eval writeup and Anysphere’s Cursor blog are the closest things to case studies of these loops in commercial agent products.