Evals that actually work: beyond the LLM-as-judge trap

There is a recurring scene in 2026 AI engineering. A team ships a model update. The internal eval dashboard goes green — all the LLM-as-judge scores are flat or improved. Users immediately complain that the product feels worse. The post-mortem reveals a subtle regression in a behaviour the judge model was never asked about, or a regression in how the model was answering that the judge couldn’t see because the judge was a sibling of the model being judged.

This scene happens often enough that it has become a kind of dark joke in the field. The eval pipeline gave the green light. The users gave the red light. The eval pipeline was wrong. Now what?

The “now what” is the subject of this post. There are eval pipelines that actually catch regressions before users do. They exist. The teams who run them — Anthropic on its own model releases, Anysphere on Cursor’s agent updates, the in-house ML eval groups at the big labs — do not talk about them in vendor blog posts because there is nothing to sell. The pipelines are mostly human curation, tight golden sets, hard ship/rollback rules, and a deep suspicion of the LLM-as-judge shortcut.

The trap, stated cleanly

The LLM-as-judge pattern is seductive because it is cheap. You write a prompt that asks a strong model “given this question and this answer, rate the answer on a scale of 1–5 for helpfulness, accuracy, and tone.” You run it across 500 examples. You get a number. The number is reproducible. The dashboard goes green.

The problem, well documented by now in research from Anthropic and several academic groups, is that LLM judges are systematically biased toward outputs that resemble their own. Judge a Claude output with Claude and the score tends to come out inflated. Judge a GPT output with Claude and the relative ordering can flip depending on stylistic surface features. Judges anchor on length, on confident tone, on the presence of headers and bullet points. They miss subtle factual errors when the surrounding prose is well-organised. They reward sycophantic agreement with the question’s framing.

What a judge rewards and what users actually notice rarely overlap. An eval pipeline that asks the judge to score “helpfulness” is sampling the former and assuming it predicts the latter. It usually doesn’t.

This is not a reason to never use LLM-as-judge. It is a reason to treat it as one weak signal among several, never as the gate that ships or rolls back a release.

What real eval pipelines actually look like

Cross-referencing what is publicly known about eval practice at Anthropic (from the Claude 3 model card and subsequent releases), at Anysphere (from Cursor’s release notes and engineering posts), and at the more careful in-house teams I’ve talked to, three components show up consistently. None of them uses LLM-as-judge as the primary gate.

Component 1: Hand-curated golden sets, organised by capability. Not one big golden set — many small focused ones. “Can the agent correctly extract structured fields from invoices.” “Does the model refuse this category of harmful request.” “Does the codegen agent fix the off-by-one bug in this specific function class.” Each set is 50–500 examples, written by humans who know the product, with the expected output specified in detail. The golden answer is not “good” or “bad” — it is the actual string or structure the system is expected to produce.
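As a rough illustration of "many small focused sets", here is a minimal sketch of how golden examples might be stored and grouped. The JSONL layout, the field names, and the `GoldenExample` and `load_golden_sets` names are all assumptions made for this example, not a format any of the teams above is known to use.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenExample:
    """One hand-written example: an input plus the exact expected output."""
    example_id: str
    capability: str  # illustrative label, e.g. "invoice_extraction"
    prompt: str
    expected: str    # the actual string the system should produce


def load_golden_sets(jsonl_text: str) -> dict[str, list[GoldenExample]]:
    """Parse JSONL text and group examples into one set per capability."""
    sets: dict[str, list[GoldenExample]] = {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines between records
        example = GoldenExample(**json.loads(line))
        sets.setdefault(example.capability, []).append(example)
    return sets


# Hypothetical records; real sets would hold 50-500 examples each.
SAMPLE = """
{"example_id": "inv-001", "capability": "invoice_extraction", "prompt": "Invoice 881, total due $1,240.00", "expected": "1240.00"}
{"example_id": "ref-001", "capability": "refusal", "prompt": "<harmful request placeholder>", "expected": "REFUSE"}
"""

golden = load_golden_sets(SAMPLE)
print({name: len(examples) for name, examples in golden.items()})
```

Keeping the capability label on every record makes it cheap to report one pass rate per set rather than a single blended number.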

Component 2: Hard pass/fail metrics tied to ship/rollback. The eval pipeline does not produce a “quality score.” It produces a table: “of the 500 golden examples in the codegen category, the new candidate matches the expected output on 423 (84.6%). The previous version matches 441 (88.2%). This is a regression. Do not ship without explicit override.” The metric is the percentage of exact matches against the golden output, or whatever the domain-appropriate strict comparison is. There is no judge in the loop.
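The table described above reduces to a few lines of arithmetic. The sketch below assumes whitespace-trimmed string equality counts as a match (a stricter or looser comparison may suit a given domain), and the function names are hypothetical.

```python
def pass_rate(outputs: list[str], expected: list[str]) -> float:
    """Fraction of outputs that match the golden answer exactly."""
    if not expected or len(outputs) != len(expected):
        raise ValueError("outputs and expected must be non-empty and aligned")
    # Assumption: surrounding whitespace is ignored, everything else must match.
    matches = sum(o.strip() == e.strip() for o, e in zip(outputs, expected))
    return matches / len(expected)


def regression_report(category: str, candidate_rate: float,
                      baseline_rate: float, tolerance: float = 0.0) -> str:
    """One table row: both pass rates, the delta, and a hard verdict."""
    delta = candidate_rate - baseline_rate
    if delta < -tolerance:
        verdict = "REGRESSION: do not ship without explicit override"
    else:
        verdict = "OK"
    return (f"{category}: candidate {candidate_rate:.1%}, "
            f"baseline {baseline_rate:.1%}, delta {delta:+.1%} -> {verdict}")


# The codegen numbers from the text: 423 vs 441 matches out of 500.
print(regression_report("codegen", 423 / 500, 441 / 500))
```

No judge appears anywhere in this path: the only inputs are model outputs and the hand-written expected strings.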

Component 3: Human raters for ambiguous categories. For categories where exact-match is too brittle (open-ended explanation, creative writing, tone), the pipeline routes a random sample to a panel of human raters who do side-by-side comparison (“which answer do you prefer, A or B?”) without knowing which model produced which. Aggregate human preference is far more reliable than LLM-as-judge at catching tone and personality regressions. The cost is real (a few cents to a few dollars per rating depending on category), but it buys a signal the judge cannot provide.
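The blinding step is easy to get wrong by always showing the candidate on the same side. A minimal sketch of randomised side assignment and vote aggregation follows. The function names are hypothetical, and the exact sign test at the end is one common way to ask whether a preference is distinguishable from a coin flip; the text above does not prescribe any particular statistic.

```python
import random
from math import comb


def blind_pair(example_id: str, baseline_out: str, candidate_out: str,
               rng: random.Random) -> tuple[dict, str]:
    """Randomise which side shows the candidate; return (rater task, key)."""
    candidate_is_a = rng.random() < 0.5
    a, b = ((candidate_out, baseline_out) if candidate_is_a
            else (baseline_out, candidate_out))
    task = {"example_id": example_id, "A": a, "B": b}  # what the rater sees
    key = "A" if candidate_is_a else "B"               # stored away from raters
    return task, key


def candidate_win_rate(votes: dict[str, str], keys: dict[str, str]) -> tuple[int, int]:
    """votes maps example_id to 'A', 'B' or 'tie'. Returns (wins, decided)."""
    decided = [(vote, keys[ex_id]) for ex_id, vote in votes.items() if vote != "tie"]
    wins = sum(vote == key for vote, key in decided)
    return wins, len(decided)


def sign_test_p(wins: int, decided: int) -> float:
    """Two-sided exact sign test against a 50/50 null hypothesis."""
    if decided == 0:
        return 1.0
    k = min(wins, decided - wins)
    tail = sum(comb(decided, i) for i in range(k + 1)) / 2 ** decided
    return min(1.0, 2 * tail)


rng = random.Random(7)  # fixed seed so the assignment is auditable later
task, key = blind_pair("ex-1", "old answer", "new answer", rng)
wins, decided = candidate_win_rate({"ex-1": key}, {"ex-1": key})
print(wins, decided, sign_test_p(wins, decided))
```

With roughly 50 comparisons per release, small preference shifts will not reach significance on a single release, which is one reason to keep the same panel and track the win rate across releases.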

The shape: three independent measurement streams converge on one explicit decision. None of them is an LLM rating an LLM. The decision rule is written down before the release, not negotiated after the numbers come in.

The discipline that ties this together is deciding the ship/rollback rule before running the eval. The rule looks like “ship if no golden set regresses by more than 1%, latency p95 is within 10%, and human preference is at least neutral; otherwise roll back or hold.” Writing the rule in advance is what prevents post-hoc rationalisation when the numbers are inconvenient.
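One way to make the rule hard to renegotiate is to commit it as code before the eval runs. The sketch below encodes the example rule from the paragraph above. It reads "1%" as one percentage point of pass rate and "at least neutral" as a candidate win rate of 0.5 or more; both readings, and all names, are assumptions.

```python
from dataclasses import dataclass

EPS = 1e-9  # guards against float noise exactly at a threshold


@dataclass(frozen=True)
class ShipRule:
    max_golden_drop: float = 0.01       # per-set drop allowed, in pass-rate points
    max_p95_increase: float = 0.10      # candidate p95 may be at most 10% slower
    min_human_win_rate: float = 0.50    # blind preference must be at least neutral


def decide(rule: ShipRule,
           golden_baseline: dict[str, float],
           golden_candidate: dict[str, float],
           p95_baseline_ms: float,
           p95_candidate_ms: float,
           human_win_rate: float) -> tuple[str, list[str]]:
    """Return ('ship' or 'hold', reasons). Any single violation holds the release."""
    reasons = []
    for name, base in golden_baseline.items():
        cand = golden_candidate.get(name)
        if cand is None:
            reasons.append(f"{name}: no candidate result")
        elif base - cand > rule.max_golden_drop + EPS:
            reasons.append(f"{name}: pass rate fell {base:.1%} -> {cand:.1%}")
    if p95_candidate_ms > p95_baseline_ms * (1 + rule.max_p95_increase) + EPS:
        reasons.append(f"latency p95 {p95_baseline_ms:.0f}ms -> {p95_candidate_ms:.0f}ms")
    if human_win_rate < rule.min_human_win_rate - EPS:
        reasons.append(f"human preference {human_win_rate:.1%} is below neutral")
    return ("ship" if not reasons else "hold", reasons)


decision, why = decide(ShipRule(), {"codegen": 0.882}, {"codegen": 0.846},
                       p95_baseline_ms=1000, p95_candidate_ms=1050,
                       human_win_rate=0.52)
print(decision, why)
```

Because the thresholds live in a reviewed file rather than in someone's head, an override becomes a visible change to the rule instead of a quiet reinterpretation of the numbers.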

The specific failure modes of LLM-as-judge, and what to do instead

It is worth being concrete about which failures LLM-as-judge misses, because the alternative depends on the failure type.

Factuality regressions. The judge is not a fact-checker. It will rate a confidently-stated wrong answer higher than an honestly-stated “I don’t know.” What to do instead: a golden set with the specific correct answer for each example, scored by exact-match or by a deterministic extractor (regex, named-entity match). For domains where the ground truth is checkable (math, code execution, structured data extraction), execute the result and compare to expected output.

Tone and personality drift. The judge will not notice that the model has become 12% more hedgy and 8% more sycophantic. Those changes show up as small score deltas hidden in the noise. What to do instead: human side-by-side preference ratings, ideally with the same panel over multiple releases so you can detect drift longitudinally. The Cursor team publicly discusses this in their changelog whenever a model backend swaps; the change in feel is what users notice first.

Format regressions. The judge rarely penalises outputs where the markdown is broken, the code blocks are missing language tags, or the JSON has trailing commas. The judge sees these as content, not format. What to do instead: deterministic format validators run on every eval example. A markdown linter, a JSON parser, a syntax check on each code block: these catch format regressions in milliseconds per example, and they catch them every time.
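A minimal sketch of such validators, using only the Python standard library, might look like the following. It checks three of the failures named above: invalid JSON, unbalanced or untagged code fences, and Python blocks that do not parse. The function names are hypothetical, and a production pipeline would more likely lean on an established markdown linter than on this regex.

```python
import ast
import json
import re

TICKS = "`" * 3  # built indirectly so this sketch can sit inside a fenced block
FENCE = re.compile(rf"^{TICKS}([^\n`]*)\n(.*?)^{TICKS}[ \t]*$",
                   re.DOTALL | re.MULTILINE)


def check_json(text: str) -> list[str]:
    """Strict parse: trailing commas and similar slips are reported."""
    try:
        json.loads(text)
        return []
    except json.JSONDecodeError as err:
        return [f"invalid JSON: {err.msg}"]


def check_markdown_fences(markdown: str) -> list[str]:
    """Flag unbalanced fences, missing language tags, unparseable Python."""
    problems = []
    fence_lines = sum(1 for line in markdown.splitlines() if line.startswith(TICKS))
    if fence_lines % 2:
        problems.append("unbalanced code fence")
    for lang, body in FENCE.findall(markdown):
        lang = lang.strip()
        if not lang:
            problems.append("code block missing language tag")
        elif lang == "python":
            try:
                ast.parse(body)
            except SyntaxError as err:
                problems.append(f"python block does not parse: {err.msg}")
    return problems


good = f"Here you go:\n{TICKS}python\nprint('hi')\n{TICKS}\n"
untagged = f"{TICKS}\nx = 1\n{TICKS}\n"
print(check_markdown_fences(good), check_markdown_fences(untagged))
print(check_json('{"a": 1,}'))
```

Each check returns a list of problems rather than a boolean, so the eval report can say which format rule regressed and not only that something did.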

Loss of capability. A new model might be slightly better on average and dramatically worse on the specific capability your users rely on most. The judge sees the average; users see the specific. What to do instead: segment your eval set by capability and your users’ usage distribution. If 20% of user requests are “rewrite my code in Python,” that capability is 20% of your eval weight, not 1%. The eval set must reflect the production traffic mix.
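The effect of a mismatched eval mix is easy to show with arithmetic. In the sketch below all pass rates and traffic shares are invented for illustration: the candidate is 20 points worse on the capability that makes up 20% of traffic and 2 points better on everything else.

```python
def weighted_pass_rate(pass_rates: dict[str, float],
                       traffic_share: dict[str, float]) -> float:
    """Average per-capability pass rates, weighted by share of traffic."""
    missing = set(traffic_share) - set(pass_rates)
    if missing:
        raise ValueError(f"no eval results for: {sorted(missing)}")
    total = sum(traffic_share.values())
    return sum(pass_rates[c] * share for c, share in traffic_share.items()) / total


# Hypothetical numbers, not measurements.
baseline = {"rewrite_in_python": 0.90, "everything_else": 0.80}
candidate = {"rewrite_in_python": 0.70, "everything_else": 0.82}

eval_mix = {"rewrite_in_python": 0.01, "everything_else": 0.99}        # skewed eval set
production_mix = {"rewrite_in_python": 0.20, "everything_else": 0.80}  # real traffic

for label, mix in [("eval mix", eval_mix), ("production mix", production_mix)]:
    b = weighted_pass_rate(baseline, mix)
    c = weighted_pass_rate(candidate, mix)
    print(f"{label}: baseline {b:.1%}, candidate {c:.1%}")
```

Under the skewed mix the candidate looks like an improvement; under the production mix it is a regression. Even the correctly weighted average can hide a collapse in one capability, so the per-capability pass rates still need their own thresholds.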

Adversarial robustness. Judges are not red-teamers. They will not actively try to find edge cases. What to do instead: a separate red-team eval set, refreshed quarterly, that tries hard to break the system in known failure modes such as prompt injection, jailbreaks, and role confusion. The attack success rate on this set should trend down from release to release when the team is doing its job.
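A red-team set can be scored deterministically too. One common trick, used here as an assumption and not something the text above describes, is to plant a canary string in the system prompt and count an attack as successful if the string leaks into the output. The canary value and function names are hypothetical.

```python
from collections import defaultdict

CANARY = "ZX-CANARY-7731"  # hypothetical secret planted in the system prompt


def attack_succeeded(model_output: str) -> bool:
    """Deterministic success check: did the planted secret leak?"""
    return CANARY in model_output


def attack_success_rates(results) -> dict[str, float]:
    """results is an iterable of (failure_mode, model_output) pairs."""
    totals: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    for mode, output in results:
        totals[mode] += 1
        hits[mode] += attack_succeeded(output)
    return {mode: hits[mode] / totals[mode] for mode in totals}


results = [
    ("prompt_injection", "Sure, the secret is ZX-CANARY-7731"),
    ("prompt_injection", "I can't share that."),
    ("jailbreak", "No."),
]
print(attack_success_rates(results))
```

A canary only covers leak-style attacks; other failure modes need their own deterministic success criteria written alongside each red-team example.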

A model the rest of us can borrow

The pattern is reproducible at smaller scale. The minimum viable eval pipeline for a serious production agent looks like:

One golden set per critical capability, 50–200 hand-written examples each. Bonus points for grounding the examples in real production transcripts (suitably anonymised).
Exact-match or domain-appropriate strict comparison as the primary metric on each golden set.
A budget regression check on latency and cost. Trivial to measure; teams skip it and then ship 30%-slower releases without noticing.
A small human-rater panel — three to five raters, drawn from the team, doing side-by-side blind comparisons of ~50 examples per release on the ambiguous categories.
One LLM-as-judge run, treated as a tiebreaker and noise check only. Useful for sanity-checking the magnitude of the change. Never the decision gate.
A written ship rule that names which signals matter and what thresholds trigger rollback. Reviewed with the team before the release, not after.

The whole pipeline runs in CI on every model or prompt change. The decision is made by the rule, not by negotiation. The overhead per release is hours, not weeks, after the golden sets are built.
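The budget regression check from the list above is the cheapest piece to automate. The sketch below computes p95 latency with the nearest-rank method and compares candidate against baseline. The 10% allowance mirrors the example ship rule earlier in this post; the function names and sample data are assumptions.

```python
def p95(samples: list[float]) -> float:
    """95th percentile by the nearest-rank method."""
    if not samples:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def within_latency_budget(baseline_ms: list[float], candidate_ms: list[float],
                          max_increase: float = 0.10) -> bool:
    """True if candidate p95 is at most max_increase above baseline p95."""
    return p95(candidate_ms) <= p95(baseline_ms) * (1 + max_increase)


# Synthetic samples: the candidate is uniformly 30% slower.
baseline_ms = [float(x) for x in range(1, 101)]
candidate_ms = [x * 1.3 for x in baseline_ms]

print(p95(baseline_ms), p95(candidate_ms))
print(within_latency_budget(baseline_ms, candidate_ms))
```

The same shape works for cost per request: swap the latency samples for per-call token or dollar figures and keep the threshold in the written ship rule.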

Why most teams don’t do this

The honest answer: the work of building the golden sets is tedious, and the LLM-as-judge shortcut feels productive in a way that hand-curating 500 examples does not. Teams choose the easier path, get a green dashboard, ship a regression, blame “AI being unpredictable,” and continue.

The teams that have been around long enough to ship multiple generations of model updates have learned the bitter lesson. The golden sets are an asset. They get reused across model releases, across prompt iterations, across years. The cost of building them amortises toward zero. The cost of not building them is a steady trickle of shipped regressions that erodes user trust faster than any feature roadmap can rebuild it.

What to take away

LLM-as-judge is not the eval pipeline. It is one weak, noisy signal that should never be the gate that ships or rolls back a release.
Hand-curated golden sets with strict comparisons are the asset. Build them per capability, weight them by your actual production traffic mix, and reuse them across releases.
Human raters catch what nothing else catches — tone, personality, the gestalt of the output. They are expensive and worth it for the categories where exact-match fails.
Write the ship rule before the eval runs. This single habit prevents most post-hoc rationalisation.
The boring discipline compounds. Teams that build the pipeline once, treat the golden sets as a long-lived asset, and trust the rule over the vibe ship the most reliable systems.

The mental model that helps: your eval pipeline is your agent’s immune system. When it’s good, it catches the regressions before they reach users, quietly and constantly. When it’s bad, it gives you false confidence right up until the user forum lights up. The teams that have made peace with the tedious-but-reliable version do not have the most elegant eval dashboards. They have the agents that don’t regress.

Further reading: Anthropic’s Claude 3 model card sketches the structure of their internal eval practice; subsequent model releases follow the same pattern. The OpenAI evals repo is a useful template for the mechanics. For the academic angle on LLM-as-judge limitations, see Zheng et al. (2023), “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena”.