LLM evals & LLM-as-judge
Vibes don't ship. How to build eval suites that catch regressions, when to use an LLM as a grader, and how to stop position, verbosity, and self-preference bias from poisoning your scores.
What you'll learn
- The three places evals run — offline, in CI, and online on live traffic
- When to use exact metrics vs an LLM-as-judge, and how reliable judges are
- The biases that poison LLM judges (position, verbosity, self-preference) and how to control them
“It seems better” is how most LLM features are shipped, and it’s how most of them quietly regress. The moment you change a prompt, swap a model, or tweak retrieval, something that used to work breaks — and you won’t know unless you measure. Evaluation has gone from a research afterthought to the production gate, and being able to build a real eval suite is now one of the most job-relevant LLM skills there is.
Evals run in three places
- Offline — a fixed dataset of inputs with known-good outputs (or rubrics). You run it whenever you change anything: prompt, model, parameters. This is your regression net.
- In CI — the offline suite as a gate: a pull request that drops the score below a threshold doesn’t merge. Same idea as unit tests, for model behavior.
- Online — metrics on live traffic: thumbs up/down, task success, escalation rate, latency. The ground truth your offline set only approximates.
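As a rough illustration of the CI gate described above, the sketch below compares a run's scores against a stored baseline and reports anything that should block a merge. The metric names, the baseline values, and the `min_score` and `max_drop` parameters are all assumptions made for the example, not part of any particular CI system.

```python
# Hypothetical baseline; in practice this would be the last accepted run of the offline suite.
BASELINE = {"exact_match": 0.82, "json_valid": 0.99}

def gate(current, baseline, min_score=0.80, max_drop=0.02):
    """Return a list of failure messages; an empty list means the change may merge."""
    failures = []
    for metric, base in baseline.items():
        score = current.get(metric)
        if score is None:
            failures.append(f"{metric}: missing from current run")
        elif score < min_score:
            failures.append(f"{metric}: {score:.2f} is below the floor {min_score:.2f}")
        elif base - score > max_drop:
            failures.append(f"{metric}: dropped {base - score:.2f} from baseline {base:.2f}")
    return failures

def exit_code(current, baseline):
    """Map the gate result to a process exit code (non-zero fails the CI job)."""
    return 1 if gate(current, baseline) else 0

# Example run: exact_match has regressed below the floor.
current_scores = {"exact_match": 0.78, "json_valid": 0.99}
for problem in gate(current_scores, BASELINE):
    print("FAIL", problem)
```

A CI job would typically end with something like `raise SystemExit(exit_code(current_scores, BASELINE))` so that a regression fails the pipeline the same way a failing unit test does.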
What kind of metric?
Match the metric to the task:
- Deterministic / exact — when there’s a right answer: exact match, JSON validity, regex, did-it-call-the-right-tool, unit tests for generated code. Cheap, fast, unambiguous. Use these whenever you can.
- Reference-based — compare to a gold answer (embedding similarity, ROUGE). Rough, but useful for summarization-style tasks.
- LLM-as-judge — for open-ended quality (helpfulness, tone, faithfulness) where no exact metric exists, ask a strong model to grade against a rubric.
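The deterministic and reference-based scorers above are usually only a few lines each. The sketch below uses the standard library only; the function names and the assumed shape of a tool call (a dict with a `name` key) are illustrative. `unigram_recall` is a rough stand-in for ROUGE-1 recall without stemming, not the official implementation.

```python
import json
import re
from collections import Counter

def exact_match(output: str, expected: str) -> bool:
    # Ignore case and surrounding whitespace so trivial formatting differences do not fail.
    return output.strip().lower() == expected.strip().lower()

def is_valid_json(output: str, required_keys=()) -> bool:
    # Valid means: parses as JSON, is an object, and contains every required key.
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and all(key in parsed for key in required_keys)

def matches_pattern(output: str, pattern: str) -> bool:
    return re.search(pattern, output) is not None

def called_tool(tool_calls, expected_name: str) -> bool:
    # Assumes each tool call is a dict such as {"name": "search", "arguments": {...}}.
    return any(call.get("name") == expected_name for call in tool_calls)

def unigram_recall(output: str, reference: str) -> float:
    # Fraction of reference words that also appear in the output (counts are clipped).
    ref_tokens = reference.lower().split()
    if not ref_tokens:
        return 0.0
    overlap = Counter(ref_tokens) & Counter(output.lower().split())
    return sum(overlap.values()) / len(ref_tokens)

print(exact_match(" Paris ", "paris"))
print(is_valid_json('{"city": "Paris"}', required_keys=("city",)))
print(unigram_recall("the cat sat", "the cat sat down"))
```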
LLM-as-judge: powerful, and full of traps
A capable model grading outputs often agrees with human raters about as well as two humans agree with each other, and it scales to thousands of examples for cents. But an unguarded judge has systematic biases that quietly poison your scores. The most notorious is position bias: a pairwise judge tends to favor whichever answer it sees first, so the verdict can flip when you swap the order. Verbosity bias (preferring longer answers) and self-preference bias (favoring answers in the judge's own style) distort scores in the same quiet way.
The standard playbook for controlling these biases:
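One way to see position bias is to measure how often a judge's verdict changes when the same two answers are presented in the opposite order. The sketch below does that with a toy judge, which is a stand-in and not a real model: answers carry a hidden quality score, and the toy judge falls back to the first slot whenever the two are close. With a real judge, `toy_judge` would be replaced by a model call that sees only the text.

```python
def toy_judge(answer_1, answer_2):
    """Stand-in for an LLM judge. Returns 1 or 2 for the preferred slot.

    Answers are (text, quality) pairs. When the quality gap is small, the
    toy judge picks slot 1 regardless of content, which mimics position bias.
    """
    q1, q2 = answer_1[1], answer_2[1]
    if abs(q1 - q2) < 2:
        return 1
    return 1 if q1 > q2 else 2

def flip_rate(judge, pairs):
    """Fraction of pairs whose winner changes when the order is swapped."""
    flips = 0
    for a, b in pairs:
        winner_ab = a if judge(a, b) == 1 else b   # a shown first
        winner_ba = b if judge(b, a) == 1 else a   # b shown first
        flips += winner_ab != winner_ba
    return flips / len(pairs)

# Hypothetical answer pairs with made-up quality scores.
pairs = [
    (("Canberra.", 5), ("Sydney.", 1)),                                   # clear gap
    (("Canberra is the capital.", 5), ("The capital is Canberra.", 5)),   # equal quality
    (("Canberra, I think.", 4), ("Canberra.", 5)),                        # close call
    (("Melbourne.", 1), ("Canberra.", 5)),                                # clear gap
]

print(flip_rate(toy_judge, pairs))  # 0.5: the two close pairs flip, the clear ones do not
```

A flip rate well above zero on your own data is a sign that single-order pairwise verdicts should not be trusted as they are.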
- Give it a rubric. “Rate 1–5 on factual accuracy, then on clarity” beats “is this good?” A concrete rubric shrinks bias and noise dramatically.
- Average over both orderings (or randomize position) to cancel position bias.
- Validate the judge against human labels. A judge can be perfectly consistent and still consistently wrong — so spot-check its agreement with humans before you trust it.
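Two of the fixes above can be sketched in a few lines: asking in both orders and counting a win only when the verdicts agree, and checking the judge against human labels with Cohen's kappa (agreement corrected for chance). The `toy_judge` below is a hypothetical stand-in for a model call, and the label lists are made-up illustration data.

```python
from collections import Counter

def toy_judge(answer_1, answer_2):
    """Hypothetical stand-in for a model call. Returns 1 or 2 for the preferred slot.

    Prefers the answer that mentions 'Canberra'; when it cannot tell the two
    apart it falls back to slot 1, which mimics position bias.
    """
    has_1 = "Canberra" in answer_1
    has_2 = "Canberra" in answer_2
    if has_1 != has_2:
        return 1 if has_1 else 2
    return 1

def consistent_winner(judge, a, b):
    """Judge both orderings; return the winner only if it wins both, else None."""
    winner_ab = a if judge(a, b) == 1 else b
    winner_ba = b if judge(b, a) == 1 else a
    return winner_ab if winner_ab == winner_ba else None

def cohens_kappa(judge_labels, human_labels):
    """Chance-corrected agreement between two equal-length label lists."""
    n = len(judge_labels)
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    judge_counts = Counter(judge_labels)
    human_counts = Counter(human_labels)
    expected = sum(
        (judge_counts[label] / n) * (human_counts[label] / n)
        for label in set(judge_counts) | set(human_counts)
    )
    if expected == 1.0:  # both raters always use the same single label
        return 1.0
    return (observed - expected) / (1 - expected)

print(consistent_winner(toy_judge, "Canberra.", "Sydney."))          # Canberra.
print(consistent_winner(toy_judge, "Canberra.", "It is Canberra."))  # None

judge_labels = ["pass", "pass", "fail", "pass", "fail", "fail"]
human_labels = ["pass", "fail", "fail", "pass", "fail", "pass"]
print(round(cohens_kappa(judge_labels, human_labels), 3))  # 0.333
```

Treating inconsistent pairs as ties costs some signal but keeps order effects out of the win rate. A low kappa on a spot-check sample suggests revising the rubric before trusting the judge at scale.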
A minimal eval harness
The shape of every eval suite: a dataset, a scorer, an aggregate, a threshold.
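A minimal sketch of that shape, under stated assumptions: the dataset is a handful of made-up question and answer pairs, `fake_model` is a stand-in for the system under test, the scorer is exact match, the aggregate is the mean, and the threshold is arbitrary.

```python
from dataclasses import dataclass

@dataclass
class Example:
    prompt: str
    expected: str

# 1. Dataset: fixed inputs with known-good outputs (hypothetical examples).
DATASET = [
    Example("capital of France?", "Paris"),
    Example("capital of Japan?", "Tokyo"),
    Example("capital of Australia?", "Canberra"),
    Example("capital of Canada?", "Ottawa"),
]

def fake_model(prompt: str) -> str:
    """Stand-in for the system under test; swap in a real model call."""
    canned = {
        "capital of France?": "Paris",
        "capital of Japan?": "tokyo ",
        "capital of Australia?": "Sydney",  # a deliberate regression
        "capital of Canada?": "Ottawa",
    }
    return canned.get(prompt, "")

# 2. Scorer: one number per example.
def exact_match(output: str, expected: str) -> float:
    return float(output.strip().lower() == expected.strip().lower())

# 3 and 4. Aggregate the scores and compare against a threshold.
def run_eval(model, dataset, scorer, threshold):
    if not dataset:
        raise ValueError("dataset is empty")
    rows = [(ex, model(ex.prompt)) for ex in dataset]
    scores = [scorer(output, ex.expected) for ex, output in rows]
    mean = sum(scores) / len(scores)
    failures = [ex.prompt for (ex, _), score in zip(rows, scores) if score == 0.0]
    return {"mean": mean, "passed": mean >= threshold, "failures": failures}

report = run_eval(fake_model, DATASET, exact_match, threshold=0.9)
print(report)
```

Returning the failing prompts alongside the aggregate matters in practice: the mean tells you that something regressed, and the failure list tells you where to look.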
Next
Evals are how you know anything works. Two close relatives: the science of hallucination & grounding (measuring faithfulness specifically), and RAG evaluations for retrieval systems.
Practice this in an interview
LLM evaluation combines reference-based metrics like BLEU and ROUGE, task benchmarks like MMLU and HumanEval, and human or model-based judgment of qualities like helpfulness and faithfulness. LLM-as-a-judge uses a strong model to score or compare outputs against a rubric, scaling human-like evaluation cheaply but requiring care because the judge can be unreliable.
LLM judges suffer from position bias (favoring a given ordering), verbosity bias (preferring longer answers), and self-preference bias, also called self-enhancement bias (favoring their own style); position bias alone can cause large inconsistency. Mitigations include evaluating both orderings and only counting consistent wins, using rubrics, averaging multiple judges, and calibrating against human labels.
Evaluation splits into retrieval quality (did we fetch the right chunks?) and generation quality (did the model use them correctly?). Key metrics are context precision/recall for retrieval and faithfulness plus answer relevance for generation. Frameworks like RAGAS automate LLM-as-judge scoring; human annotation anchors the ground truth.
LLMOps extends classical MLOps to handle foundation model scale, prompt-based configuration, non-deterministic outputs, and evaluation without a scalar ground truth. Key new concerns include prompt versioning, output quality evaluation via LLM judges or human review, hallucination monitoring, cost management, and RAG pipeline observability.