datarekha

LLM evals & LLM-as-judge

Vibes don't ship. How to build eval suites that catch regressions, when to use an LLM as a grader, and how to stop position, verbosity, and self-preference bias from poisoning your scores.

8 min read · Intermediate · Generative AI · Lesson 18 of 33

What you'll learn

  • The three places evals run — offline, in CI, and online on live traffic
  • When to use exact metrics vs an LLM-as-judge, and how reliable judges are
  • The biases that poison LLM judges (position, verbosity, self-preference) and how to control them

“It seems better” is how most LLM features are shipped, and it’s how most of them quietly regress. The moment you change a prompt, swap a model, or tweak retrieval, something that used to work breaks — and you won’t know unless you measure. Evaluation has gone from a research afterthought to the production gate, and being able to build a real eval suite is now one of the most job-relevant LLM skills there is.

Evals run in three places

  • Offline — a fixed dataset of inputs with known-good outputs (or rubrics). You run it whenever you change anything: prompt, model, parameters. This is your regression net.
  • In CI — the offline suite as a gate: a pull request that drops the score below a threshold doesn’t merge. Same idea as unit tests, for model behavior.
  • Online — metrics on live traffic: thumbs up/down, task success, escalation rate, latency. The ground truth your offline set only approximates.
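As a rough illustration of the online side, the sketch below rolls a request log up into the rates named above. The `Interaction` schema and its field names are hypothetical; a real logging pipeline will have its own shape.

```python
from dataclasses import dataclass
from statistics import median
from typing import Optional


@dataclass
class Interaction:
    """One logged request (hypothetical schema, for illustration only)."""
    thumbs: Optional[int]   # +1, -1, or None if the user gave no feedback
    task_succeeded: bool
    escalated: bool
    latency_ms: float


def online_metrics(log: list[Interaction]) -> dict[str, float]:
    """Aggregate a non-empty log into simple online metrics."""
    if not log:
        raise ValueError("log must contain at least one interaction")
    n = len(log)
    # Only interactions with explicit feedback count toward the thumbs-up rate.
    rated = [i.thumbs for i in log if i.thumbs is not None]
    return {
        "thumbs_up_rate": sum(t > 0 for t in rated) / len(rated) if rated else float("nan"),
        "task_success_rate": sum(i.task_succeeded for i in log) / n,
        "escalation_rate": sum(i.escalated for i in log) / n,
        "median_latency_ms": median(i.latency_ms for i in log),
    }
```

Tracking these alongside the offline score shows whether the offline set is still a good proxy for what users actually experience.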

What kind of metric?

Match the metric to the task:

  • Deterministic / exact — when there’s a right answer: exact match, JSON validity, regex, did-it-call-the-right-tool, unit tests for generated code. Cheap, fast, unambiguous. Use these whenever you can.
  • Reference-based — compare to a gold answer (embedding similarity, ROUGE). Rough, but useful for summarization-style tasks.
  • LLM-as-judge — for open-ended quality (helpfulness, tone, faithfulness) where no exact metric exists, ask a strong model to grade against a rubric.
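The deterministic checks are usually only a few lines each. Here is a minimal sketch of four of them using the standard library; the tool-call record shape (`{"name": ..., "arguments": ...}`) is an assumption for illustration, not a specific vendor's format.

```python
import json
import re


def exact_match(output: str, expected: str) -> bool:
    """True if the answers match after trimming whitespace and ignoring case."""
    return output.strip().lower() == expected.strip().lower()


def is_valid_json(output: str) -> bool:
    """True if the output parses as JSON."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False


def matches_pattern(output: str, pattern: str) -> bool:
    """True if the whole (trimmed) output matches the regex."""
    return re.fullmatch(pattern, output.strip()) is not None


def called_right_tool(tool_call: dict, expected_tool: str) -> bool:
    """Assumes a hypothetical tool-call record like {"name": ..., "arguments": ...}."""
    return tool_call.get("name") == expected_tool
```

Each check returns a plain boolean, so results can be averaged into a pass rate without any model call.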

LLM-as-judge: powerful, and full of traps

A capable model grading outputs agrees with human raters a surprising amount of the time, often about as well as two humans agree with each other, and it scales to thousands of examples for cents. But an unguarded judge has systematic biases that quietly poison your scores. The most notorious is position bias: a pairwise judge tends to favor whichever answer it sees first, so the verdict can flip when you simply swap the order. Two others are verbosity bias (preferring longer answers) and self-preference bias (favoring answers written in the judge’s own style).
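To see the flip in a controlled setting, here is a toy sketch. `toy_biased_judge` is a made-up stand-in for a real model call: each answer is reduced to a hypothetical quality number, and the judge adds an invented bonus to whichever answer is shown first. It only illustrates the symptom, not how any real judge works internally.

```python
FIRST_POSITION_BONUS = 1.0  # made-up constant that models the bias


def toy_biased_judge(first_quality: float, second_quality: float) -> str:
    """Stand-in for a real judge call; returns "first" or "second"."""
    if first_quality + FIRST_POSITION_BONUS >= second_quality:
        return "first"
    return "second"


def verdict(quality_a: float, quality_b: float, a_first: bool) -> str:
    """Map the positional verdict back to the answer label "A" or "B"."""
    if a_first:
        return "A" if toy_biased_judge(quality_a, quality_b) == "first" else "B"
    return "B" if toy_biased_judge(quality_b, quality_a) == "first" else "A"


# Close call: the winner depends on which answer is shown first.
print(verdict(7.0, 7.5, a_first=True))   # A
print(verdict(7.0, 7.5, a_first=False))  # B

# Clear gap: B wins in either order.
print(verdict(5.0, 9.0, a_first=True))   # B
print(verdict(5.0, 9.0, a_first=False))  # B
```

The bias matters most on close calls, which are often exactly the comparisons an eval is meant to settle.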

The standard playbook for controlling these biases:

  • Give it a rubric. “Rate 1–5 on factual accuracy, then on clarity” beats “is this good?” A concrete rubric shrinks bias and noise dramatically.
  • Average over both orderings (or randomize position) to cancel position bias.
  • Validate the judge against human labels. A judge can be perfectly consistent and still consistently wrong — so spot-check its agreement with humans before you trust it.
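Two of these fixes can be sketched in a few lines. The code below uses the "count only consistent wins" variant of judging both orderings (a disagreement becomes a tie rather than an averaged score), plus a simple agreement rate against human labels. The `Judge` signature is an assumption: any callable that takes a question and two answers and returns `"first"` or `"second"`.

```python
from typing import Callable, Sequence

# Hypothetical judge interface: (question, first_answer, second_answer) -> "first" | "second"
Judge = Callable[[str, str, str], str]


def debiased_verdict(judge: Judge, question: str, answer_a: str, answer_b: str) -> str:
    """Judge both orderings; return "A", "B", or "tie" if the two runs disagree."""
    a_wins_when_first = judge(question, answer_a, answer_b) == "first"
    a_wins_when_second = judge(question, answer_b, answer_a) == "second"
    if a_wins_when_first and a_wins_when_second:
        return "A"
    if not a_wins_when_first and not a_wins_when_second:
        return "B"
    return "tie"  # the verdict followed the position, not the content


def agreement_rate(judge_labels: Sequence[str], human_labels: Sequence[str]) -> float:
    """Fraction of examples where the judge and the human label match."""
    if len(judge_labels) != len(human_labels) or not human_labels:
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)
```

A judge that always picks the first answer produces only ties under `debiased_verdict`, which makes pure position bias visible instead of letting it leak into the scores. The cost is two judge calls per comparison.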

A minimal eval harness

The shape of every eval suite: a dataset, a scorer, an aggregate, a threshold.
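A minimal sketch of that shape in plain Python follows. The `Example` type, `fake_model`, and the threshold values are all hypothetical placeholders; in practice the model callable would wrap your real prompt and model call, and the scorer could be any deterministic check or judge.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Example:
    input: str
    expected: str


def exact_match_score(output: str, expected: str) -> float:
    """Scorer: 1.0 on a case-insensitive exact match, else 0.0."""
    return float(output.strip().lower() == expected.strip().lower())


def run_suite(
    model: Callable[[str], str],
    dataset: list[Example],
    scorer: Callable[[str, str], float],
    threshold: float,
) -> tuple[float, bool]:
    """Score every example, aggregate with the mean, compare to the threshold."""
    if not dataset:
        raise ValueError("dataset must not be empty")
    scores = [scorer(model(ex.input), ex.expected) for ex in dataset]
    mean_score = sum(scores) / len(scores)
    return mean_score, mean_score >= threshold


def ci_exit_code(passed: bool) -> int:
    """0 lets the pipeline continue; non-zero fails the job."""
    return 0 if passed else 1


# Dataset: fixed inputs with known-good outputs.
DATASET = [
    Example("Capital of France?", "Paris"),
    Example("2 + 2 =", "4"),
    Example("Opposite of hot?", "cold"),
]


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for the system under test; gets one answer wrong."""
    canned = {"Capital of France?": "Paris", "2 + 2 =": "4", "Opposite of hot?": "warm"}
    return canned[prompt]


score, passed = run_suite(fake_model, DATASET, exact_match_score, threshold=0.8)
print(f"score={score:.2f} passed={passed}")  # score=0.67 passed=False
```

In a CI job, passing `ci_exit_code(passed)` to `sys.exit` is one simple way to turn the result into a merge gate.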

Quick check

Q1. When should you prefer a deterministic metric over an LLM-as-judge?
Q2. What is position bias in an LLM judge, and how do you control it?
Q3. Why validate an LLM judge against human labels even if it's consistent?

Next

Evals are how you know anything works. Two close relatives: the science of hallucination & grounding (measuring faithfulness specifically), and RAG evaluations for retrieval systems.

Practice this in an interview

How do you evaluate LLM outputs, and what is LLM-as-a-judge?

LLM evaluation combines reference-based metrics like BLEU and ROUGE, task benchmarks like MMLU and HumanEval, and human or model-based judgment of qualities like helpfulness and faithfulness. LLM-as-a-judge uses a strong model to score or compare outputs against a rubric, scaling human-like evaluation cheaply but requiring care because the judge can be unreliable.

What biases affect LLM-as-a-judge evaluations, and how do you mitigate position bias?

LLM judges suffer from position bias (favoring a given ordering), verbosity bias (preferring longer answers), and self-enhancement bias (favoring their own style); position bias alone can cause large inconsistency. Mitigations include evaluating both orderings and only counting consistent wins, using rubrics, averaging multiple judges, and calibrating against human labels.

How do you evaluate the quality of an LLM or RAG system?

Evaluation splits into retrieval quality (did we fetch the right chunks?) and generation quality (did the model use them correctly?). Key metrics are context precision/recall for retrieval and faithfulness plus answer relevance for generation. Frameworks like RAGAS automate LLM-as-judge scoring; human annotation anchors the ground truth.
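For the retrieval half, a simplified set-based version of precision and recall over chunk IDs can be sketched as below. This assumes you have labeled which chunks are relevant for each query; frameworks such as RAGAS may define their context metrics differently (for example, using an LLM judge or ranking information), so treat this as an illustration of the idea only.

```python
def context_precision_recall(
    retrieved_ids: list[str], relevant_ids: list[str]
) -> tuple[float, float]:
    """Set-based precision and recall of retrieved chunks against labeled relevant chunks."""
    retrieved, relevant = set(retrieved_ids), set(relevant_ids)
    hits = len(retrieved & relevant)
    # Precision: what share of the fetched chunks were relevant?
    precision = hits / len(retrieved) if retrieved else 0.0
    # Recall: what share of the relevant chunks did we fetch?
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```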

How does LLMOps differ from classical MLOps, and what new operational challenges do LLMs introduce?

LLMOps extends classical MLOps to handle foundation model scale, prompt-based configuration, non-deterministic outputs, and evaluation without a scalar ground truth. Key new concerns include prompt versioning, output quality evaluation via LLM judges or human review, hallucination monitoring, cost management, and RAG pipeline observability.
