What biases affect LLM-as-a-judge evaluations, and how do you mitigate position bias?

LLM judges suffer from position bias (favoring an answer because of where it appears in the prompt), verbosity bias (preferring longer answers), and self-enhancement bias (favoring their own style); position bias alone can flip a pairwise verdict when the same two answers are simply swapped. Mitigations include evaluating both orderings and only counting consistent wins, using rubrics, averaging multiple judges, and calibrating against human labels.
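
The "evaluate both orderings and only count consistent wins" idea can be sketched as follows. This is a minimal illustration, not a definitive implementation: `toy_judge` is a hypothetical stand-in for a real model call (it simply prefers the longer answer), and the `"first"`/`"second"` return convention is an assumption made for the example.

```python
from collections import Counter
from typing import Callable

# Assumed judge signature: (question, first_answer, second_answer) -> "first" | "second"
Judge = Callable[[str, str, str], str]

def toy_judge(question: str, first: str, second: str) -> str:
    """Hypothetical stand-in for a model call: prefers the longer answer, first slot on ties."""
    return "second" if len(second) > len(first) else "first"

def consistent_verdict(judge: Judge, question: str, a: str, b: str) -> str:
    """Ask in both orders; return 'A' or 'B' only when both orderings agree."""
    a_wins_when_first = judge(question, a, b) == "first"    # A shown first
    a_wins_when_second = judge(question, b, a) == "second"  # A shown second
    if a_wins_when_first and a_wins_when_second:
        return "A"
    if not a_wins_when_first and not a_wins_when_second:
        return "B"
    return "inconsistent"  # verdict depended on position, so it is not counted as a win

pairs = [
    ("What is 2+2?", "4", "The answer is 4."),
    ("Capital of France?", "Paris", "Lyon!"),
]
tally = Counter(consistent_verdict(toy_judge, q, a, b) for q, a, b in pairs)
print(dict(tally))  # {'B': 1, 'inconsistent': 1}
```

The second pair is reported as inconsistent because the toy judge falls back to position when the answers are the same length, which is exactly the kind of verdict this check is meant to discard.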

How do you evaluate LLM outputs, and what is LLM-as-a-judge?

LLM evaluation combines reference-based metrics like BLEU and ROUGE, task benchmarks like MMLU and HumanEval, and human or model-based judgment of qualities like helpfulness and faithfulness. LLM-as-a-judge uses a strong model to score or compare outputs against a rubric, scaling human-like evaluation cheaply but requiring care because the judge can be unreliable.
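
To give a feel for what a reference-based metric computes, here is a deliberately simplified unigram-overlap F1 in the spirit of ROUGE-1. It is a sketch under stated assumptions (whitespace tokenization, lowercasing, no stemming) and not the official ROUGE or BLEU implementation; the function name `unigram_f1` is made up for this example.

```python
from collections import Counter

def unigram_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1-style F1: clipped unigram overlap on lowercased whitespace tokens."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # each shared token counted at its minimum frequency
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

reference = "the cat sat on the mat"
print(unigram_f1("the cat sat on the mat", reference))        # identical strings: 1.0
print(round(unigram_f1("a cat sat on a mat", reference), 3))  # 4 of 6 tokens shared: 0.667
print(unigram_f1("dogs bark loudly", reference))              # no overlap: 0.0
```

Scores like these reward surface overlap only, which is why they are usually paired with human or model-based judgment for qualities such as helpfulness and faithfulness.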

How do you evaluate the quality of an LLM or RAG system?

Evaluation splits into retrieval quality (did we fetch the right chunks?) and generation quality (did the model use them correctly?). Key metrics are context precision/recall for retrieval and faithfulness plus answer relevance for generation. Frameworks like RAGAS automate LLM-as-judge scoring; human annotation anchors the ground truth.
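
As a rough illustration of the retrieval half, the sketch below computes set-based precision and recall over chunk IDs. It assumes you already have human-labelled relevant chunks for each query; the function name and the IDs are hypothetical, and frameworks such as RAGAS define their context metrics differently (typically with an LLM judging relevance), so treat this as the simplest possible version.

```python
def retrieval_precision_recall(retrieved: list[str], relevant: set[str]) -> tuple[float, float]:
    """Precision: share of retrieved chunks that are relevant.
    Recall: share of relevant chunks that were retrieved."""
    hits = [chunk_id for chunk_id in retrieved if chunk_id in relevant]
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(set(hits)) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical chunk IDs for a single query.
retrieved = ["doc3", "doc7", "doc1", "doc9"]
relevant = {"doc1", "doc3", "doc5"}

p, r = retrieval_precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.67
```

Low precision means the model is being handed noise; low recall means the right chunk never reached it, so no amount of prompt tuning on the generation side can fix the answer.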

How does LLMOps differ from classical MLOps, and what new operational challenges do LLMs introduce?

LLMOps extends classical MLOps to handle foundation model scale, prompt-based configuration, non-deterministic outputs, and evaluation without a scalar ground truth. Key new concerns include prompt versioning, output quality evaluation via LLM judges or human review, hallucination monitoring, cost management, and RAG pipeline observability.

LLM evals & LLM-as-judge — Generative AI

“It seems better” is how most LLM features are shipped, and it’s how most of them quietly regress. The moment you change a prompt, swap a model, or tweak retrieval, something that used to work breaks — and you won’t know unless you measure. Evaluation has gone from a research afterthought to the production gate, and being able to build a real eval suite is now one of the most job-relevant LLM skills there is.

Evals run in three places

Offline — a fixed dataset of inputs with known-good outputs (or rubrics). You run it whenever you change anything: prompt, model, parameters. This is your regression net.
In CI — the offline suite as a gate: a pull request that drops the score below a threshold doesn’t merge. Same idea as unit tests, for model behavior.
Online — metrics on live traffic: thumbs up/down, task success, escalation rate, latency. The ground truth your offline set only approximates.
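
For the online side, a minimal sketch of turning logged events into the metrics listed above might look like this. The event schema (`thumbs_up`, `escalated`, `latency_ms`) is an assumption for illustration, not a standard format, and the function expects a non-empty list.

```python
from statistics import median

# Hypothetical event log: one dict per request. thumbs_up is None when the user gave no rating.
events = [
    {"thumbs_up": True,  "escalated": False, "latency_ms": 400},
    {"thumbs_up": None,  "escalated": False, "latency_ms": 600},
    {"thumbs_up": False, "escalated": True,  "latency_ms": 1500},
    {"thumbs_up": True,  "escalated": False, "latency_ms": 500},
]

def online_summary(events: list[dict]) -> dict:
    """Aggregate live-traffic signals; the thumbs-up rate only counts events that were rated."""
    rated = [e for e in events if e["thumbs_up"] is not None]
    return {
        "thumbs_up_rate": sum(e["thumbs_up"] for e in rated) / len(rated) if rated else None,
        "escalation_rate": sum(e["escalated"] for e in events) / len(events),
        "median_latency_ms": median(e["latency_ms"] for e in events),
    }

summary = online_summary(events)
print(summary)
```

Tracking these per prompt or model version is what lets you compare a live rollout against the offline score that approved it.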

What kind of metric?

Match the metric to the task:

Deterministic / exact — when there’s a right answer: exact match, JSON validity, regex, did-it-call-the-right-tool, unit tests for generated code. Cheap, fast, unambiguous. Use these whenever you can.
Reference-based — compare to a gold answer (embedding similarity, ROUGE). Rough, but useful for summarization-style tasks.
LLM-as-judge — for open-ended quality (helpfulness, tone, faithfulness) where no exact metric exists, ask a strong model to grade against a rubric.
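
The deterministic checks in the first bullet are cheap to write. The sketch below shows one plausible shape for each; the helper names and the tool-call dict layout (`{"name": ..., "arguments": ...}`) are assumptions for illustration rather than any particular provider's format.

```python
import json
import re

def exact_match(output: str, expected: str) -> bool:
    """Case- and whitespace-insensitive exact match."""
    return output.strip().lower() == expected.strip().lower()

def is_valid_json_with_keys(output: str, required_keys: set[str]) -> bool:
    """True if the output parses as a JSON object containing every required key."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required_keys.issubset(parsed.keys())

def matches_pattern(output: str, pattern: str) -> bool:
    """True if the whole (stripped) output matches the regex."""
    return re.fullmatch(pattern, output.strip()) is not None

def called_tool(call: dict, expected_name: str) -> bool:
    """True if a (hypothetical) tool-call record names the expected tool."""
    return call.get("name") == expected_name

print(exact_match(" Paris ", "paris"))                                          # True
print(is_valid_json_with_keys('{"city": "Paris", "country": "FR"}', {"city"}))  # True
print(is_valid_json_with_keys("city: Paris", {"city"}))                         # False
print(matches_pattern("2024-05-17", r"\d{4}-\d{2}-\d{2}"))                      # True
print(called_tool({"name": "get_weather", "arguments": {"city": "Paris"}}, "get_weather"))  # True
```

Each check returns a plain boolean, so results can be averaged into a pass rate without any model call.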

LLM-as-judge: powerful, and full of traps

A capable model grading outputs agrees with human raters surprisingly often (often about as well as two humans agree with each other), and it scales to thousands of examples for cents. But an unguarded judge has systematic biases that quietly poison your scores. The most notorious is position bias: a pairwise judge tends to favor an answer because of where it appears, commonly whichever one it sees first. Keep the two answers identical and just swap their order, and the verdict can flip.

The fixes for that flip are the standard playbook:

Give it a rubric. “Rate 1–5 on factual accuracy, then on clarity” beats “is this good?” A concrete rubric shrinks bias and noise dramatically.
Average over both orderings (or randomize position) to cancel position bias.
Validate the judge against human labels. A judge can be perfectly consistent and still consistently wrong — so spot-check its agreement with humans before you trust it.
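
The last point, validating the judge against human labels, can be sketched with raw agreement plus Cohen's kappa, which corrects agreement for what two raters would reach by chance. The labels below are invented for illustration, and `agreement_and_kappa` is a hand-rolled helper written for this example, not a library function.

```python
from collections import Counter

def agreement_and_kappa(judge_labels: list[str], human_labels: list[str]) -> tuple[float, float]:
    """Return (raw agreement, Cohen's kappa) for two equally long label lists."""
    if not judge_labels or len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be non-empty and the same length")
    n = len(judge_labels)
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    judge_freq, human_freq = Counter(judge_labels), Counter(human_labels)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(judge_freq[label] * human_freq[label] for label in judge_freq) / (n * n)
    kappa = (observed - expected) / (1 - expected) if expected < 1 else 1.0
    return observed, kappa

# Invented labels for eight examples.
human = ["good", "good", "bad", "good", "bad", "bad", "good", "bad"]
judge = ["good", "good", "good", "good", "bad", "good", "good", "bad"]
lenient = ["good"] * 8  # perfectly consistent, but says "good" to everything

print(agreement_and_kappa(judge, human))    # (0.75, 0.5)
print(agreement_and_kappa(lenient, human))  # (0.5, 0.0)
```

The always-"good" judge is the "consistent and still consistently wrong" case: it never contradicts itself, yet its kappa of 0 shows it carries no information about what humans actually think.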

A minimal eval harness

The shape of every eval suite: a dataset, a scorer, an aggregate, a threshold.

# A tiny offline eval set: inputs with a deterministic check.
cases = [
    {"q": "2+2", "expected": "4"},
    {"q": "capital of France", "expected": "Paris"},
    {"q": "is 7 prime?", "expected": "yes"},
    {"q": "color of the sky", "expected": "blue"},
]

# Two candidate "model versions" (canned outputs) to compare.
v1 = {"2+2": "4", "capital of France": "Paris", "is 7 prime?": "no",  "color of the sky": "grey"}
v2 = {"2+2": "4", "capital of France": "Paris", "is 7 prime?": "yes", "color of the sky": "blue"}

def score(outputs):
    hits = sum(outputs[c["q"]].lower() == c["expected"].lower() for c in cases)
    return hits / len(cases)

s1, s2 = score(v1), score(v2)
print(f"v1 accuracy: {s1:.0%}")
print(f"v2 accuracy: {s2:.0%}")
threshold = 0.75
for name, s in [("v1", s1), ("v2", s2)]:
    print(f"  {name}: {'PASS' if s >= threshold else 'FAIL'} (gate = {threshold:.0%})")
print("\nThis is what runs in CI on every prompt/model change.")

v1 accuracy: 50%
v2 accuracy: 100%
  v1: FAIL (gate = 75%)
  v2: PASS (gate = 75%)

This is what runs in CI on every prompt/model change.

That FAIL is the whole point. v1 gets two of four cases wrong (the prime check and the sky colour), so it scores 50% and trips the 75% gate; v2 fixes both and scores 100%. Wire that gate into CI and a pull request that quietly drops accuracy simply cannot merge — the regression is caught before a user ever sees it, no vibes required.
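
Printing PASS or FAIL is not enough to block a merge; a CI job typically fails when the script exits with a non-zero status. One way to wire that up is sketched below. The `gate` helper is hypothetical, and the scores are the ones from the run above.

```python
import sys

def gate(scores: dict[str, float], threshold: float) -> int:
    """Return a process exit code: 0 if every score clears the threshold, 1 otherwise."""
    failures = {name: s for name, s in scores.items() if s < threshold}
    for name, s in failures.items():
        # Report to stderr so the reason shows up in the CI log.
        print(f"FAIL {name}: {s:.0%} < {threshold:.0%}", file=sys.stderr)
    return 1 if failures else 0

print(gate({"v1": 0.50}, threshold=0.75))  # 1 (would fail the build)
print(gate({"v2": 1.00}, threshold=0.75))  # 0 (would pass)

# In a real CI script the final line would be something like:
#     sys.exit(gate(scores, threshold))
```

Returning the exit code from a function, rather than calling `sys.exit` inside it, keeps the gate easy to unit-test.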

In one breath

“It seems better” is how features quietly regress — evals turn that into a measurement.
Evals run in three places: offline (regression net), in CI (a merge gate), and online (live-traffic ground truth).
Match the metric to the task: deterministic/exact when there’s a right answer, reference-based for fuzzy matches, LLM-as-judge only for open-ended quality.
An unguarded judge is biased — position, verbosity, self-preference, leniency drift — so use a rubric, average both orderings, and don’t judge a model with its own family.
A judge can be perfectly consistent and still wrong: validate against human labels before you trust it.
Every harness is the same shape: dataset → scorer → aggregate → threshold.

Quick check

Q1. When should you prefer a deterministic metric over an LLM-as-judge?

Q2. What is position bias in an LLM judge, and how do you control it?

Q3. Why validate an LLM judge against human labels even if it's consistent?

Evals are how you know anything works. Two close relatives: the science of hallucination & grounding (measuring faithfulness specifically), and RAG evaluations for retrieval systems.
