datarekha
NLP & LLMs | Medium | Asked at Cohere, Databricks, Anthropic

How do you evaluate the quality of an LLM or RAG system?

The short answer

Evaluation splits into retrieval quality (did we fetch the right chunks?) and generation quality (did the model use them correctly?). Key metrics are context precision/recall for retrieval and faithfulness plus answer relevance for generation. Frameworks like RAGAS automate LLM-as-judge scoring; human annotation anchors the ground truth.

How to think about it

A RAG system has two main failure modes: the retriever returns irrelevant chunks, or the generator hallucinates despite good retrieval. Evaluation must cover both independently, otherwise a low end-to-end score does not tell you which component to fix.

Retrieval metrics

Metric | Definition
Context Precision | Fraction of retrieved chunks that are actually relevant
Context Recall | Fraction of the relevant information that is covered by the retrieved chunks
MRR (Mean Reciprocal Rank) | Average over queries of 1 / rank of the first relevant chunk
NDCG@k | Graded relevance across the top-k retrieved chunks, discounted by rank and normalized by the ideal ordering
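The retrieval metrics above can be sketched in plain Python. This is a minimal illustration, assuming you already have per-chunk relevance labels for each query (1/0 for binary metrics, graded gains for NDCG); all function names are hypothetical. Note that this NDCG computes the ideal ordering from the retrieved list only, which is a simplification, and that RAGAS's own context precision is an LLM-judged, rank-aware variant rather than this set-based one.

```python
import math

def precision_at_k(relevance: list[int], k: int) -> float:
    """Fraction of the top-k retrieved chunks labeled relevant (label > 0)."""
    top = relevance[:k]
    return sum(1 for r in top if r > 0) / len(top) if top else 0.0

def recall(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Fraction of the known-relevant chunks that were retrieved."""
    if not relevant_ids:
        return 0.0
    return len(set(retrieved_ids) & relevant_ids) / len(relevant_ids)

def reciprocal_rank(relevance: list[int]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none is relevant."""
    for rank, r in enumerate(relevance, start=1):
        if r > 0:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(per_query_relevance: list[list[int]]) -> float:
    """Average reciprocal rank over all queries."""
    if not per_query_relevance:
        return 0.0
    total = sum(reciprocal_rank(r) for r in per_query_relevance)
    return total / len(per_query_relevance)

def dcg_at_k(gains: list[float], k: int) -> float:
    """Discounted cumulative gain: gain at rank i is divided by log2(i + 1)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(gains: list[float], k: int) -> float:
    """DCG normalized by the DCG of the best possible ordering.

    Assumption: the ideal ordering is built from the retrieved list only.
    """
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0
```

For example, `mean_reciprocal_rank([[0, 1, 1], [1, 0, 0]])` averages a reciprocal rank of 1/2 and 1, and `ndcg_at_k` returns 1.0 whenever the retrieved chunks are already sorted by decreasing gain.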

Generation metrics

Metric | Definition
Faithfulness | Is every claim in the answer supported by a retrieved chunk?
Answer Relevance | Does the answer address the question?
Answer Correctness | Is the answer factually right (vs. a gold reference)?
Hallucination Rate | Fraction of answers containing unsupported claims
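Faithfulness and hallucination rate reduce to simple ratios once each answer has been split into claims and each claim judged as supported or not. The sketch below assumes those claim-level judgments already exist (from a human annotator or an LLM judge); the function names and the input shape are hypothetical.

```python
def faithfulness_score(claim_supported: list[bool]) -> float:
    """Fraction of an answer's claims that are supported by retrieved chunks."""
    if not claim_supported:
        # Assumption: an answer with no extracted claims has nothing unsupported.
        return 1.0
    return sum(claim_supported) / len(claim_supported)

def hallucination_rate(answers: list[list[bool]]) -> float:
    """Fraction of answers that contain at least one unsupported claim."""
    if not answers:
        return 0.0
    flagged = sum(1 for claims in answers if not all(claims))
    return flagged / len(answers)

# Illustrative, made-up judgments: one inner list per answer, one bool per claim.
judgments = [[True, True], [True, False], [True]]
per_answer = [faithfulness_score(claims) for claims in judgments]
overall = hallucination_rate(judgments)
```

Faithfulness is scored per answer, while hallucination rate is a dataset-level number, so the two can move differently: a single unsupported claim in a long answer barely lowers faithfulness but still counts the whole answer as hallucinated.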

RAGAS — automated LLM-as-judge

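# Assumes ragas 0.1.x-style imports and column names; newer releases may differ.
# The metrics call an LLM judge, so a model API key (e.g. OPENAI_API_KEY) must be configured.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

# One row per evaluated query; "contexts" holds the list of retrieved chunks.
data = {
    "question": ["What is the refund policy?"],
    "answer": ["Refunds are processed within 7 business days."],
    "contexts": [["Our refund policy allows returns within 30 days..."]],
    "ground_truth": ["Refunds take up to 7 business days after approval."],
}
# Scores each row on the selected metrics and aggregates the results.
result = evaluate(Dataset.from_dict(data), metrics=[
    faithfulness, answer_relevancy, context_precision
])
print(result)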

Online evaluation

Instrument production with:

  • User thumbs-up / thumbs-down signals.
  • Citation grounding — did the model cite chunks that actually contain the claim?
  • Latency and cost per query (tokens in + out).
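The signals above can be rolled up from per-query logs. This is a rough sketch, assuming each production request is logged with the fields shown; the `QueryLog` record, the `summarize` helper, and the per-token prices are all hypothetical and should be replaced with your own schema and your provider's actual pricing.

```python
import math
from dataclasses import dataclass
from typing import Optional

# Hypothetical prices in dollars per 1,000 tokens; substitute real values.
PRICE_IN_PER_1K = 0.001
PRICE_OUT_PER_1K = 0.002

@dataclass
class QueryLog:
    thumbs_up: Optional[bool]   # None when the user gave no feedback
    latency_ms: float
    tokens_in: int
    tokens_out: int
    citations_total: int        # citations the model emitted
    citations_grounded: int     # citations whose chunk contains the claim

def query_cost(log: QueryLog) -> float:
    """Cost of one query from input and output token counts."""
    return (log.tokens_in / 1000 * PRICE_IN_PER_1K
            + log.tokens_out / 1000 * PRICE_OUT_PER_1K)

def summarize(logs: list[QueryLog]) -> dict:
    """Aggregate feedback, grounding, latency, and cost over a batch of logs."""
    if not logs:
        raise ValueError("need at least one log entry")
    rated = [log.thumbs_up for log in logs if log.thumbs_up is not None]
    cited = sum(log.citations_total for log in logs)
    grounded = sum(log.citations_grounded for log in logs)
    latencies = sorted(log.latency_ms for log in logs)
    # Nearest-rank 95th percentile.
    p95_index = max(0, math.ceil(0.95 * len(latencies)) - 1)
    return {
        "thumbs_up_rate": sum(rated) / len(rated) if rated else None,
        "citation_grounding_rate": grounded / cited if cited else None,
        "p95_latency_ms": latencies[p95_index],
        "mean_cost": sum(query_cost(log) for log in logs) / len(logs),
    }
```

Thumbs-up rate is computed only over queries that received feedback, since most users never rate a response; reporting the feedback coverage alongside it helps avoid over-reading a small sample.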

Ground truth construction

Bootstrap a labeled set by: (1) sampling 200–500 real queries, (2) having SMEs label relevant chunks and gold answers, (3) using those to calibrate LLM-as-judge prompts so they correlate with human scores (Pearson r above 0.85 is a reasonable bar).
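The calibration check in step (3) amounts to correlating judge scores with human scores on the same examples. Below is a minimal sketch with a hand-rolled Pearson correlation; the helper names and the sample scores are made up for illustration, and the 0.85 threshold is the bar suggested above.

```python
import math

def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)

def judge_is_calibrated(human: list[float], judge: list[float],
                        threshold: float = 0.85) -> bool:
    """True when the judge's scores correlate with human scores above threshold."""
    return pearson_r(human, judge) > threshold

# Made-up scores on a 1-5 scale for five labeled examples.
human_scores = [1, 2, 3, 4, 5]
judge_scores = [1, 2, 3, 5, 4]
r = pearson_r(human_scores, judge_scores)
```

If the correlation falls short, the usual next step is to revise the judge prompt (rubric wording, few-shot examples) and re-score the same labeled set, rather than relabeling the human data.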
