NLP & LLMs | Medium | Asked at Cohere, Databricks, Anthropic

How do you evaluate the quality of an LLM or RAG system?

For: AI / LLM Engineer, ML Engineer, Data Scientist

The short answer

Evaluation splits into retrieval quality (did we fetch the right chunks?) and generation quality (did the model use them correctly?). Key metrics are context precision/recall for retrieval and faithfulness plus answer relevance for generation. Frameworks like RAGAS automate LLM-as-judge scoring; human annotation anchors the ground truth.

How to think about it

A RAG system has two failure modes: the retriever returns irrelevant chunks, or the generator hallucinates despite good retrieval. Evaluation must cover both independently.

Retrieval metrics

Metric	Definition
Context Precision	Fraction of retrieved chunks that are actually relevant
Context Recall	Fraction of relevant information that is covered by retrieved chunks
MRR (Mean Reciprocal Rank)	Mean over queries of 1/rank of the first relevant chunk
NDCG@k	Graded relevance of the top-k retrieved chunks, discounted by rank and normalized by the ideal ordering
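To make the table concrete, here is a minimal sketch of these metrics computed from labeled chunk IDs. It assumes you already have human relevance labels; the function names and the chunk IDs are illustrative, not taken from any library. Tools such as RAGAS estimate similar quantities with an LLM judge instead of labeled IDs, so their numbers will not match these set-based versions exactly.

```python
import math

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved chunks that are relevant."""
    top = retrieved[:k]
    if not top:
        return 0.0
    return sum(1 for chunk in top if chunk in relevant) / len(top)

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant chunks that appear in the top-k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def reciprocal_rank(retrieved, relevant):
    """1/rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, chunk in enumerate(retrieved, start=1):
        if chunk in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(queries):
    """Average reciprocal rank over (retrieved, relevant) pairs."""
    if not queries:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)

def ndcg_at_k(retrieved, gains, k):
    """NDCG@k with graded gains; chunks missing from `gains` count as 0."""
    dcg = sum(gains.get(chunk, 0.0) / math.log2(i + 2)
              for i, chunk in enumerate(retrieved[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Made-up example: four retrieved chunks, three labeled-relevant chunks.
retrieved = ["c3", "c1", "c7", "c2"]
relevant = {"c1", "c2", "c5"}
gains = {"c1": 3.0, "c2": 1.0, "c5": 2.0}  # graded relevance labels

print(precision_at_k(retrieved, relevant, k=4))  # 2 of 4 retrieved are relevant
print(recall_at_k(retrieved, relevant, k=4))     # 2 of 3 relevant were retrieved
print(reciprocal_rank(retrieved, relevant))      # first hit is at rank 2
print(ndcg_at_k(retrieved, gains, k=4))
```

Precision and recall ignore ordering, while reciprocal rank and NDCG reward placing relevant chunks early, which matters when only the first few chunks fit in the prompt.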

Generation metrics

Metric	Definition
Faithfulness	Is every claim in the answer supported by a retrieved chunk?
Answer Relevance	Does the answer address the question?
Answer Correctness	Is the answer factually right (vs. a gold reference)?
Hallucination Rate	Fraction of answers containing unsupported claims
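Faithfulness and hallucination rate can both be derived from claim-level verdicts. The sketch below assumes each answer has already been split into claims and each claim marked supported or unsupported (by a human or an LLM judge); the function names and the sample verdicts are hypothetical.

```python
def faithfulness_score(claim_supported):
    """Fraction of one answer's claims that the retrieved context supports."""
    if not claim_supported:
        # Assumption: an answer that makes no claims is treated as faithful.
        return 1.0
    return sum(1 for ok in claim_supported if ok) / len(claim_supported)

def hallucination_rate(answers):
    """Fraction of answers that contain at least one unsupported claim."""
    if not answers:
        return 0.0
    flagged = sum(1 for claims in answers if not all(claims))
    return flagged / len(answers)

# Made-up verdicts: one inner list per answer, one boolean per claim.
verdicts = [
    [True, True],         # fully supported
    [True, False, True],  # one unsupported claim
    [True],               # fully supported
]

per_answer = [faithfulness_score(v) for v in verdicts]
print(per_answer)
print(hallucination_rate(verdicts))
```

Faithfulness is graded per answer, whereas hallucination rate is a binary flag aggregated over answers, so a single unsupported claim lowers the first slightly but counts fully toward the second.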

RAGAS — automated LLM-as-judge

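# Assumes a ragas 0.1-style API and LLM credentials in the environment
# (by default an OpenAI key); newer releases may expect different column names.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

# One entry per query; "contexts" holds the list of retrieved chunks for that query.
data = {
    "question": ["What is the refund policy?"],
    "answer": ["Refunds are processed within 7 business days."],
    "contexts": [["Our refund policy allows returns within 30 days..."]],
    "ground_truth": ["Refunds take up to 7 business days after approval."],
}

# Each metric is scored by an LLM judge; context_precision also uses ground_truth.
result = evaluate(
    Dataset.from_dict(data),
    metrics=[faithfulness, answer_relevancy, context_precision],
)
print(result)  # aggregate score per metric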

Online evaluation

Instrument production with:

User thumbs-up / thumbs-down signals.
Citation grounding — did the model cite chunks that actually contain the claim?
Latency and cost per query (tokens in + out).
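These signals can be rolled up from request logs. The following is a rough sketch, assuming a hypothetical `QueryLog` record and made-up per-token prices; substitute your own log schema and your provider's actual rates.

```python
import math
from dataclasses import dataclass
from typing import Optional

# Hypothetical prices in USD per 1,000 tokens (not any real provider's rates).
PRICE_IN_PER_1K = 0.001
PRICE_OUT_PER_1K = 0.002

@dataclass
class QueryLog:
    thumbs_up: Optional[bool]  # None when the user gave no feedback
    latency_ms: float
    tokens_in: int
    tokens_out: int
    citations_total: int
    citations_grounded: int    # citations whose chunk contains the claim

def summarize(logs):
    """Aggregate feedback, citation grounding, latency, and cost over logs."""
    if not logs:
        raise ValueError("no logs to summarize")
    rated = [log.thumbs_up for log in logs if log.thumbs_up is not None]
    latencies = sorted(log.latency_ms for log in logs)
    # Nearest-rank 95th percentile.
    p95_index = max(0, math.ceil(0.95 * len(latencies)) - 1)
    total_citations = sum(log.citations_total for log in logs)
    grounded = sum(log.citations_grounded for log in logs)
    costs = [
        log.tokens_in / 1000 * PRICE_IN_PER_1K
        + log.tokens_out / 1000 * PRICE_OUT_PER_1K
        for log in logs
    ]
    return {
        "thumbs_up_rate": sum(rated) / len(rated) if rated else None,
        "citation_grounding_rate":
            grounded / total_citations if total_citations else None,
        "p95_latency_ms": latencies[p95_index],
        "mean_cost_usd": sum(costs) / len(costs),
    }

# Made-up log entries for illustration.
logs = [
    QueryLog(True, 800.0, 1000, 200, 2, 2),
    QueryLog(False, 1200.0, 2000, 500, 2, 1),
    QueryLog(None, 1000.0, 1000, 300, 0, 0),
]
summary = summarize(logs)
print(summary)
```

Queries without feedback are excluded from the thumbs-up rate rather than counted as negative, since most users never rate a response.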

Ground truth construction

Bootstrap a labeled set by: (1) sampling 200–500 real queries, (2) having subject-matter experts (SMEs) label relevant chunks and gold answers, (3) using those labels to calibrate LLM-as-judge prompts until the judge's scores correlate with human scores (Pearson r above 0.85 is a reasonable bar).
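The calibration check in step (3) amounts to correlating two score lists over the same examples. A minimal sketch, with made-up scores and a hand-written `pearson_r` helper so that it needs no third-party packages:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)

# Made-up scores on a 1-5 scale for the same five answers.
human_scores = [1, 2, 3, 4, 5]
judge_scores = [1, 3, 3, 4, 5]

r = pearson_r(human_scores, judge_scores)
THRESHOLD = 0.85  # the bar suggested in the text
print(f"r = {r:.3f}, meets bar: {r > THRESHOLD}")
```

If the correlation falls below the bar, revise the judge prompt (rubric, examples, scoring scale) and re-score the same labeled set before trusting the judge on unlabeled traffic.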
