How do you evaluate the quality of an LLM or RAG system?
Evaluation splits into retrieval quality (did we fetch the right chunks?) and generation quality (did the model use them correctly?). Key metrics are context precision/recall for retrieval and faithfulness plus answer relevance for generation. Frameworks like RAGAS automate LLM-as-judge scoring; human annotation anchors the ground truth.
How to think about it
A RAG system has two failure modes: the retriever returns irrelevant chunks, or the generator hallucinates despite good retrieval. Evaluation must cover both independently.
Retrieval metrics
| Metric | Definition |
|---|---|
| Context Precision | Fraction of retrieved chunks that are actually relevant |
| Context Recall | Fraction of relevant information that is covered by retrieved chunks |
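| MRR (Mean Reciprocal Rank) | Average over queries of 1/rank of the first relevant chunk (0 when none is retrieved) |
| NDCG@k | Discounted cumulative gain over the top-k chunks using graded relevance, normalized by the gain of the ideal ordering |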
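The definitions in the table can be made concrete with a small sketch. It assumes chunks are identified by string IDs and that relevance labels come from a human-labeled set; the function names and the example IDs are illustrative, not taken from any library. Context recall is approximated here at the chunk level, which is a simplification of "relevant information covered".

```python
import math
from typing import Dict, List, Sequence, Set, Tuple


def context_precision(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """Fraction of retrieved chunks that are relevant."""
    if not retrieved:
        return 0.0
    return sum(1 for c in retrieved if c in relevant) / len(retrieved)


def context_recall(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """Fraction of relevant chunks that were retrieved (chunk-level approximation)."""
    if not relevant:
        return 0.0
    return len(set(retrieved) & relevant) / len(relevant)


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1/rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: List[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average reciprocal rank over (retrieved, relevant) pairs, one per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def ndcg_at_k(retrieved: Sequence[str], gains: Dict[str, float], k: int) -> float:
    """DCG of the top-k chunks divided by the DCG of the ideal ordering."""
    dcg = sum(gains.get(c, 0.0) / math.log2(i + 2)
              for i, c in enumerate(retrieved[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical query: the retriever returned c3, c1, c7; labelers marked c1 and c2.
retrieved = ["c3", "c1", "c7"]
relevant = {"c1", "c2"}
graded = {"c1": 2.0, "c2": 1.0}  # graded relevance, higher is better

print(context_precision(retrieved, relevant))  # 1 of 3 retrieved chunks is relevant
print(context_recall(retrieved, relevant))     # 0.5
print(reciprocal_rank(retrieved, relevant))    # 0.5 (first relevant chunk at rank 2)
print(ndcg_at_k(retrieved, graded, k=3))
```

Precision and recall ignore ordering, while MRR and NDCG@k reward placing relevant chunks early, so reporting one metric from each group gives a fuller picture.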
Generation metrics
| Metric | Definition |
|---|---|
| Faithfulness | Is every claim in the answer supported by a retrieved chunk? |
| Answer Relevance | Does the answer address the question? |
| Answer Correctness | Is the answer factually right (vs. a gold reference)? |
| Hallucination Rate | Fraction of answers containing unsupported claims |
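One common way to turn faithfulness into a number is to split each answer into claims, mark each claim as supported or unsupported, and take the supported fraction. The sketch below assumes those per-claim verdicts already exist (from an LLM judge or a human annotator); the function names and the sample verdicts are illustrative.

```python
from typing import List


def faithfulness_score(claim_supported: List[bool]) -> float:
    """Fraction of an answer's claims that are supported by retrieved chunks."""
    if not claim_supported:
        # Assumption: an answer with no claims has nothing unsupported.
        return 1.0
    return sum(claim_supported) / len(claim_supported)


def hallucination_rate(answers: List[List[bool]]) -> float:
    """Fraction of answers that contain at least one unsupported claim."""
    if not answers:
        return 0.0
    flagged = sum(1 for verdicts in answers if not all(verdicts))
    return flagged / len(answers)


# Hypothetical verdicts for three answers (True = claim supported by a chunk).
verdicts = [[True, True], [True, False], [True]]

print([faithfulness_score(v) for v in verdicts])  # [1.0, 0.5, 1.0]
print(hallucination_rate(verdicts))               # one of three answers is flagged
```

Faithfulness is scored per answer and averaged, whereas hallucination rate is a property of the whole set, so a single badly unsupported answer moves the two numbers differently.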
RAGAS — automated LLM-as-judge
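from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, faithfulness

# One row per evaluated query; "contexts" holds the list of retrieved chunks for that row.
# Column names can differ between ragas versions, so check the docs for the installed one.
data = {
    "question": ["What is the refund policy?"],
    "answer": ["Refunds are processed within 7 business days."],
    "contexts": [["Our refund policy allows returns within 30 days..."]],
    "ground_truth": ["Refunds take up to 7 business days after approval."],
}

# evaluate() calls an LLM judge, so credentials for the configured provider must be set.
result = evaluate(
    Dataset.from_dict(data),
    metrics=[faithfulness, answer_relevancy, context_precision],
)
print(result)  # aggregate score per metric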
Online evaluation
Instrument production with:
- User thumbs-up / thumbs-down signals.
- Citation grounding: do the cited chunks actually contain the claims attributed to them?
- Latency and cost per query (tokens in + out).
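The signals above can be rolled up from per-query logs. This is a minimal sketch under stated assumptions: the `QueryLog` fields, the nearest-rank p95, and the per-1K-token prices are all hypothetical placeholders, not values from any provider or logging library.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class QueryLog:
    thumbs_up: Optional[bool]   # None when the user gave no feedback
    grounded_citations: int     # citations whose chunk contains the claim
    total_citations: int
    latency_ms: float
    tokens_in: int
    tokens_out: int


# Hypothetical prices in dollars per 1,000 tokens; replace with real rates.
PRICE_IN_PER_1K = 0.001
PRICE_OUT_PER_1K = 0.002


def summarize(logs: List[QueryLog]) -> Dict[str, Optional[float]]:
    """Aggregate feedback, citation grounding, latency, and cost over a batch of logs."""
    if not logs:
        raise ValueError("no logs to summarize")
    rated = [log.thumbs_up for log in logs if log.thumbs_up is not None]
    total_cites = sum(log.total_citations for log in logs)
    grounded = sum(log.grounded_citations for log in logs)
    latencies = sorted(log.latency_ms for log in logs)
    # Nearest-rank percentile: smallest value with at least 95% of samples at or below it.
    p95_index = max(0, math.ceil(0.95 * len(latencies)) - 1)
    cost = sum(
        log.tokens_in / 1000 * PRICE_IN_PER_1K
        + log.tokens_out / 1000 * PRICE_OUT_PER_1K
        for log in logs
    )
    return {
        "thumbs_up_rate": sum(rated) / len(rated) if rated else None,
        "citation_grounding_rate": grounded / total_cites if total_cites else None,
        "p95_latency_ms": latencies[p95_index],
        "mean_cost_per_query": cost / len(logs),
    }


logs = [
    QueryLog(True, 2, 2, 100.0, 1000, 500),
    QueryLog(False, 1, 2, 300.0, 2000, 1000),
    QueryLog(None, 0, 0, 200.0, 1000, 500),
]
print(summarize(logs))
```

The thumbs-up rate is computed only over queries that received feedback, since most users never rate an answer and counting silence as a negative would skew the signal.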
Ground truth construction
Bootstrap a labeled set by: (1) sampling 200–500 real queries, (2) having SMEs label the relevant chunks and write gold answers, and (3) using those labels to calibrate LLM-as-judge prompts until the judge's scores correlate with human scores (Pearson r above 0.85 is a reasonable bar).
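The calibration check in step (3) reduces to a correlation between two score lists. The sketch below implements Pearson r directly from its definition; the score lists are made-up illustrations, and the 0.85 threshold is the bar suggested above.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores on a 1-5 scale for the same five answers.
human_scores = [1, 2, 3, 4, 5]
judge_scores = [1, 2, 3, 5, 4]

r = pearson_r(human_scores, judge_scores)
print(f"Pearson r = {r:.2f}")
print("judge is calibrated" if r > 0.85 else "revise the judge prompt")
```

If the judge falls below the bar, inspecting the answers where the two scores disagree most is usually the quickest way to see which part of the judge prompt needs revising.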