datarekha
RAG April 18, 2026

RAG evaluation in production: Ragas, Phoenix, Braintrust

RAG evaluation is harder than LLM evaluation. Retrieval and generation fail for different reasons and need different metrics. More than two years after Ragas formalized the four-metric core, the tooling has split into three camps. Here's what each one actually buys you.

13 min read · by datarekha · rag-evals, ragas, phoenix, braintrust

There is a recurring conversation in AI engineering circles, usually held around 6pm at a meetup, that goes like this. “We shipped our RAG system. Everyone loves the demos. We tried to write evals and realized we have no idea what to measure. The retriever sometimes returns garbage, the LLM sometimes hallucinates from good retrievals, and the user feedback is too sparse to be a signal. How do you actually evaluate this thing?”

The answer is more boring than people want. RAG evaluation is not a single number; it’s a set of metrics, applied independently to retrieval and generation, computed against an eval set that someone on your team had to build by hand. The tools are useful, but they are mostly bookkeeping over a process that requires actual eval-set engineering. The teams that ship reliable RAG are the ones who internalize this and budget for the unglamorous data-labeling work.

This post is about the structure of RAG evaluation in 2026, which metrics actually carry the signal, and how the three dominant tools — Ragas, Phoenix, and Braintrust — fit into a production eval pipeline.

Why LLM evals don’t transfer

The naive instinct for evaluating a RAG system is to evaluate it the same way you evaluate a standalone LLM: write 100 questions, write 100 expected answers, grade the system’s answers against them. This works approximately not at all for RAG, for two reasons.

First, RAG has two failure modes, and the obvious LLM metric — “is the final answer right?” — collapses them. A wrong answer might mean the retriever pulled the wrong chunks (in which case the LLM did its job correctly with bad evidence), or the retriever pulled the right chunks and the LLM hallucinated anyway. These are different bugs with different fixes. You need to know which.

Second, RAG outputs are typically grounded in specific evidence, which means the evaluation is not just “is the answer right” but “is the answer right and supported by the retrieved context.” Even a correct answer that doesn’t follow from the retrieved chunks is a bug — it’s the model substituting its training data for retrieval, which means RAG isn’t doing the job you bought it for.

The fix, codified in Ragas back in 2023 and adopted essentially everywhere since, is to split the evaluation into two stages with disjoint metrics:

RETRIEVAL METRICS (judges the retriever)

  • Context Precision: are retrieved chunks ranked right?
  • Context Recall: does retrieval cover the answer?
  • MRR / Hit Rate: classical IR metrics

GENERATION METRICS (judges the generation)

  • Faithfulness: is the answer supported by context?
  • Answer Relevancy: does the answer match the question?
  • Groundedness: claims trace to specific chunks

The split that defined modern RAG evaluation. Retrieval metrics tell you whether the right evidence reached the LLM; generation metrics tell you whether the LLM did the right thing with it. Most production debugging starts with figuring out which side of the line is broken.
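
As a rough illustration of that first debugging step, here is a minimal triage sketch. The `EvalScores` fields, the `triage` helper, and the 0.8 threshold are all assumptions made for the example, not part of any tool's API. The only idea it encodes is the ordering: check the retrieval side first, because generation scores mean little when the evidence never arrived.

```python
from dataclasses import dataclass


@dataclass
class EvalScores:
    """Per-question scores in [0, 1]; field names are illustrative."""
    context_recall: float
    faithfulness: float
    answer_relevancy: float


def triage(scores: EvalScores, threshold: float = 0.8) -> str:
    """Say which side of the retrieval/generation split to look at first."""
    if scores.context_recall < threshold:
        # The evidence never reached the LLM, so generation scores are not trustworthy yet.
        return "retrieval"
    if scores.faithfulness < threshold or scores.answer_relevancy < threshold:
        # Evidence was present, but the answer is unsupported or off-topic.
        return "generation"
    return "ok"
```

Run per question over a failing batch, a tally of the returned labels is usually enough to tell you which team owns the regression.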

The metrics that actually carry signal

There are roughly a dozen RAG metrics in the literature; in production, four do most of the work.

Context Precision — is retrieval ranked correctly?

Of the top-K chunks returned by the retriever, how many are actually relevant to the query, and are the relevant ones near the top? This is classic IR precision-at-K made rank-aware (closer to average precision), applied per query. When this number is low, your reranker isn’t doing its job (or you don’t have one).
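
To make the "near the top" part concrete, here is a small sketch of a rank-aware precision score in the average-precision style. It assumes the per-chunk relevance labels already exist (from a human or an LLM judge); the function name and the input format are illustrative, and the exact formula a given library uses may differ.

```python
def context_precision(relevance: list[bool]) -> float:
    """Rank-aware precision over a ranked list of retrieved chunks.

    relevance[i] is True when the chunk at rank i+1 is relevant to the query.
    """
    hits = 0
    weighted = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            weighted += hits / rank  # precision@rank, counted only at relevant ranks
    return weighted / hits if hits else 0.0
```

With two relevant chunks out of four, `[True, True, False, False]` scores 1.0 while `[False, False, True, True]` scores about 0.42. Plain precision-at-4 would report 0.5 for both, which is why the rank-aware form is the one that tells you whether a reranker is earning its keep.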

Context Recall — did retrieval find the evidence at all?

For each ground-truth answer, what fraction of the supporting information is present in the retrieved chunks? This is the metric that catches the “retriever missed the right document entirely” failure mode. When recall is low, your chunking, embedding, or hybrid-search configuration is the problem.

The catch with context recall is that it requires ground-truth labels — for each eval question, you need to know which corpus chunks contain the answer. This is the most expensive label to collect and the main reason production teams under-invest in context recall. The reward, when you do invest, is the clearest possible signal about retriever quality.
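
Once those labels exist, the computation itself is tiny. The sketch below is the chunk-ID-based variant, assuming each eval question carries a set of labeled supporting chunk IDs; the function name and ID scheme are made up for the example. The other common option is an LLM-judged variant that attributes sentences of the reference answer to the retrieved context, which is not shown here.

```python
def context_recall(gold_chunk_ids: set[str], retrieved_chunk_ids: list[str]) -> float:
    """Fraction of labeled supporting chunks that appear in the retrieved list."""
    if not gold_chunk_ids:
        raise ValueError("context recall needs at least one labeled supporting chunk")
    found = gold_chunk_ids & set(retrieved_chunk_ids)
    return len(found) / len(gold_chunk_ids)
```

For example, `context_recall({"a", "b"}, ["a", "x", "y"])` gives 0.5: one of the two supporting chunks made it into the context.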

Faithfulness — does the answer follow from the context?

Decompose the model’s answer into individual claims. For each claim, check whether the retrieved chunks support it. The fraction of supported claims is faithfulness. When this is low, the LLM is hallucinating or pulling from training data. This is the metric you want gating production deployments. Read it alongside context recall: low faithfulness with high recall is a hallucination problem; high faithfulness with low recall is a coverage problem.
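
The scoring step can be sketched in a few lines once the claims have been extracted. Everything here is an assumption for illustration: the `Judge` signature, the convention of scoring an empty answer as 0.0, and especially `substring_judge`, which is a toy stand-in for the LLM judge you would actually call.

```python
from typing import Callable

# Hypothetical judge signature: (claim, context) -> True if the context supports the claim.
Judge = Callable[[str, str], bool]


def faithfulness(claims: list[str], chunks: list[str], supports: Judge) -> float:
    """Fraction of answer claims supported by the retrieved context."""
    if not claims:
        return 0.0  # convention chosen here: an empty answer makes no supported claims
    context = "\n".join(chunks)
    supported = sum(1 for claim in claims if supports(claim, context))
    return supported / len(claims)


def substring_judge(claim: str, context: str) -> bool:
    """Toy stand-in for an LLM judge: case-insensitive substring match."""
    return claim.lower() in context.lower()
```

Keeping the judge as a plain callable means the same function works with a cheap heuristic in unit tests and a real LLM call in the eval run.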

Answer Relevancy — does the answer match the question?

The model’s answer might be 100% faithful to the retrieved chunks but completely off-topic relative to the user’s question. Answer relevancy measures the alignment between question and answer, independent of evidence. It’s the easiest metric to compute (no ground truth needed, just an LLM judge), and the fastest to get a directional signal from.
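
One common reference-free way to score this is to have an LLM regenerate the questions an answer would plausibly be responding to, then compare them to the real question by embedding similarity. The sketch below follows that idea under loud assumptions: `questions_from_answer` is a hypothetical callable wrapping your LLM, and `bow_embed` is a toy bag-of-words embedding standing in for a real embedding model.

```python
import math
from collections import Counter
from typing import Callable


def bow_embed(text: str) -> Counter:
    """Toy bag-of-words embedding; swap in a real embedding model in practice."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def answer_relevancy(
    question: str,
    answer: str,
    questions_from_answer: Callable[[str], list[str]],
    embed: Callable[[str], Counter] = bow_embed,
) -> float:
    """Mean similarity between the user question and questions regenerated from the answer."""
    regenerated = questions_from_answer(answer)
    if not regenerated:
        return 0.0
    q_vec = embed(question)
    return sum(cosine(q_vec, embed(q)) for q in regenerated) / len(regenerated)
```

An answer that is faithful but off-topic regenerates questions that look nothing like the user's, so the score drops even though no claim is unsupported.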

These four — context precision, context recall, faithfulness, answer relevancy — are the irreducible core. The Ragas docs list more, the Phoenix evals page adds observability-focused ones, the Confident AI RAG metrics guide catalogs the entire space — but the four above are what every production team I’ve seen actually tracks.

The three tools, what they buy you

Three frameworks dominate the production conversation. They disagree less about what to measure than about how to package the measurement work.

Ragas — the open-source reference

Ragas was the first to crystallize the “reference-free RAG evaluation” pattern: compute the four core metrics with LLM-as-judge, no ground truth required for most of them. It’s open source, Python-native, integrates cleanly with LangChain and LlamaIndex, and is what most teams reach for when they start writing RAG evals.

What you get: the metric library, synthetic test-set generation from your corpus, integration with LangSmith for trace-level evaluation. What you don’t get: a UI, dataset management, CI integration, or anything resembling a SaaS workflow. Ragas assumes you’ll build those parts yourself.

Phoenix — observability-first

Arize Phoenix takes a different angle: start with OpenTelemetry-based tracing of your RAG system in production, layer evaluations on top. Phoenix natively integrates Ragas metrics, but the value-add is the observability — visualizations of retrieval distributions, clustering of failure modes, drift detection on production traffic.

Phoenix is what you reach for when you’re in the “we don’t know what to evaluate yet” phase. The clustering and exploration tools are genuinely useful for finding clusters of failures you didn’t know you had. It’s open source with a hosted Arize tier, which makes the build-or-buy decision easier than the pure-SaaS alternatives.

Braintrust — the commercial polish

Braintrust is the most polished commercial RAG evaluation product. Dataset management is first-class, CI integration is opinionated and works, the UI for comparing experiments and gating deployments is the best in the category. It comes with the most RAG scorers out of the box and integrates with everyone’s LLM provider.

The trade-off is cost and lock-in. Braintrust is SaaS, priced accordingly, and gets expensive at high evaluation volumes. Braintrust’s own competitive comparison positions it for teams that value developer experience over infrastructure control — which is honest if not flattering.

THE THREE TOOLS, POSITIONED

  • Ragas (open source): metric library, synthetic Q-A gen, script-driven. Best if you have engineering bandwidth to wire it up.
  • Phoenix (open core + Arize SaaS): OTel observability, failure clustering, prod drift detection. Best when you’re still discovering what to measure.
  • Braintrust (commercial): dataset-first UX, CI gating, experiment compare. Best when you want a turnkey workflow and have a budget.

Three tools, three positions on the same axis: how much of the eval infrastructure you want to operate yourself. The right pick is more about your team’s bandwidth and budget than about which tool is technically superior.

Build vs buy, honestly

The honest decision tree, derived from watching teams pick:

  • Start with Ragas if you have engineers who can write Python glue and ship infrastructure. The metric implementations are open and inspectable, you control the costs, and you can wire the outputs into whatever observability stack you already run.
  • Reach for Phoenix if you’re in the discovery phase, your production traces are messy, and you need to find the failure modes before you can measure them. The clustering and trace exploration capabilities are genuinely differentiated.
  • Pay for Braintrust if your eval pipeline needs to gate CI/CD, your team values the polished developer experience over the cost, and you don’t want to operate eval infrastructure.

Plenty of teams use more than one. A common pattern: Phoenix for production observability and incident-time exploration, Ragas (or Ragas inside LangSmith) for the offline eval pipeline. Or Braintrust for the gated CI evaluations and Phoenix for the unstructured exploration.

The eval-set bootstrap (the actual hard part)

The tool you pick matters less than the eval set you build, and the eval set is where most teams under-invest. The pattern that has emerged as the production default is a three-stage bootstrap:

Stage 1 — synthetic Q-A generation. For each document (or chunk) in your corpus, prompt an LLM to generate plausible questions a user might ask, paired with the ground-truth chunk that answers each. Ragas has a synthetic test-set generation module that does this well, but the technique is straightforward to implement directly. The result is a “silver” eval set: thousands of synthetic Q-A pairs at near-zero cost.
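
A minimal sketch of the direct implementation, to show how little machinery Stage 1 needs. The `ask_llm` callable, the prompt wording, and the `SilverExample` shape are all hypothetical; this is not the Ragas generator, just the bare technique of generating questions per chunk and remembering which chunk each came from.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SilverExample:
    question: str
    gold_chunk_id: str  # the chunk the question was generated from


def build_silver_set(
    chunks: dict[str, str],
    ask_llm: Callable[[str], str],
    questions_per_chunk: int = 2,
) -> list[SilverExample]:
    """Generate questions per chunk and pair each with its source chunk ID."""
    examples = []
    for chunk_id, text in chunks.items():
        prompt = (
            f"Write {questions_per_chunk} questions a user might ask that are "
            f"answered by the passage below. One question per line.\n\n{text}"
        )
        # Assumes the model returns one question per line; blank lines are skipped.
        for line in ask_llm(prompt).splitlines():
            question = line.strip()
            if question:
                examples.append(SilverExample(question, chunk_id))
    return examples
```

Because every example keeps its `gold_chunk_id`, the silver set doubles as the label source for context recall, which is the expensive label to collect by hand.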

Stage 2 — human review. Subject-matter experts spot-check the synthetic pairs. Throw out the questions that are nonsensical, ambiguous, or already obvious. Fix the ground-truth chunks where the LLM picked the wrong one. The output is a “golden” eval set — typically 200-500 examples, hand-curated. This is the labor-intensive part.

Stage 3 — live-traffic augmentation. As real users ask questions, sample from the production traffic, send the interesting ones through the same review process, and grow the eval set over time. Production queries reveal the long tail of weird inputs that synthetic generation will never cover.
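
What counts as "interesting" is a judgment call. One simple policy, sketched below as an assumption rather than a recommendation from any of the tools, is to spend half the review budget on the lowest-scoring traces and the other half on a uniform random sample, so reviewers see both known trouble and the unbiased long tail. The trace format (a dict with a "faithfulness" score) and the 0.7 cutoff are illustrative.

```python
import random


def pick_for_review(
    traces: list[dict], budget: int, low_score: float = 0.7, seed: int = 0
) -> list[dict]:
    """Choose up to `budget` production traces to send to human review.

    Each trace is assumed to be a dict with a "faithfulness" score in [0, 1].
    Half the budget goes to the lowest-scoring traces, the rest to a uniform sample.
    """
    rng = random.Random(seed)
    suspicious = sorted(
        (t for t in traces if t["faithfulness"] < low_score),
        key=lambda t: t["faithfulness"],
    )[: budget // 2]
    already_picked = {id(t) for t in suspicious}
    rest = [t for t in traces if id(t) not in already_picked]
    uniform = rng.sample(rest, min(budget - len(suspicious), len(rest)))
    return suspicious + uniform
```

The uniform half matters more than it looks: if you only ever review low scorers, the eval set drifts toward what your current judge already flags.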

Dylan Castillo’s guide on synthetic eval bootstrapping walks through this pattern in detail; it’s a tight summary of what the production teams I’ve talked to have converged on.

EVAL-SET BOOTSTRAP (silver → gold → live)

  • Stage 1: silver. LLM-generated Q-A from your corpus, ~thousands of pairs.
  • Stage 2: golden. SME review + curation, ~200-500 examples.
  • Stage 3: live. Sampled prod traffic + ongoing curation, grows over time.

The golden set is what gates deployments; the live set is what tells you the golden set is going stale.

The bootstrap pattern that has converged across the production RAG teams I’ve seen. The silver set lets you iterate quickly, the golden set is the one you gate releases on, the live sampling keeps both honest as your traffic evolves.

A working gate for deployments

A useful production pattern: tie a small set of gates to the golden eval set, and refuse to ship if any gate regresses. The default gates I’ve seen most often:

  • Context recall ≥ 0.9 on the golden set. Averaged over the evaluated questions, retrieval must surface at least 90% of the supporting evidence.
  • Faithfulness ≥ 0.95. The model must not invent claims unsupported by the context.
  • Answer relevancy ≥ 0.85. The answer must address the question.
  • Latency p95 within 20% of baseline. Rerun the eval set to catch performance regressions, not just quality regressions.

The specific thresholds are workload-dependent. The principle is that without numeric gates, every release becomes a vibes-based decision and quality regresses silently. With gates, a regression on context recall forces a conversation about why before the release ships.
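
A gate like this is a few lines of code once the metrics are computed. The sketch below hard-codes the example thresholds from the list above; the metric key names, the `failed_gates` helper, and the reading of "within 20% of baseline" as "at most 20% slower" are assumptions for illustration.

```python
GATES = {
    "context_recall": 0.90,
    "faithfulness": 0.95,
    "answer_relevancy": 0.85,
}
MAX_P95_LATENCY_REGRESSION = 0.20  # p95 may be at most 20% above baseline


def failed_gates(metrics: dict[str, float], p95_ms: float, baseline_p95_ms: float) -> list[str]:
    """Return the names of gates that fail; an empty list means the release may ship."""
    # A missing metric counts as a failure rather than a silent pass.
    failures = [name for name, floor in GATES.items() if metrics.get(name, 0.0) < floor]
    if p95_ms > baseline_p95_ms * (1 + MAX_P95_LATENCY_REGRESSION):
        failures.append("latency_p95")
    return failures
```

In CI, the usual wiring is to print the returned list and exit non-zero when it is non-empty, so the failing gate's name shows up in the build log and starts the "why" conversation.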

What to take away

Three lines, from teams who’ve shipped real RAG evaluation:

  • Evaluate retrieval and generation separately. The metrics exist for a reason — context precision/recall for the retriever, faithfulness/answer relevancy for the generator. If you only measure end-to-end accuracy, you can’t tell which side to fix when it regresses.
  • The eval set is the work, not the tool. Pick Ragas, Phoenix, or Braintrust based on your team’s bandwidth and budget — but the harder, less glamorous work is building the golden eval set through synthetic generation plus human review. Most of the variance in production RAG quality across teams traces to the quality of their eval set, not the framework they evaluate it with.
  • Gate deployments on numeric thresholds. Without gates, quality drifts silently. Faithfulness ≥ 0.95 is the gate that blocks hallucinations from shipping; context recall ≥ 0.9 is the gate that blocks retrieval regressions.

The 2026 take is the same as the 2024 take, just louder: RAG evaluation is not a side-project, it’s the production discipline that separates teams whose RAG systems get better over time from teams whose RAG systems stay impressive in demos and break in production. Spend the eval budget on the dataset. The tool will matter less than you think.


Further reading: Ragas documentation, Arize Phoenix evals docs, the Braintrust RAG tools comparison, the Confident AI RAG metrics guide, and Dylan Castillo’s synthetic eval data post.
