datarekha
RAG May 10, 2026

Contextual Retrieval, two years on

Anthropic's September 2024 trick — prepend a sentence of context to each chunk before embedding — quietly became the new RAG default. Here's why a 49% retrieval-error reduction held up, what the numbers actually say, and when you still shouldn't bother.

11 min read · by datarekha · ragretrievalanthropicembeddingsbm25

When Anthropic published Contextual Retrieval in September 2024, the technique was almost embarrassingly simple. Before you embed a chunk, you ask the model to write a one-paragraph summary of where this chunk sits in the larger document — what report it’s from, what section, what entity is being discussed — and you prepend that paragraph to the chunk. Then you embed the combined text. Then you index it. That’s the whole idea.

The reason it caught on isn’t the cleverness; it’s the numbers. 35% fewer retrieval failures with embeddings alone, 49% when combined with BM25, and 67% when you also add a reranker. Two years later those numbers have held up across enough independent replications — Vespa, LlamaIndex, every RAG-shaped startup’s internal benchmark — that “did you try contextual retrieval?” has become the first question you ask when someone’s RAG system is misbehaving.

This post is the field report. What the original paper actually says, what’s held up, where the technique breaks, and what it costs at scale.

The problem it solves

RAG’s central failure mode is that you chunk a document, and the chunks lose the context that made them meaningful. The original blog post used the example of a financial filing: a chunk that reads “The company’s revenue grew by 3% over the previous quarter” is essentially useless out of context. Which company? Which quarter? When the user asks “what was ACME Corp’s Q2 2023 revenue growth”, the embedding for that chunk won’t be near the embedding for the query, because the chunk doesn’t mention ACME or Q2 2023.

You can throw bigger chunks at the problem — and most teams do, badly — but that hurts retrieval precision: now the embedding has to encode fifteen different facts, and the relevance signal smears across all of them. The honest fix is to put the context back into the chunk before you embed.

PLAIN CHUNKINGCONTEXTUAL CHUNKINGchunk_47:“The company’s revenuegrew by 3% over theprevious quarter.”query: “ACME Q2 2023revenue growth”match score: 0.31 (miss)chunk_47 + context:“From ACME Corp 10-Q,Q2 2023 financials section.""The company’s revenuegrew by 3% over Q1 2023.”same querymatch score: 0.84 (hit)
The mechanism in one picture. A chunk-specific 1-2 sentence preamble re-anchors the embedding back to the entities and timeframe that named-entity queries actually mention.

What the original paper measured

Anthropic ran an evaluation across nine corpora — code documentation, fiction, science papers, news — using top@20 retrieval as the metric. Their reported lift is the headline that everyone quotes:

  • Contextual Embeddings alone: 35% reduction in retrieval failures (5.7% → 3.7%).
  • Contextual Embeddings + Contextual BM25: 49% reduction (5.7% → 2.9%).
  • Contextual Embeddings + Contextual BM25 + reranker: 67% reduction (5.7% → 1.9%).

The combination matters. Pure dense embeddings alone don’t capture exact strings — codes, identifiers, names — which is exactly the territory BM25 owns. The two retrievers fail in complementary ways, and the context preamble lifts both. Then a cross-encoder reranker (Cohere Rerank or Voyage Rerank in the original paper) picks the best 20 out of the top 150 from the hybrid retrieval. The 67% number is the stack, not any single layer.

The second number worth quoting: the per-chunk context generation, when done with prompt caching, costs roughly $1.02 per million document tokens on Haiku. For a corpus of 100M tokens — a serious enterprise documentation set — that’s about $102 to index. Without caching it would be 10× to 100× higher because you’d be re-reading the full document for every chunk.

Why prompt caching is the unsung hero

The prompt the original blog used looks roughly like this:

<document>
{{WHOLE_DOCUMENT}}
</document>

Here is the chunk we want to situate within the whole document:
<chunk>
{{CHUNK_CONTENT}}
</chunk>

Please give a short succinct context to situate this chunk within the
overall document for the purposes of improving search retrieval. Answer
only with the succinct context and nothing else.

The whole document is the prefix that you reuse for every chunk in that document — and Anthropic’s prompt caching can cache prefixes up to 200K tokens, with a 90% discount on cached reads. For a 50K-token document with 100 chunks, you pay full-price for the first chunk and 10% for the next 99. That’s the difference between “this is a research-paper trick” and “this is cheap enough that you should just do it.”

Two years on, every major lab supports prefix caching with similar economics — OpenAI auto-caches, Gemini has explicit caching, DeepSeek’s caching is even more aggressive. The technique is portable. The operating budget is roughly:

  • One-time index cost: $1-2 per million document tokens.
  • Storage cost: ~30% more vector storage (because each chunk is 50-100 tokens longer).
  • Query-time cost: zero overhead. Retrieval is the same as plain RAG.

That last point is the killer. Every other “improve RAG” trick — query rewriting, hypothetical document embeddings, multi-hop retrieval — adds query-time latency. Contextual Retrieval moves all the work to the indexing step, which you do once. Production teams love this trade because their latency budget is at the query, not the index.

When it actually wins, and when it doesn’t

Replicators have found three patterns hold consistently:

1. It wins big on documents with anaphora

Anaphora is the linguistic word for “referring backwards” — pronouns, “the company,” “this approach.” The more your corpus relies on references to entities introduced earlier in the document, the more Contextual Retrieval helps. SEC filings, legal contracts, scientific papers, multi-page product manuals: all heavily anaphoric, all show the full 35-49% lift.

2. It barely moves the needle on FAQ-style corpora

If every chunk in your corpus is already a self-contained Q&A pair, or a short standalone snippet (think customer-support knowledge bases written in “one-question, one-answer” format), Contextual Retrieval adds ~5% lift at most. The information was already in the chunk; the preamble just restates it.

The original paper found BM25 + dense embeddings + reranker improved from 5.7% failure to 2.9% (with context) versus 4.3% (with hybrid but no context). That’s a 33% relative reduction on top of hybrid search, which itself was a 25% reduction on top of dense-only. The gains compound, but only when you have the rest of the stack right. A team that doesn’t have BM25 + reranker yet should add those first before adding context generation.

The replications that mattered

Three teams pushed numbers publicly that helped the technique stick:

  • Vespa’s blog reimplemented the technique on their FinanceBench evaluation set and reported ~40% retrieval-failure reduction — close to Anthropic’s headline number on a different corpus. They also packaged it as a Vespa template.
  • LlamaIndex shipped a Contextual Retrieval pack within weeks, with the cost-per-token math baked into the docs. Their default uses Haiku for context generation and Voyage for reranking.
  • DeepLearning.AI’s short course on the technique ran the same evals against open-source corpora (Wikipedia subsets, arXiv abstracts) and reproduced the lift within ~5 percentage points across the board.

What’s notable is that no major published result has failed to reproduce the headline lift on anaphoric corpora. That’s rare for a 2024-vintage RAG technique — most “we got 30% better” claims from that era did not hold up.

Where this fits in the 2026 stack

A modern production RAG stack for an enterprise knowledge base looks roughly like this, with Contextual Retrieval sitting between chunking and indexing:

INDEXING TIMEdocumentsraw corpuschunker400-800 tokenscontext generatorHaiku + prompt cacheembed + BM25dual indexQUERY TIMEuser queryhybrid retrievalBM25 + dense, top 150reranker→ top 20LLM answer
The 2026 default stack. Contextual Retrieval is a one-time indexing tax that lifts every downstream layer. The query path is unchanged from a plain hybrid-search RAG.

What to take away

If you have a RAG system over anaphoric documents — financial filings, contracts, scientific reports, multi-section product docs — and you haven’t added a context preamble to your chunks yet, this is the highest-leverage improvement you can make. The math is unambiguous:

  • One-time cost of about $1 per million document tokens with caching.
  • 35-67% reduction in retrieval failures depending on what else is in your stack.
  • Zero query-time latency overhead.
  • Compounds with BM25, hybrid search, and rerankers — it doesn’t replace them.

The honest counter-case: if your corpus is genuinely already self-contained per chunk — a FAQ, a list of independent product descriptions, atomic Q&A pairs — Contextual Retrieval is mostly ceremony. Spend the engineering time on a reranker instead.

But for everyone else, the trick that looked too simple to be true in 2024 is the new default in 2026, and the burden of proof has reversed: if you’re not doing Contextual Retrieval, you should have a written reason why.


Further reading: Anthropic’s original Contextual Retrieval announcement (September 2024), Vespa’s reproduction on FinanceBench, and the LlamaIndex Contextual Retrieval cookbook. For the broader hybrid-search context, see Microsoft’s RAG benchmark paper.

Skip to content