datarekha
RAG May 12, 2026

Hybrid search, when one retriever isn't enough

Pure vector search misses exact identifiers. Pure BM25 misses paraphrases. Every serious production RAG stack — Notion, Perplexity, Vespa, Glean — runs both and fuses the results. Here's the math, the latency budget, and the production knobs that matter.

12 min read · by datarekha · rag, retrieval, bm25, vector-search, rrf

If you’ve spent any time on AI Twitter, you’ve seen the claim that “vector search has solved retrieval.” Anyone who has actually shipped a RAG system to real users knows this is not true. The system that gleefully retrieves the right paragraph for “how does a transformer learn long-range dependencies” will fail just as gleefully when the user types “what does PSA-2024-117 mean.” Embeddings smooth meaning; they also smooth away the exact identifier you need.

The fix that production teams converged on, almost unanimously, is hybrid search: run a lexical retriever (BM25 or a variant) and a dense vector retriever in parallel, then fuse their ranked lists. Notion runs it. Perplexity runs it. Vespa is built around it. Glean charges enterprise prices for a stack whose central trick is doing this well.

This post is a tour of why hybrid won, the formula everyone uses, and the production knobs that separate a hybrid setup that helps from one that just adds latency.

Where each retriever fails alone

Both BM25 and dense embeddings have well-understood failure modes, and the failures don’t overlap — which is exactly why fusing them works.

Dense vector search fails on exact strings

A dense embedding maps “PSA-2024-117” and “the security advisory from last March” into nearby vectors if and only if the embedding model saw enough training data to associate them. For brand-new identifiers, proper nouns, error codes, SKUs, version strings, or any out-of-distribution token, the model has no idea. It will retrieve documents that talk about security advisories without ever mentioning PSA-2024-117 by name.

This is not a tunable failure. Larger models help marginally. Training domain-specific embeddings helps more, but you have to retrain whenever your identifiers change. BM25, which scores by exact-token overlap, handles this case trivially — it doesn’t care what the identifier means, only that it matches.

BM25 fails on paraphrases

The other direction: a user asks “how do I cancel my subscription” and your docs use the word “terminate.” BM25 sees zero overlap, scores zero. Dense embeddings handle this trivially because “cancel” and “terminate” sit close in vector space.
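Both failure modes are easy to reproduce with a toy scorer. The following is a minimal sketch of BM25, not any particular engine's implementation: the whitespace tokenizer, the three sample documents, and the parameters `k1 = 1.5` and `b = 0.75` are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Naive whitespace tokenizer (an assumption; real engines use analyzers).
    return text.lower().split()


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document against the query with a basic BM25 formula."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in term_freq:
                continue  # no exact token overlap, no contribution
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / (term_freq[term] + length_norm)
        scores.append(score)
    return scores


# Hypothetical mini-corpus.
docs = [
    "security advisory PSA-2024-117 affects the login service",
    "general guidance on handling security advisories",
    "to terminate your plan open billing settings",
]

identifier_scores = bm25_scores("what does PSA-2024-117 mean", docs)
paraphrase_scores = bm25_scores("how do I cancel my subscription", docs)
print(identifier_scores)  # only the first document scores above zero
print(paraphrase_scores)  # every document scores zero
```

The identifier query lands on the one document that contains the exact token, while the "cancel" query finds nothing, even though the third document is the right answer.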

The classic chart from the BEIR benchmark showed this asymmetry clearly: BM25 dominated on benchmarks with rare proper nouns and codes (TREC-COVID, ArguAna), dense embeddings dominated on benchmarks with paraphrased questions (NQ, HotpotQA). The zero-shot dense retrievers in 2021 actually lost to BM25 on average across 18 benchmarks. The 2024-era models tipped the balance the other way for most corpora, but the asymmetry remained: each retriever still has corpora where it loses badly to the other.

[Figure: two panels. BM25 strength, exact tokens (PSA-2024-117; SKU JK-9931-A; error E_CONFLICT_433; “Section 230(c)(1)”; Maria Patel-Okonkwo): dense embedding is confused, BM25 is a trivial match. Dense strength, paraphrased meaning (cancel ↔ terminate; fix ↔ resolve ↔ patch; fast ↔ low-latency; “how does X work”; cross-lingual queries): BM25 scores zero overlap, dense handles it trivially.]
The two retrievers fail in complementary ways. Hybrid search exists because there is no single corpus where one of them is sufficient.

Reciprocal Rank Fusion, the formula everyone uses

Once you accept that you need both retrievers, you have to merge their ranked lists. You could try to calibrate the raw scores into a common range, but BM25 scores and cosine similarities live on incomparable scales. The standard answer is Reciprocal Rank Fusion (RRF), which ignores the scores entirely and uses only the ranks.

For a document d that appears at rank r_d^(i) in retriever i’s ranked list, RRF computes:

RRF(d) = Σ_i  1 / (k + r_d^(i))

k is a small constant; the original paper used k = 60 and most production systems still do. A document at rank 1 contributes 1/61 ≈ 0.0164; rank 10 contributes 1/70 ≈ 0.0143; rank 100 contributes 1/160 ≈ 0.00625. The decay is gentle, which is the point — a document at rank 5 in BM25 and rank 8 in dense will outrank one that’s #1 in only one of the lists.
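The formula fits in a few lines of code. This is a minimal sketch, assuming each retriever returns an ordered list of document IDs; the function name `rrf_fuse` and the sample IDs are hypothetical.

```python
from collections import defaultdict


def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists of document IDs with Reciprocal Rank Fusion."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        # Ranks are 1-based, matching the formula above.
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# "both" sits at rank 5 in BM25 and rank 8 in dense; "solo" is #1 in BM25 only.
bm25_ranking = ["solo", "b2", "b3", "b4", "both", "b6"]
dense_ranking = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "both"]

fused = rrf_fuse([bm25_ranking, dense_ranking])
print(fused[0])  # "both", with score 1/65 + 1/68
print(dict(fused)["solo"])  # 1/61
```

The document that both retrievers agree on, even at middling ranks, ends up ahead of the document that only one retriever loved.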

The reason RRF won over linear-weighted alternatives is robustness. You don’t have to tune the weight between retrievers per corpus, because the formula uses ranks not scores. It’s nearly hyperparameter-free — just k, which barely matters in the range 30-100. Vespa, Elasticsearch, OpenSearch, Pinecone, and Weaviate all ship RRF as their default hybrid fusion. The Microsoft Azure AI Search RRF post walks through the implementation; the formula they ship is the same formula from 2009.
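To get a feel for how little k matters in that range, here is a small sketch comparing two hypothetical documents: X, ranked 1st by one retriever and 100th by the other, and Y, ranked 10th by both. It is one toy pair, not a proof, and other rank pairs can behave differently.

```python
def rrf_score(ranks: tuple[int, ...], k: int) -> float:
    """RRF score for a document given its 1-based rank in each list."""
    return sum(1.0 / (k + r) for r in ranks)


x_ranks = (1, 100)   # loved by one retriever, nearly ignored by the other
y_ranks = (10, 10)   # solid in both

winners = {}
for k in (1, 30, 60, 100):
    x, y = rrf_score(x_ranks, k), rrf_score(y_ranks, k)
    winners[k] = "Y" if y > x else "X"
    print(f"k={k:>3}  X={x:.5f}  Y={y:.5f}  winner={winners[k]}")
```

For this pair the ordering is the same at k = 30, 60, and 100 (Y wins); it only flips when k is pushed down to something like 1, where the top rank dominates everything else.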

[Figure: user query (e.g. “PSA-2024-117”) → BM25 retriever (inverted index, top 100, 8ms p95) and dense retriever (HNSW or IVF-PQ, top 100, 15ms p95), called in parallel (≤ 15ms wall-clock) → RRF fusion (Σ 1/(k + rank), k = 60, 2ms, rank-only merge) → top 20 → reranker.]
Production hybrid retrieval. Both retrievers fire concurrently; total latency is governed by the slower one (usually dense), not the sum. RRF is so cheap it’s free.

The latency budget at scale

Here’s where production engineering enters the picture. A real-time RAG endpoint targets something like a 200ms p95 server-side budget end-to-end. Roughly:

  • 30-80ms for retrieval (BM25 + dense in parallel + RRF).
  • 30-50ms for reranking the top 100-150 down to top 20.
  • 80-150ms for the LLM to start streaming its first token.

The retrieval slice is non-negotiable because the rest of the budget can’t move much. Generation is governed by the model provider. Reranking is governed by GPU economics. Retrieval is the only piece you fully control.
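A quick sanity check on those ranges, as a sketch: the stage names and the helper `retrieval_allowance` are hypothetical, and the numbers are simply the ones listed above.

```python
BUDGET_MS = 200

# (low, high) estimates per stage, in milliseconds.
STAGES_MS = {
    "retrieval": (30, 80),
    "rerank": (30, 50),
    "first_token": (80, 150),
}

best_case = sum(low for low, _ in STAGES_MS.values())
worst_case = sum(high for _, high in STAGES_MS.values())
print(best_case, worst_case)  # 140 280


def retrieval_allowance(rerank_ms: int, first_token_ms: int, budget_ms: int = BUDGET_MS) -> int:
    """Milliseconds left for retrieval once the other stages are paid for."""
    return budget_ms - rerank_ms - first_token_ms


print(retrieval_allowance(40, 100))  # 60
print(retrieval_allowance(50, 150))  # 0
```

The stages cannot all sit at their upper ends and still fit in 200ms. If reranking and generation both have a bad day, retrieval has nothing left to spend, which is why it pays to keep it near the bottom of its range.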

Three production knobs determine whether you hit your 30-80ms target:

1. Run both retrievers concurrently, not serially

This sounds obvious, but plenty of LangChain-vintage code runs them serially because the wrapper API encouraged it. Fire both queries concurrently, as goroutines or asyncio tasks. Total latency is then max(BM25, dense), not the sum.
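With asyncio the concurrent version is barely longer than the serial one. This is a sketch under assumptions: `fake_search` is a hypothetical stand-in that sleeps instead of calling a real index, and the default delays echo the p95 figures quoted earlier in this post.

```python
import asyncio
import time


async def fake_search(name: str, query: str, delay_s: float, top_k: int) -> list[str]:
    """Hypothetical retriever client: sleeps to simulate network and index time."""
    await asyncio.sleep(delay_s)
    return [f"{name}-doc-{i}" for i in range(top_k)]


async def hybrid_retrieve(
    query: str,
    top_k: int = 100,
    bm25_delay_s: float = 0.008,
    dense_delay_s: float = 0.015,
) -> tuple[list[str], list[str]]:
    # gather() starts both coroutines together, so wall-clock time
    # tracks the slower retriever rather than the sum of the two.
    bm25_hits, dense_hits = await asyncio.gather(
        fake_search("bm25", query, bm25_delay_s, top_k),
        fake_search("dense", query, dense_delay_s, top_k),
    )
    return bm25_hits, dense_hits


start = time.perf_counter()
bm25_hits, dense_hits = asyncio.run(hybrid_retrieve("PSA-2024-117"))
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"{len(bm25_hits)} + {len(dense_hits)} candidates in {elapsed_ms:.1f} ms")
```

In a real service the two `fake_search` calls would be replaced by async clients for the lexical and vector indices; the `gather` call is the part that matters.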

2. Cap candidate count per retriever

Use top_k = 100 per retriever feeding RRF, then let RRF plus the reranker shrink that to the top 20. A larger top_k (not to be confused with the RRF constant k) does not noticeably help recall at the top of the fused list; it only inflates network and reranker cost. The Anthropic Contextual Retrieval write-up landed on the same default.

3. Co-locate the indices

If your BM25 lives in Elasticsearch and your vectors live in Pinecone, each cross-region hop is ~30ms of dead time. The teams running hybrid search at scale (Vespa, Glean, Notion) put both indices in the same process or at least the same data center. Vespa was specifically built to host both indices in the same node — a design choice that looked fussy in 2017 and now looks prescient.

Production stories worth reading

A few public deployments worth studying:

  • Notion’s AI search: Notion’s “AI Connectors” feature indexes a tenant’s workspace plus connected Google Drive / Slack data. The Notion engineering blog discusses combining keyword and semantic search; the hybrid layer became non-optional once they had to retrieve specific person names and document titles.
  • Perplexity’s index: Perplexity has written that its search backend fuses BM25 and dense retrieval over a constantly refreshed web crawl. The reported latency budget for retrieval is sub-100ms even with billions of documents, achieved via sharded indices and parallel BM25 + ANN fanout.
  • Vespa’s tensor + lexical engine: Vespa publishes the deepest engineering writeups of any hybrid system. Their FinanceBench post showed hybrid + reranker doubling recall over either retriever alone, with the cost analysis fully laid out.

What about ColBERT and late-interaction?

A reasonable question: aren’t late-interaction retrievers (ColBERT, ColBERTv2) supposed to give you fine-grained matching without needing BM25? They are, and they do — for the corpora where you can afford the storage cost. We’ll cover that in the late-interaction post, but the short answer is that ColBERT is roughly a 50-150× storage multiplier over single-vector dense embeddings, which prices it out of most production deployments. BM25 + dense remains the cost-effective hybrid for the foreseeable future.
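Where a multiplier like that comes from is easy to see with back-of-envelope arithmetic. The sketch below assumes uncompressed storage of one vector per token, and every input (tokens per passage, dimensions, bytes per value) is an illustrative assumption rather than a measurement of any particular model; compression schemes would lower the result.

```python
def late_interaction_multiplier(
    tokens_per_passage: int,
    token_dim: int,
    single_dim: int,
    token_bytes: int = 2,
    single_bytes: int = 2,
) -> float:
    """Ratio of per-passage storage: one vector per token vs. one vector per passage."""
    multi_vector_bytes = tokens_per_passage * token_dim * token_bytes
    single_vector_bytes = single_dim * single_bytes
    return multi_vector_bytes / single_vector_bytes


# Assumed: 300-token passages, 128-dim token vectors, 768-dim passage vector.
print(late_interaction_multiplier(300, 128, 768))  # 50.0
# Assumed: 100-token passages with equal dimensions on both sides.
print(late_interaction_multiplier(100, 128, 128))  # 100.0
```

The multiplier scales linearly with passage length, so long-document corpora are where the storage bill bites hardest.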

What to take away

Hybrid search is the boring, correct answer to RAG retrieval in 2026. The three lines worth tattooing on the team wiki:

  • Run BM25 and dense in parallel. Their failure modes don’t overlap. RRF fuses the lists without needing per-corpus tuning.
  • Budget 30-80ms for retrieval and protect it. The rest of your latency budget can’t compensate if retrieval is slow.
  • Co-locate the indices. Cross-region hops between Elasticsearch and your vector store will eat your budget faster than any algorithm change.

The teams that argue “we’ll just use a bigger embedding model and skip BM25” almost universally come back six months later and add BM25. The teams that started with hybrid never have that conversation.


Further reading: the original RRF paper from Cormack, Clarke and Buettcher (2009), the BEIR benchmark for retriever comparisons, Vespa’s hybrid search engineering writeup, and Microsoft Azure AI Search’s RRF documentation.
