Reranking at scale: cross-encoders, ColBERT, mxbai
Retrieve-then-rerank is the single biggest accuracy improvement in production RAG. Three approaches dominate — cross-encoders, late interaction, API rerankers. Here's what each costs, what each wins, and the latency budget that decides between them.
If you read one recent paper on RAG, make it the reranker benchmarking study from late 2024 that compared fifteen rerankers across nine corpora. The headline finding is embarrassingly simple: adding any reasonable reranker to a hybrid retrieval pipeline produces a larger accuracy gain than any other single change you can make. Bigger than switching to a frontier embedding. Bigger than tuning your chunking. Bigger, often, than contextual retrieval itself.
The mechanism is straightforward. Bi-encoder retrieval (the standard “embed query, embed docs, dot-product” path) compresses each side into a single vector: fast, scalable, but lossy. A reranker reads the query and document together and produces a relevance score that captures interactions the bi-encoder couldn’t see. The cost is one full model forward pass per query-document pair, none of which can be precomputed, so you only run it on a small shortlist (usually 50-200) of the top retrieved candidates.
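The two-stage shape can be sketched in a few lines. This is a minimal illustration, not a production pipeline: `retrieve_then_rerank`, `toy_overlap_scorer`, and the toy vectors are hypothetical names and data invented for the example, and the word-overlap scorer only stands in for a real reranker model.

```python
from typing import Callable, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def retrieve_then_rerank(
    query_text: str,
    query_vec: Vector,
    docs: Sequence[str],
    doc_vecs: Sequence[Vector],
    pair_scorer: Callable[[str, str], float],
    shortlist_size: int = 100,
    top_k: int = 5,
) -> list[int]:
    """Return indices of the top_k docs after reranking a first-stage shortlist."""
    # Stage 1: cheap bi-encoder scoring, one dot product per document.
    first_stage = sorted(
        range(len(docs)), key=lambda i: dot(query_vec, doc_vecs[i]), reverse=True
    )
    shortlist = first_stage[:shortlist_size]
    # Stage 2: expensive joint scoring, applied only to the shortlist.
    reranked = sorted(
        shortlist, key=lambda i: pair_scorer(query_text, docs[i]), reverse=True
    )
    return reranked[:top_k]


def toy_overlap_scorer(query: str, doc: str) -> float:
    # Stand-in for a reranker: counts words shared by query and document.
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


docs = ["reset your password", "password policy history", "billing address change"]
doc_vecs = [[0.8, 0.2], [0.9, 0.1], [0.1, 0.9]]  # made-up embeddings
best = retrieve_then_rerank(
    "how do I reset my password", [1.0, 0.0], docs, doc_vecs,
    toy_overlap_scorer, shortlist_size=2, top_k=1,
)
```

In this toy data the first stage ranks the close-but-wrong "password policy history" first, and the second stage promotes the document that actually answers the query.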
That’s the whole pattern. The interesting part is which reranker, because the three families (cross-encoders, late interaction, and API rerankers) have made very different choices on the latency-quality-cost triangle.
The shape of the win
The numbers are remarkably consistent across published case studies in the last 18 months. A representative customer-support RAG system deployed with hybrid retrieval alone hits about 67% top-1 accuracy (the right document appears as the first result). Add a reranker and that jumps to 84-89% top-1, a 17-22 percentage point absolute improvement, depending on the reranker and the corpus.
The Databricks Mosaic team’s writeup on adding reranking to their Vector Search reported similar gains. Fin.ai’s case study on customer support RAG specifically showed reranking moving the “first result is correct” rate from the mid-60s to the high-80s. Financial QA work on 10-K filings measured correctness at score-8-or-above going from 33.5% to 49.0% — a 15.5 percentage point absolute lift purely from adding a reranker.
This is a remarkably reproducible finding for a 2024-vintage trick. The reason is that the failure mode rerankers fix (close-but-wrong neighbors crowding out the actually-relevant document at the top of the list) is endemic to bi-encoder retrieval. Once you add a model that can read query and doc together, the obvious wrong matches get demoted and the obvious right matches get promoted.
The other way to read the numbers: a reranker is doing the work a better embedding model would do if you could afford to embed every candidate query-document pair together. You can’t, because that’s quadratic. So the reranker is the compromise — the most expensive part of the pipeline, applied only on the small shortlist where the expense is justified.
Family 1: cross-encoders
A cross-encoder is a transformer that takes [CLS] query [SEP] document [SEP] as a single input and outputs a relevance score. Every layer of the model sees both the query and the document together, which is exactly the source of the quality advantage, and exactly the source of the latency problem.
The 2026 cross-encoder lineup that actually ships in production:
- BGE-Reranker v2-m3 is the open-source default. It’s multilingual, runs on a single GPU, and scores within 1-2 BEIR points of the proprietary leaders. The base variant (568M parameters) is the workhorse; the lightweight bge-reranker-v2-gemma runs faster but trades some quality.
- mxbai-rerank-v2 is the current open-source quality leader. The large variant (1.5B params) achieves 57.49 on the average BEIR benchmark, higher than Cohere or Voyage on the published numbers. The base variant (0.5B) hits 55.57, which is still better than most proprietary options. Mixedbread trained these with reinforcement learning from human feedback on relevance judgements, which is the secret of the lift.
- Cohere Rerank 3.5 is the production-managed default. The API is dead simple (POST /rerank with query and documents), the latency is 595-603ms average for a 150-candidate shortlist, and the quality is competitive with the open-source top.
- Voyage Rerank 2 is Cohere’s main commercial competitor, with similar latency and slightly better English quality on the published benchmarks.
The cross-encoder latency math: a 150-candidate rerank with a 568M parameter model on an A10G GPU takes about 80-120ms. On a CPU, it’s unusable (1-3 seconds). On an H100, you can push to 300-candidate reranks at sub-100ms latency. The two practical levers are model size (smaller = faster, marginally worse) and candidate count (fewer = faster, fewer chances to find the right doc).
The production rule that always wins: cap your candidate count to the smallest number that contains the right answer 99% of the time. For most corpora that’s 50-100, not the 500 that someone on your team will eventually propose to “improve recall.” Beyond 100, the marginal recall gain is small and the latency cost is large.
A useful exercise before deploying: instrument your pipeline to log the rank position of the eventually-correct document for queries where the user-feedback signal tells you which doc was right. Plot the distribution. If 99% of correct documents already appear in the top 50 of your first-stage retrieval, you’ve found your cap. If you need to go to 200 to capture the 99th percentile, your first-stage retrieval has a problem worth fixing before you add more reranker candidates.
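Once the rank positions are logged, finding the cap is a small computation. A minimal sketch, assuming you have one 1-based first-stage rank per query (or `None` when the correct document was never retrieved); `smallest_cap` and the sample ranks are hypothetical, not taken from any real log.

```python
import math
from typing import Optional, Sequence


def smallest_cap(
    correct_ranks: Sequence[Optional[int]], coverage: float = 0.99
) -> Optional[int]:
    """Smallest shortlist size k such that at least `coverage` of the queries
    have their correct document at first-stage rank <= k (1-based ranks)."""
    total = len(correct_ranks)
    if total == 0:
        return None
    # Number of queries that must be covered; the epsilon guards against
    # floating-point noise pushing an exact product just above an integer.
    needed = math.ceil(coverage * total - 1e-9)
    found = sorted(r for r in correct_ranks if r is not None)
    if len(found) < needed:
        # No cap reaches the target: the first stage is missing documents.
        return None
    return found[needed - 1]


logged_ranks = [3, 1, 40, None]  # made-up log: one query never retrieved its doc
cap_75 = smallest_cap(logged_ranks, coverage=0.75)
cap_100 = smallest_cap(logged_ranks, coverage=1.0)
```

A `None` result is the signal described above: no shortlist size reaches the target, so the first-stage retrieval is what needs fixing.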
Family 2: late interaction (ColBERT and descendants)
The ColBERT architecture, which predates the current reranker wave, splits the difference between bi-encoders and cross-encoders. Each query token gets its own embedding and so does each document token. At scoring time, you compute the maximum similarity between each query token and any document token, then sum across query tokens.
The trick is that document encoding can happen offline (you store all the token-level embeddings), and the scoring is fast linear-algebra over those stored embeddings. ColBERTv2 added residual compression to make the index size manageable — each token embedding becomes a centroid plus a small residual, cutting storage by 6-10x.
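The scoring rule (for each query token take the best-matching document token, then sum over query tokens) can be sketched directly. This is a plain-Python illustration with made-up two-dimensional token embeddings; `maxsim_score` is a hypothetical name, and a real system would run the same operation as a batched matrix product over compressed embeddings.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens: Sequence[Vector], doc_tokens: Sequence[Vector]) -> float:
    """Late-interaction score: sum over query tokens of the maximum
    similarity to any document token. Assumes doc_tokens is non-empty."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


query = [[1.0, 0.0], [0.0, 1.0]]   # two query-token embeddings
doc_a = [[1.0, 0.0], [0.5, 0.5]]   # matches the first query token well
doc_b = [[0.0, 1.0], [1.0, 0.0]]   # matches both query tokens exactly

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
```

Because the document-token embeddings do not depend on the query, everything on the document side can be computed and stored ahead of time, which is where the speed advantage comes from.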
The published numbers say ColBERT-style late interaction is roughly two orders of magnitude faster than a cross-encoder at equivalent quality, with about four orders of magnitude fewer FLOPs per query. The catch: the index is larger (you store one vector per token rather than per chunk), and the engineering is more complex than just calling /rerank on an API.
Jina-ColBERT-v2 extended the architecture to multilingual settings, and a small ecosystem of inference engines now serves it: RAGatouille, colbert-ai, and custom implementations on top of vLLM and SGLang.
The production reality is that ColBERT shows up in two distinct shapes:
- As a reranker on top of a bi-encoder first stage, which is ColBERT’s original use case. This is what the published BEIR numbers measure.
- As a primary retriever for corpora under about 5M chunks, where the token-level matching is precise enough to skip the bi-encoder stage entirely.
The teams I’ve watched use ColBERT successfully share a common shape: they have an engineering culture comfortable running inference services themselves, and their latency budget is tight enough that the cross-encoder option (200ms+ for a meaningful rerank) is out. For everyone else, late interaction is a fascinating architectural footnote they should know about but probably won’t deploy.
The other place ColBERT is showing up in 2026 is multimodal retrieval. The token-level matching that makes ColBERT good at fine-grained text matching is even more useful when the “tokens” are visual patches or audio frames. Some published systems use ColBERT-style architectures to rerank multimodal candidates after a coarse CLIP-style first retrieval — a niche but growing use case as multimodal RAG becomes production-ready.
Family 3: API rerankers
The API approach is what most teams ship in production for the same reason most teams use Pinecone over self-hosted Qdrant — operational simplicity dominates everything else when you don’t have a dedicated ML platform team.
The two services that matter:
- Cohere Rerank 3.5 is the incumbent. The API has been stable since 2023, the latency is predictable, multilingual support is excellent, and the pricing is per-1000-queries (with a generous free tier for development). It’s the default reranker in LlamaIndex, LangChain, and most RAG tutorials for a reason.
- Voyage Rerank 2 is the challenger. Slightly better quality on most published benchmarks, slightly higher cost. The Voyage team’s domain expertise (they’re the same people behind Voyage 3 embeddings) means their rerankers tend to win on domain-specific tasks.
The latency picture for API rerankers: 595-603ms average for both Cohere 3.5 and Voyage 2 on a 150-candidate shortlist, measured by independent benchmarks. That’s enough to be visible to users — adding ~600ms to your query path is a real product cost. The trade is that you don’t have to run a GPU, you don’t pay for idle time, and the latency is consistent rather than spiky.
The hidden gotcha with API rerankers: they’re a network call to an external service in your hot path. The day Cohere has a regional outage, your RAG quality drops to “retrieval only” (which is still usable, but worse). The teams that ship API rerankers always have a circuit-breaker that bypasses to bi-encoder-only mode on reranker failure — and almost all of them tested it the hard way.
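A circuit breaker of this kind does not need much code. The sketch below is one possible shape, not any particular team’s implementation: `RerankCircuitBreaker`, its thresholds, and the `rerank_fn` callable (assumed to return document indices in reranked order and to raise on failure or timeout) are all hypothetical.

```python
import time
from typing import Callable, Optional, Sequence


class RerankCircuitBreaker:
    """Falls back to first-stage order when the reranker keeps failing."""

    def __init__(
        self,
        rerank_fn: Callable[[str, Sequence[str]], Sequence[int]],
        failure_threshold: int = 3,
        cooldown_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rerank_fn = rerank_fn
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def rerank(self, query: str, docs: Sequence[str]) -> list[str]:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown_s:
                return list(docs)  # breaker open: skip the network call
            self.opened_at = None  # cooldown over: allow one trial call
        try:
            order = self.rerank_fn(query, docs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return list(docs)  # degrade to retrieval-only ordering
        self.failures = 0
        return [docs[i] for i in order]
```

The call inside `rerank_fn` should carry its own request timeout, so that a slow reranker counts as a failure instead of stalling the query path. Injecting the clock is only there to make the cooldown testable before an outage does the testing for you.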
The other API-reranker-only feature worth noting: built-in support for fields in the document representation. Cohere Rerank lets you pass documents as {title: ..., body: ..., author: ...} dicts rather than flat strings, and the model has been trained to weight fields appropriately. For structured corpora (product catalogs, knowledge base articles with metadata), this is meaningfully better than concatenating fields into a single string. Self-hosted cross-encoders can do this too, but it requires more careful prompt formatting on your end.
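For the self-hosted case, “careful prompt formatting” can be as simple as rendering the fields in a fixed, labelled order. A minimal sketch, assuming a labelled-lines layout; the layout and the `flatten_fields` helper are illustrative choices, not a format any model card prescribes, so check what your reranker was trained on.

```python
from typing import Mapping, Sequence


def flatten_fields(
    doc: Mapping[str, str],
    field_order: Sequence[str] = ("title", "author", "body"),
) -> str:
    """Render selected fields as labelled lines, skipping missing or empty ones."""
    lines = []
    for name in field_order:
        value = doc.get(name, "")
        if value:
            lines.append(f"{name.capitalize()}: {value}")
    return "\n".join(lines)


article = {"title": "Refund policy", "body": "Refunds take 5 days.", "author": ""}
flat = flatten_fields(article)
```

Keeping the order and labels identical across every document matters more than the exact layout, since the reranker sees the same structure on every candidate.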
The latency budget that decides the choice
The decision rule that holds up across most teams shipping RAG:
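As a rough illustration only, a rule of this kind can be written as a function of two inputs: the latency budget and whether the team runs its own GPUs. Everything in the sketch is an assumption assembled from figures quoted elsewhere in this article (about 600ms for an API rerank, 80-120ms for a 568M cross-encoder on an A10G); `choose_reranker` and its thresholds are hypothetical and should be replaced with your own measurements.

```python
# Assumed thresholds, taken from the latency figures quoted in this article.
API_RERANK_MS = 600       # ~595-603ms for a 150-candidate API rerank
CROSS_ENCODER_MS = 120    # upper end of 80-120ms for a 568M model on an A10G


def choose_reranker(budget_ms: float, has_gpu_ops: bool) -> str:
    """Pick a reranker family from a latency budget and operational capacity."""
    if has_gpu_ops:
        if budget_ms >= CROSS_ENCODER_MS:
            return "self-hosted cross-encoder"
        # Too tight for a cross-encoder, but the team can run its own serving.
        return "late interaction (ColBERT-style)"
    if budget_ms >= API_RERANK_MS:
        return "API reranker"
    return "no reranker fits; revisit the budget or the infrastructure"
```

The last branch is the uncomfortable one: a tight budget with no GPU operations leaves no option that fits, and the fix is organizational, not architectural.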
The gotchas worth knowing before you ship
A few things that bite teams adding rerankers for the first time:
Rerankers are not symmetric. A cross-encoder trained on (query, document) pairs will not work as well if you feed it (document, query). The input order matters. Always check what the model card actually expects.
The candidate count is your second hyperparameter. Most teams focus on which reranker to use and ignore how many candidates to feed it. The sweet spot is usually 50-100. Going below 50 hurts recall (you’ve already discarded the right answer); going above 200 hurts latency without meaningfully improving quality.
The reranker score is not calibrated. A score of 0.95 on one query is not comparable to a score of 0.95 on another. If you’re using the reranker score as an “is this document relevant enough to show” threshold, you’ll need to calibrate per-query (or per-domain). The relative ranking is what’s reliable; the absolute score is not.
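One way to get a usable per-domain cutoff is to fit it from a small labelled sample instead of picking a number by eye. The sketch below is an assumption-laden illustration: `fit_threshold` is a hypothetical helper, the labelled pairs are made up, and it assumes scores within a domain are distinct.

```python
from typing import Optional, Sequence, Tuple


def fit_threshold(
    labeled: Sequence[Tuple[float, bool]], target_precision: float = 0.9
) -> Optional[float]:
    """Lowest score cutoff whose precision on (score, is_relevant) pairs
    meets the target. Returns None if no cutoff does."""
    ranked = sorted(labeled, key=lambda pair: pair[0], reverse=True)
    best = None
    relevant = 0
    for kept, (score, is_relevant) in enumerate(ranked, start=1):
        relevant += int(is_relevant)
        # Precision if everything scoring at or above `score` is shown.
        if relevant / kept >= target_precision:
            best = score
    return best


# Made-up labelled sample for one domain.
support_sample = [(0.95, True), (0.9, True), (0.7, False), (0.6, True), (0.2, False)]
cutoff = fit_threshold(support_sample, target_precision=0.75)
```

Fit one cutoff per domain (or per query class) and store them side by side; a single global number is exactly the cross-query comparison the scores do not support.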
Latency has a long tail. Cross-encoder latency depends on document length, and document length distributions in real corpora are heavy-tailed. Your p50 latency might be 80ms while your p99 is 400ms. Truncate documents at indexing time to a fixed max-token budget if your latency SLO is tight.
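Both halves of that advice are easy to sketch: cap document length at indexing time, and watch the tail instead of the median. The helpers below are hypothetical, and splitting on whitespace is a stand-in assumption; a real budget should be counted with the reranker’s own tokenizer.

```python
import math
from typing import Sequence


def truncate_tokens(text: str, max_tokens: int) -> str:
    """Keep at most max_tokens whitespace-separated tokens (a rough proxy
    for model tokens)."""
    return " ".join(text.split()[:max_tokens])


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty sequence, pct in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


latencies_ms = [80, 85, 90, 400]  # made-up measurements with one long document
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
short = truncate_tokens("a b c d e", 3)
```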
What to take away
After watching three years of reranker deployments in production, the shape of the take-away has barely changed:
- Adding a reranker is the highest-leverage single change you can make to a RAG system. 15-22 points of top-1 accuracy is the expected lift, not the best case.
- The choice of reranker matters less than the choice to use one. The differences between mxbai-rerank-v2, Cohere Rerank 3.5, and Voyage Rerank 2 are small enough that they rarely move your end-task metric.
- Cap your candidate count. This is the production rule everyone learns the hard way; learn it the easy way.
- API rerankers are the right default for teams without GPU operations. Self-hosted cross-encoders are the right default for teams with them. ColBERT is the right answer for the narrow case where API latency is too high and you have the engineering for it.
The 2026 production RAG stack — Contextual Retrieval at index time, hybrid BM25-plus-dense at query time, a reranker on the top 50-100, an LLM consuming the top 5-10 — is a five-year accumulation of small, boring, measurable improvements. The reranker is the single largest of them. If you don’t have one yet, that’s where to start.
Further reading: the Cohere Rerank 3.5 documentation, Mixedbread’s mxbai-rerank-v2 launch, the BGE-Reranker repo, and the original ColBERTv2 paper. For a comprehensive benchmarking writeup, see the reranker survey paper from late 2024.