datarekha
RAG May 1, 2026

Embeddings in 2026: Voyage, Cohere v4, OpenAI text-embedding-3, mxbai, BGE

The embedding model market split clean. Proprietary models (Voyage 3, Cohere v4, OpenAI v3) own the quality top; open-source (mxbai, BGE-M3, Jina v3, E5) own cost and self-host. Matryoshka rewrote the cost economics. Here's how to actually choose.

12 min read · by datarekha · Tags: embeddings, voyage, cohere, openai, mxbai

In 2023, “which embedding model should I use?” had one answer: text-embedding-ada-002. Everyone used it. It cost almost nothing. It was good enough. The honest reason it dominated was that the open-source alternatives were two years behind and the only competitor with real brand recognition was Cohere.

The picture in 2026 is unrecognisable. There are now five proprietary embedding APIs that beat ada-002 on every meaningful benchmark, three of which beat OpenAI’s own next-gen text-embedding-3-large. There are six open-source models that close 90%+ of the gap and run on a single GPU. And the technique that changed the economics — Matryoshka Representation Learning — means the “premium” embedding from Voyage or Cohere can be truncated to 256 dimensions for index storage while retaining most of its retrieval quality.

This post is the practical map. What the MTEB leaderboard says, where the rankings break down by domain, and the decision logic that actually fits production constraints.

The four-tier market

The model landscape sorted itself into four distinct tiers. Picking the right tier matters more than picking within the tier:

EMBEDDING MODEL TIERS, 2026

TIER 1: proprietary, frontier quality
  • Cohere Embed v4 ($0.12/M) · Voyage 3-large ($0.12/M) · OpenAI text-embedding-3-large ($0.13/M)
  • MTEB 63-65 · 1024-3072 dim · 8K-128K context · Matryoshka support

TIER 2: proprietary, cheap default
  • OpenAI text-embedding-3-small ($0.02/M) · Voyage 3 ($0.06/M) · Voyage 3-lite ($0.02/M)
  • MTEB 59-62 · 512-1536 dim · the new “default” for cost-conscious RAG

TIER 3: open-source, hosted-or-self
  • BGE-M3 · mxbai-embed-large-v1 · Jina v3 · E5-mistral · nomic-embed-text-v1.5
  • MTEB 58-63 · 768-1024 dim · run on a single 24GB GPU · MIT or Apache

TIER 4: fast on-device
  • all-MiniLM-L6-v2 · bge-small-en-v1.5 · gte-small
  • MTEB 49-56 · 384-512 dim · run on CPU at thousands of docs/sec

The four tiers, with representative models. The price-per-million-tokens numbers are from the providers’ own pricing pages as of May 2026.

The right tier depends on three things: how sensitive your task is to the last few MTEB points, how much you care about the difference between $0.02 and $0.13 per million tokens, and whether “self-host” appears on your requirements list.

A quick sanity check on the numbers. The MTEB leaderboard includes 56 tasks across 8 categories (retrieval, classification, clustering, reranking, STS, etc.), and the “average MTEB” most people quote is the mean across all of them. That’s a slightly misleading metric for RAG specifically — the retrieval subset is what matters, and the proprietary models’ lead is larger on retrieval than on the average. A model that’s 64.6 average is often 70+ on retrieval-only, because retrieval is what the training data is heaviest on.

Tier 1 in detail: what the benchmarks actually say

On the MTEB English leaderboard as of mid-2026, the proprietary top is tight enough that the choice isn’t about quality in the abstract — it’s about which domain matters most to you:

  • Cohere Embed v4 (65.2 MTEB) — overall winner on aggregate scores. The 128K-token context window is the largest of any major provider, and binary quantization is native (the same vector can be served as float32 for high recall and as a 1-bit bitmap for cheap top-k pre-filtering). The multilingual numbers are state-of-the-art across 100+ languages.
  • Voyage 3-large (64.8 MTEB, 70+ on retrieval-only benchmarks) — wins on domain-specific retrieval by a 4-6 point margin. Their domain-tuned variants (voyage-code-3, voyage-law-2, voyage-finance-2, voyage-multilingual-3) consistently top the relevant subsets of BEIR. If you’re embedding code, legal contracts, medical literature, or financial filings, this is the model.
  • OpenAI text-embedding-3-large (64.6 MTEB) — the everywhere-default. 3072 dimensions natively (truncatable to 256, 512, 1024, or 2048 via Matryoshka), $0.13/M tokens, and the same SDK and reliability story as the rest of OpenAI’s stack. For most generic English text, it’s the choice you don’t need to defend.
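
The binary-quantization idea mentioned for Cohere Embed v4 is easy to see in miniature. The sketch below is not Cohere's implementation; it is a minimal illustration that assumes sign-based 1-bit codes, Hamming distance for the cheap pre-filter, and a float rescoring pass. The random vectors, the `shortlist` size, and all function names are hypothetical.

```python
import random


def binarize(vec):
    """Pack the sign of each component into one integer, 1 bit per dimension."""
    bits = 0
    for x in vec:
        bits = (bits << 1) | (1 if x > 0 else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two packed codes."""
    return bin(a ^ b).count("1")


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def search(query, docs, codes, k=3, shortlist=20):
    """Stage 1: shortlist by Hamming distance. Stage 2: rescore with floats."""
    q_code = binarize(query)
    candidates = sorted(
        range(len(docs)), key=lambda i: hamming(q_code, codes[i])
    )[:shortlist]
    return sorted(candidates, key=lambda i: dot(query, docs[i]), reverse=True)[:k]


# Hypothetical data: random vectors stand in for real embeddings.
rng = random.Random(0)
docs = [[rng.gauss(0, 1) for _ in range(64)] for _ in range(200)]
codes = [binarize(d) for d in docs]
query = [x + rng.gauss(0, 0.1) for x in docs[42]]  # noisy copy of document 42
top = search(query, docs, codes)
```

The 1-bit codes are 32 times smaller than float32 vectors, so the first stage can scan far more candidates per unit of memory bandwidth; the float pass only touches the shortlist.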

The thing that makes these three a real tier and not a marketing trio is that the gap to tier 2 is consistent and measurable: 3-5 MTEB points, which translates to roughly 5-10% higher top-10 retrieval recall on realistic RAG corpora. For a customer-support RAG, that’s the difference between a 78% top-1 hit rate and an 84% top-1 hit rate.

The opinion most labs converge on, after running their own A/B tests on representative corpora: OpenAI’s text-embedding-3-large is the right default for English RAG, Voyage wins where the corpus is domain-specific enough that you can measure the lift, and Cohere wins where multilingual or long-context is the binding constraint. Picking any of the three over ada-002 is a clear upgrade; picking among them is a second-order optimization that should be driven by your eval set, not by which provider has the loudest marketing this month.

Tier 2: the new “good enough” default

The real news in tier 2 is that text-embedding-3-small and voyage-3-lite both sit at $0.02/M tokens, about one-sixth the price of the tier-1 models, while scoring within 3-4 MTEB points of them. For a lot of production RAG, that gap doesn’t move the needle on the end-task metric. The user can’t tell whether the top-1 hit rate is 78% or 81%, because either way the right document usually lands somewhere in the LLM’s context.

What you give up in tier 2:

  • Context length. text-embedding-3-small is 8K; many tier-1 models go to 32K or 128K. If you’re embedding contracts or research papers without chunking, this constrains you.
  • Domain robustness. The small models are trained on the same corpora as the large ones, but with less capacity to specialise. On a narrow domain (medical, legal) the small model drops further behind. On generic English, it barely matters.
  • The Matryoshka knob. You can still truncate dimensions, but the resulting quality at 256 dim from a small model is worse than 256 dim truncated from a large model.

The rule that’s emerged at most teams: use a tier-1 model for the high-traffic queries where the last few percent of accuracy translates to user trust, and a tier-2 model when you’re embedding huge corpora where the cost would be prohibitive at $0.13/M tokens. Queries and documents in the same index have to be embedded by the same model, so this means choosing a tier per index, not mixing models within one. Most production RAG systems can use tier 2 for the bulk index and add a reranker (or a separate tier-1 index) where the ranking matters most.
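
To put numbers on the price gap, here is a small cost calculator. The prices are the ones quoted in the tier table above; the 10B-token corpus size is a hypothetical chosen only for illustration.

```python
# USD per million tokens, as quoted in the tier table in this post.
PRICE_PER_M_TOKENS = {
    "text-embedding-3-large": 0.13,
    "voyage-3-large": 0.12,
    "voyage-3": 0.06,
    "text-embedding-3-small": 0.02,
}


def embedding_cost(total_tokens, price_per_million):
    """One-time cost in USD to embed `total_tokens` tokens."""
    return total_tokens / 1_000_000 * price_per_million


corpus_tokens = 10_000_000_000  # hypothetical 10B-token corpus
costs = {
    model: embedding_cost(corpus_tokens, price)
    for model, price in PRICE_PER_M_TOKENS.items()
}
```

On that hypothetical corpus the tier-1 bill is about $1,300 and the tier-2 bill about $200, which is the kind of difference that decides the question for very large indexes.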

Tier 3: the open-source story finally caught up

The four open-source embedding models that matter in 2026 are BGE-M3, mxbai-embed-large-v1, Jina embeddings v3, and E5-mistral. The first three each close the gap to proprietary in a slightly different direction:

BGE-M3 is the multilingual + hybrid winner. It produces three different vector types from the same forward pass — dense, sparse (lexical), and ColBERT-style multi-vector — which means one model can serve all three retrieval signals in a hybrid search pipeline. For multilingual retrieval across 100+ languages, BGE-M3 is competitive with Cohere’s multilingual offering at zero per-token cost.
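
Once a model hands you both a dense and a sparse ranking, something has to merge them. One common, model-agnostic option is reciprocal rank fusion (RRF). This is a swapped-in generic technique, not something BGE-M3 itself prescribes, and the document ids below are hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc ids; k=60 is the conventional damping constant."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


dense_ranking = ["doc_a", "doc_b", "doc_c"]   # hypothetical dense results
sparse_ranking = ["doc_b", "doc_d", "doc_a"]  # hypothetical lexical results
fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
```

RRF only looks at ranks, so it sidesteps the problem that dense and sparse scores live on different scales.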

mxbai-embed-large-v1 is the English MRL winner. It’s a 1024-dimension model trained explicitly for Matryoshka, so you can truncate to 512 or 256 dimensions with minimal quality loss. The Mixedbread team also released mxbai-embed-2d-large-v1, which does two-dimensional Matryoshka — you can independently tune both the embedding dimension and the number of transformer layers used at inference, which is useful on edge devices.

Jina embeddings v3 is the long-context + task-specific winner. The 8192-token context is competitive with proprietary, and the task-LoRA architecture means the same base model serves five distinct tasks (retrieval-doc, retrieval-query, classification, clustering, text-matching) via different LoRA adapters plugged in at runtime. The 570M parameter count fits on a small GPU.

The honest summary of tier 3: if you’re willing to operate the inference infrastructure, you can get within 1-3 MTEB points of the proprietary top tier at zero per-token cost. The crossover point where self-hosting pays off is usually somewhere around 500M-1B embedding tokens per month, depending on your GPU rental rates.
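
The crossover is simple arithmetic once you fix a GPU bill. The sketch below assumes a flat monthly GPU cost and ignores engineering and operations time, which in practice pushes the real break-even higher. The $100/month figure is a hypothetical, not a quoted rental rate.

```python
def breakeven_tokens_per_month(gpu_cost_per_month, api_price_per_million):
    """Monthly token volume above which a fixed-cost GPU beats the API bill."""
    return gpu_cost_per_month / api_price_per_million * 1_000_000


# Hypothetical: a $100/month GPU versus a $0.13/M-token API.
tokens = breakeven_tokens_per_month(100.0, 0.13)
```

With those inputs the break-even lands at roughly 770M tokens per month, inside the 500M-1B range quoted above.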

The two other reasons teams go open-source that have nothing to do with cost: data residency (you can’t send the corpus to OpenAI because it’s regulated patient data, customer PII, or classified content), and embedding-model fine-tuning (you want to nudge the model on domain-specific synonyms or jargon, which proprietary APIs don’t let you do). Both of those are common in enterprise, and both push teams to tier 3 regardless of cost.

The fourth tier — the all-MiniLM family and friends — is the one people forget exists once they’ve experienced the proprietary top. But for use cases where embeddings have to run on a phone, in a browser via WASM, or alongside a search service that handles millions of QPS at the cost of an Elasticsearch node, the small models are still the only option. Chroma’s default embedding is all-MiniLM-L6-v2 for a reason — it’s the one that fits in 90MB.

The Matryoshka revolution, explained in three lines

The technique that changed the economics: train an embedding model so that the first N dimensions of every output vector form a valid lower-dimensional embedding on their own. Now you can store the full 3072-dim vector once and serve queries at any dimension you want: 256 for the cheap pre-filter, 1024 for the rerank, 3072 if you really need the last percent of recall.

PLAIN EMBEDDING
  • vector: 3072 floats, single fixed-size representation
  • storage cost: 12 KB per vector
  • retrieval: must use all 3072 dim
  • 1B vectors = 12 TB index

MATRYOSHKA EMBEDDING
  • vector: 3072 floats, nested (256 + 512 + 2304)
  • store full 3072 once
  • query at any prefix you want
  • 1B vectors at 256 dim = 1 TB

Matryoshka in one picture. The same vector serves multiple budgets — the first 256 floats stand alone as a usable embedding, the first 1024 as a higher-quality one, and the full 3072 for the highest recall. Storage and query cost scale linearly with dimension.
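
The storage figures in the picture follow directly from float32 arithmetic. A minimal check, assuming 4 bytes per dimension and counting raw vectors only (no index overhead or metadata):

```python
BYTES_PER_FLOAT32 = 4


def index_size_bytes(num_vectors, dim):
    """Raw storage for `num_vectors` float32 vectors of dimension `dim`."""
    return num_vectors * dim * BYTES_PER_FLOAT32


per_vector = index_size_bytes(1, 3072)             # 12,288 bytes, about 12 KB
full = index_size_bytes(1_000_000_000, 3072)       # about 12 TB
truncated = index_size_bytes(1_000_000_000, 256)   # about 1 TB
ratio = full // truncated
```

The ratio is exactly 3072 / 256 = 12, which is where the twelvefold storage saving comes from.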

OpenAI’s text-embedding-3-large supports MRL out of the box: pass dimensions=256 and you get the truncated vector, and the quality on retrieval tasks degrades much more gracefully than truncating a non-MRL vector. Voyage, Nomic, mxbai, and Jina v3 all support it. Cohere’s v4 takes a slightly different approach with native binary quantization, which gets to similar storage savings via a different mechanism.
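
If you already store full-length MRL vectors, you can get the shorter ones locally by keeping a prefix and renormalizing, then use the short prefix as a cheap first pass. The sketch below is a minimal pure-Python illustration of that funnel, assuming cosine similarity on unit vectors. The random data is hypothetical and is not output from an MRL-trained model, so it shows the mechanics only, not the quality.

```python
import math
import random


def truncate(vec, dim):
    """Keep the first `dim` components, then rescale to unit length."""
    prefix = vec[:dim]
    norm = math.sqrt(sum(x * x for x in prefix))
    return [x / norm for x in prefix] if norm > 0 else prefix


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def funnel_search(query, docs, k=3, coarse_dim=8, shortlist=20):
    """Shortlist with a short prefix, then rerank the shortlist at full dim."""
    full_dim = len(query)
    q_coarse = truncate(query, coarse_dim)
    by_coarse = sorted(
        range(len(docs)),
        key=lambda i: dot(q_coarse, truncate(docs[i], coarse_dim)),
        reverse=True,
    )[:shortlist]
    q_full = truncate(query, full_dim)
    return sorted(
        by_coarse,
        key=lambda i: dot(q_full, truncate(docs[i], full_dim)),
        reverse=True,
    )[:k]


# Hypothetical data: random vectors stand in for stored embeddings.
rng = random.Random(1)
docs = [[rng.gauss(0, 1) for _ in range(32)] for _ in range(100)]
query = [x + rng.gauss(0, 0.05) for x in docs[7]]  # noisy copy of document 7
unit = truncate([3.0, 4.0, 12.0], 2)  # keeps [3, 4], rescales to [0.6, 0.8]
top = funnel_search(query, docs)
```

In a real system the truncated document vectors would be computed once at index time, not inside the search loop as they are here for brevity.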

The interesting research direction in 2025 was Matryoshka-Adaptor, which retrofits Matryoshka behaviour onto a pre-trained non-MRL embedding model with a small adaptor layer. The published numbers show the adaptor recovering 80-90% of the quality loss from naive truncation, which means even older embedding models can be retrofitted to be storage-efficient without retraining from scratch.

The practical implication for cost: a 100M-document RAG corpus embedded at 3072 dimensions in Pinecone would cost roughly $300/month just in storage. The same corpus truncated to 256 dimensions costs roughly $25/month, with retrieval quality dropping by 2-4 percentage points. For most use cases, that’s a trade worth making — and you only got to make it because of MRL.

The hidden cost of embedding migrations

A note worth saying out loud, because it bites people who skip it: switching embedding models is not free. The vectors generated by different models are not comparable — a query embedded by Voyage 3 and a document embedded by text-embedding-3-small are in different spaces and will produce nonsense distances. So when you “switch embedding models,” what you’re actually doing is re-embedding your entire corpus.

For a 100M-document corpus at $0.13/M tokens (text-embedding-3-large; Voyage 3-large is priced about the same at $0.12/M), the one-time cost is about $1,300 to re-embed, assuming roughly 100 tokens per document, or 10B tokens in total; longer documents scale the bill proportionally. That’s not prohibitive, but it’s also not nothing. The larger pain is the operational migration: you need a dual-write window where both indexes are written and queried, then a cutover, then deletion of the old index. For most production teams that’s a 2-4 week project end to end, dominated not by the embedding cost but by the engineering around the dual-index period.
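
The dual-write window is mostly bookkeeping: every write goes to both indexes, and every read must use the embedder that matches the index it hits. A minimal sketch of that routing, with plain dicts standing in for vector indexes and two toy functions standing in for the old and new embedding models (all names here are hypothetical):

```python
class DualIndexRouter:
    """Writes go to both indexes during migration; reads follow a cutover flag."""

    def __init__(self, old_index, new_index, embed_old, embed_new):
        self.old_index = old_index
        self.new_index = new_index
        self.embed_old = embed_old
        self.embed_new = embed_new
        self.read_from_new = False

    def upsert(self, doc_id, text):
        # Each index only ever receives vectors from its own model.
        self.old_index[doc_id] = self.embed_old(text)
        self.new_index[doc_id] = self.embed_new(text)

    def query_vector(self, text):
        """Embed the query with the model that matches the index being read."""
        if self.read_from_new:
            return "new", self.embed_new(text)
        return "old", self.embed_old(text)

    def cut_over(self):
        self.read_from_new = True


def embed_old(text):  # toy stand-in for the old model
    return [float(len(text))]


def embed_new(text):  # toy stand-in for the new model
    return [float(len(text)), float(text.count(" "))]


router = DualIndexRouter({}, {}, embed_old, embed_new)
router.upsert("doc-1", "reset your password")
before = router.query_vector("password reset")
router.cut_over()
after = router.query_vector("password reset")
```

Keeping the index name and the embedder paired in one place is the point of the design: it makes it hard to send a new-model query to an old-model index by accident.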

The implication: pick your embedding model with the assumption you’ll keep it for 12-18 months. That horizon used to mean you’d be locked into something obsolete; in 2026, the rate of frontier-tier embedding improvement has slowed enough that this is no longer the risk it once was. The big proprietary models have been roughly stable on MTEB since mid-2024; the gains since then have been on long-context, multilingual breadth, and dimension flexibility, not on raw retrieval quality.

What the domain-specific benchmarks show

The MTEB average obscures what most production RAG teams actually care about. A real-world example: a customer-support RAG comparing the top-3 retrieval accuracy of five models on a 50K-document internal knowledge base. The corpus is half English, half multilingual, with a mix of API documentation, troubleshooting guides, and policy pages.

The numbers that emerged from a representative published evaluation:

  • Voyage 3-large at 1024 dim: 84.2% top-3 accuracy. Best overall, with the lead growing on the multilingual subset.
  • OpenAI text-embedding-3-large at 1024 dim (MRL truncated): 82.1%. The closest non-Voyage option, slightly behind on the multilingual half.
  • Cohere Embed v4 at 1024 dim: 81.7%. Multilingual lead is real but doesn’t show up on this corpus; cost was the deciding factor against it.
  • mxbai-embed-large-v1 at 1024 dim (self-hosted): 79.4%. The open-source contender; within 5 points of Voyage at zero per-token cost.
  • text-embedding-3-small at 512 dim (MRL truncated): 76.8%. The cheap default; 7 points behind the leader, but at one-sixth the per-token cost and half the storage.

The lesson: the gap between tier 1 and tier 2 is noticeable (~5-7 points of top-3 accuracy), the gap between tier 1 proprietary and tier 3 open-source is smaller (~2-5 points), and the within-tier-1 choices sit within about 2.5 points of each other, which is close to margin of error on a single corpus. Pick the tier that matches your constraints; pick within tier by what your own eval set says.
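
Running your own eval set does not take much code. The sketch below assumes the simplest definition of top-k accuracy, one known-relevant document per query, counted as a hit if it appears in the first k results. The post does not spell out the exact metric used in the evaluation above, and the query and document ids here are hypothetical.

```python
def top_k_accuracy(retrieved, relevant, k=3):
    """Fraction of queries whose relevant doc id is in the first k results."""
    hits = sum(
        1 for query_id, ranked in retrieved.items()
        if relevant[query_id] in ranked[:k]
    )
    return hits / len(retrieved)


# Hypothetical eval set: ranked results per query and one gold doc per query.
retrieved = {
    "q1": ["d3", "d1", "d9"],
    "q2": ["d4", "d5", "d6", "d2"],  # gold doc is ranked 4th: a miss at k=3
    "q3": ["d7"],
    "q4": ["d8", "d0"],
}
relevant = {"q1": "d1", "q2": "d2", "q3": "d7", "q4": "d0"}
score = top_k_accuracy(retrieved, relevant, k=3)
```

Run the same function over each candidate model's results on the same queries, and the within-tier comparison becomes a table you can actually act on.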

A practical decision

The decision that holds up across most teams I’ve watched ship RAG in the last year:

  • Default for English RAG, generic domain: OpenAI text-embedding-3-large at 1024 dim via MRL. You’ll never need to defend this choice; it’s good enough.
  • Domain-specific (code, legal, medical, finance): Voyage’s domain models. The 4-6 point MTEB lead on the relevant subset is real and translates to user-visible quality.
  • Multilingual: Cohere Embed v4 for breadth, BGE-M3 if you want to self-host.
  • Cost-constrained at scale: Voyage 3-lite or text-embedding-3-small at $0.02/M, indexed at 256-512 dim via MRL.
  • Self-host required (data residency, air-gap, sub-100ms p99): mxbai-embed-large or BGE-M3 on your own GPU.
  • Edge / on-device: all-MiniLM-L6-v2 (still). Nothing else fits in 90MB of RAM with similar quality.
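
The list above can be written down as a small function, which is a handy way to force the team to agree on which constraint wins when several apply. The priority order here (on-device, then self-host, then domain, then multilingual, then cost) is an assumption of this sketch, not something the list itself specifies, and the flag names are hypothetical.

```python
def choose_embedding_model(
    domain_specific=False,
    multilingual=False,
    self_host=False,
    on_device=False,
    cost_constrained=False,
):
    """Map constraints to a starting model; hard constraints are checked first."""
    if on_device:
        return "all-MiniLM-L6-v2"
    if self_host:
        return "BGE-M3" if multilingual else "mxbai-embed-large-v1"
    if domain_specific:
        return "Voyage domain model"
    if multilingual:
        return "Cohere Embed v4"
    if cost_constrained:
        return "text-embedding-3-small"  # voyage-3-lite is the listed alternative
    return "text-embedding-3-large"


pick = choose_embedding_model(multilingual=True)
```

Treat the return value as the first candidate to put through your eval set, not as the final answer.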

What hasn’t changed: the embedding model is rarely the bottleneck of your RAG system. The chunk-context preamble matters more, the retrieval strategy matters more, the reranker matters more, the prompt design matters more. Spend the engineering time where the leverage is, and default to the embedding model that requires the least defense, which in 2026 is OpenAI text-embedding-3-large for the generic case and Voyage’s domain models for domain-specific corpora.

The era of “just use ada-002 and forget about it” is over. The era of “the embedding model is a strategic technical choice that meaningfully affects your product” started somewhere in late 2024 and isn’t going back.


Further reading: the MTEB leaderboard is the source of truth for benchmark scores. OpenAI’s text-embedding-3 announcement introduced MRL into the API. Mixedbread’s mxbai-embed-large-v1 release post is the best open-source reference. The original Matryoshka Representation Learning paper is short and worth reading.
