Mixture of experts in production: Mixtral, DeepSeek, Llama 4

The Switch Transformer paper shipped in January 2021 with a single, slightly outrageous claim: a 1.6 trillion parameter language model that trained 4× faster than an 11B dense baseline (T5-XXL) at the same FLOPs budget. The mechanism (sparse mixture of experts, where each token activates only one of N expert subnetworks) felt like research-curiosity territory. Nobody was going to serve a trillion parameters anyway.

Five years later, most of the leading open-weight frontier models are MoE. Mixtral 8x7B (December 2023) was the proof of concept that everyone could download. Mixtral 8x22B and DBRX (early 2024) pushed it to commercial quality. DeepSeek-V3 (December 2024) reached 671B total parameters with 37B active per token and shipped open-weight at GPT-4-class quality. Llama 4 (April 2025) completed the architectural shift: all three models in the Llama 4 family are MoE.

The production question is no longer “should we use MoE?” — it’s “how do we serve it well?” That’s a different question than dense-model serving, and the answer has been worked out, mostly publicly, by the vLLM and SGLang teams over the last 18 months.

The economic claim, in one chart

The MoE pitch is straightforward: pay a training premium that grows with total parameter count, then pay inference compute proportional to active parameter count. For DeepSeek-V3, that’s training a 671B model at roughly 5× the cost of training a 37B dense model (well short of the 18× a strictly proportional rule would imply, because per-token training FLOPs also track active parameters), and then serving it at roughly 37B-dense compute cost forever after.

The 2026 MoE family. Each model pair is total parameters (grey, taller) vs active parameters per token (accent, much shorter). DeepSeek-V3’s 18× ratio is the most aggressive published; Mixtral’s original ~3.6× is the conservative end.

The framing that matters: a 37B-active DeepSeek-V3 query runs at roughly the compute cost of a 37B dense model, but the quality benchmarks match a model trained with 671B-worth of capacity. That’s the fundamental economic shift that made open-weight frontier models viable in 2025. Without MoE, DeepSeek would plausibly have needed to train (and serve) a dense model in the 200-300B range to reach the same quality, a budget that few if any open-weight labs could have sustained.
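The ratios quoted in this section are simple arithmetic, sketched below. The parameter counts are the publicly reported figures for the two models whose ratios are cited here; the dictionary and function names are illustrative only.

```python
# Total vs. active parameters, in billions (publicly reported figures;
# the names below are illustrative, not from any library).
MODELS = {
    "Mixtral 8x7B": (46.7, 12.9),
    "DeepSeek-V3": (671.0, 37.0),
}

def sparsity_ratio(total_b: float, active_b: float) -> float:
    """Parameters stored per parameter actually used for a token."""
    return total_b / active_b

for name, (total_b, active_b) in MODELS.items():
    ratio = sparsity_ratio(total_b, active_b)
    print(f"{name}: {total_b:g}B total, {active_b:g}B active, {ratio:.1f}x")
```

The same ratio is a first-order estimate of how much larger the weight footprint is than the per-token compute would suggest.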

How MoE actually works at inference

The architectural picture: each MoE layer has a router (typically a single linear projection, far smaller than any expert) that takes the token’s hidden state and outputs scores for N experts. The top-K experts are selected; the token is processed by each selected expert; the outputs are weighted by the router scores and summed. Production MoEs range from K=1 routed expert plus an always-on shared expert (Llama 4) through K=2 (Mixtral) to K=8 (DeepSeek-V3).

The MoE routing path for a single token at a single layer. Only K of the N experts are used; the rest do no compute. This is the source of the active-vs-total parameter gap.

The interesting part — and the part most architecture explanations gloss over — is that the router decision is per token, per layer. For a 32-layer model with K=8, every token makes 32 separate routing decisions and activates 32 × 8 = 256 expert subnetworks in total. The parallelism opportunities and the load-balancing challenges both stem from this fact.
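The routing path described above can be sketched in a few lines of plain Python. This is a toy illustration, not any framework's implementation: it assumes Mixtral-style gating (softmax over the selected experts' logits), and the function names and toy experts are made up for the example.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k):
    """Return [(expert_id, gate)] for the k highest-scoring experts."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)[:k]
    gates = softmax([router_logits[i] for i in ranked])  # renormalise over the chosen k
    return list(zip(ranked, gates))

def moe_layer(hidden, router_logits, experts, k):
    """Gate-weighted sum of the selected experts' outputs; the rest never run."""
    out = [0.0] * len(hidden)
    for expert_id, gate in route_top_k(router_logits, k):
        expert_out = experts[expert_id](hidden)
        out = [o + gate * v for o, v in zip(out, expert_out)]
    return out

# Toy experts: expert i just scales the hidden state by (i + 1).
experts = [lambda h, s=i + 1: [s * v for v in h] for i in range(4)]
logits = [0.1, 2.0, -1.0, 2.0]  # in a real model: a linear projection of the hidden state
print(route_top_k(logits, k=2))
print(moe_layer([1.0, 2.0], logits, experts, k=2))

# Per-token totals for the 32-layer, K=8 example in the text.
layers, k = 32, 8
print(layers, "routing decisions,", layers * k, "expert activations")
```

Because the loop only touches the selected experts, compute scales with K rather than N, which is the active-vs-total gap in miniature.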

DeepSeek-V3’s specific design uses 256 routed experts plus 1 shared expert per MoE layer. The shared expert always runs (so common patterns aren’t routed redundantly); the 8 selected routed experts contribute the specialization. Each token therefore passes through 9 experts per layer, 8 of them specialized picks out of 256, which is what enables the model to behave like something much larger than 37B active parameters when the routing distributes well.

The shared-expert trick (introduced in DeepSeekMoE before V3) is one of the under-appreciated architectural details that made the “fine-grained, many experts” approach work. Without a shared expert, each routed expert has to handle both common patterns and its specialization, which dilutes the specialization. With a shared expert handling the common case, each routed expert is free to focus narrowly. The cost is a small amount of always-on compute; the benefit is that you can have 256 experts that actually behave differently from each other.
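A minimal sketch of how the shared expert changes the layer's output, assuming the common formulation in which the shared expert's output is simply added to the gate-weighted routed outputs. The function name, the toy experts, and the router output are all hypothetical.

```python
def shared_plus_routed(hidden, shared_expert, routed_experts, selected):
    """Always-on shared expert plus the gate-weighted routed experts.

    `selected` is the router's choice for this token: [(expert_id, gate), ...].
    """
    out = list(shared_expert(hidden))  # runs for every token, no routing involved
    for expert_id, gate in selected:
        expert_out = routed_experts[expert_id](hidden)
        out = [o + gate * v for o, v in zip(out, expert_out)]
    return out

# Toy stand-ins: the shared expert is the identity, routed expert i adds i.
shared = lambda h: h
routed = [lambda h, b=i: [v + b for v in h] for i in range(256)]
selected = [(3, 0.5), (200, 0.5)]  # hypothetical router output
print(shared_plus_routed([1.0, 1.0], shared, routed, selected))
```

The shared path carries no gate and no routing decision, which is why it adds a fixed compute cost per token regardless of how the routed experts are balanced.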

The serving challenges, and what vLLM/SGLang do about them

Serving a dense 70B model is well-understood by 2026: tensor-parallel it across 4-8 GPUs, batch requests for throughput, paged-KV-cache the context. Serving an MoE is harder for three reasons:

Problem 1: where do the experts live? A 671B-total model doesn’t fit on any single GPU. You can tensor-parallel each expert’s MLP across GPUs (treating each expert like a dense layer), but that’s wasteful because most experts are idle most of the time. The right answer is expert parallelism (EP) — different GPUs host different experts.

Both vLLM and SGLang shipped first-class EP in 2025. The mechanics: at routing time, each token’s hidden state, tagged with the expert IDs chosen for it at that layer, is sent to whichever GPUs host those experts (via all-to-all communication). Each expert computes its part. The results are sent back to the GPU that holds the token’s attention state and combined there. The all-to-all is the new bottleneck, and handling it well is what distinguishes a good MoE serving system from a bad one.
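The dispatch-and-combine pattern can be simulated without any GPUs. The sketch below is not vLLM's or SGLang's code: it assumes a contiguous expert placement (each rank owns a block of consecutive expert IDs), uses Python lists where a real system would move tensors, and every name in it is illustrative.

```python
from collections import defaultdict

def owner_rank(expert_id, experts_per_rank):
    """Contiguous placement: rank r hosts experts [r * E, (r + 1) * E)."""
    return expert_id // experts_per_rank

def dispatch(assignments, experts_per_rank):
    """Group (token_id, expert_id) pairs by the rank hosting the expert.

    Stands in for the outbound all-to-all; a real system sends hidden-state
    tensors over the interconnect instead of building Python lists.
    """
    per_rank = defaultdict(list)
    for token_id, expert_id in assignments:
        per_rank[owner_rank(expert_id, experts_per_rank)].append((token_id, expert_id))
    return dict(per_rank)

def combine(per_rank_outputs, num_tokens):
    """Sum the expert outputs returned for each token (the return all-to-all)."""
    combined = [0.0] * num_tokens
    for outputs in per_rank_outputs.values():
        for token_id, value in outputs:
            combined[token_id] += value
    return combined

assignments = [(0, 0), (0, 5), (1, 5), (1, 2)]  # (token_id, expert_id) after top-2 routing
sent = dispatch(assignments, experts_per_rank=4)
print(sent)
# Pretend each expert returns its own id as a scalar "output".
returned = {rank: [(t, float(e)) for t, e in pairs] for rank, pairs in sent.items()}
print(combine(returned, num_tokens=2))
```

Even in this toy form, the per-rank list lengths show why balance matters: the rank with the longest list determines when the layer finishes.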

Problem 2: routing imbalance. The router is trained to distribute tokens across experts, but the distribution is approximate. In a real batch, some experts get many more tokens than others; the extreme form, where a handful of experts absorb most of the traffic, is usually called “routing collapse” or “expert collapse.” The cost is that the GPUs hosting popular experts become the latency bottleneck while the GPUs hosting unpopular experts sit idle.
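One simple way to put a number on imbalance is the ratio of the busiest expert's load to the mean load. The metric choice and the sample batch below are assumptions made for illustration, not something taken from either serving framework.

```python
from collections import Counter

def expert_load(expert_ids, num_experts):
    """Tokens received by each expert in one batch at one layer."""
    counts = Counter(expert_ids)
    return [counts[e] for e in range(num_experts)]

def imbalance(load):
    """Max load over mean load; 1.0 means perfectly balanced."""
    mean = sum(load) / len(load)
    return max(load) / mean if mean else 0.0

batch = [0, 0, 0, 0, 0, 1, 2, 3]  # hypothetical routing result: expert 0 is hot
load = expert_load(batch, num_experts=4)
print(load, imbalance(load))
```

A value of 2.5 here means the GPU hosting expert 0 does two and a half times the average work while the others wait.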

DeepSeek-V3’s auxiliary-loss-free load balancing is the most elegant solution published so far. Instead of relying on a heavy auxiliary loss term to penalise imbalance during training (which hurts quality), they add a per-expert bias to the routing scores. The bias is not trained by gradient descent: after each training step it is nudged down for overloaded experts and up for underloaded ones, and it affects only which experts are selected, not the gate weights applied to their outputs. That pushes traffic toward underloaded experts without distorting the gradient signal. The result: expert utilisation stays balanced with little or no quality loss.
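A rough sketch of the bias mechanism, under the assumptions that the bias is used only for expert selection, that gate weights are the unbiased scores normalised over the selected experts, and that the update is a fixed-size nudge. The function names, the step size, and the sample numbers are hypothetical.

```python
def select_experts(scores, bias, k):
    """Pick top-k by biased score; gate weights use the unbiased scores."""
    ranked = sorted(range(len(scores)),
                    key=lambda i: scores[i] + bias[i], reverse=True)[:k]
    total = sum(scores[i] for i in ranked)
    return [(i, scores[i] / total) for i in ranked]

def update_bias(bias, load, step=0.001):
    """Nudge overloaded experts down and underloaded experts up by a fixed step."""
    mean = sum(load) / len(load)
    updated = []
    for b, n in zip(bias, load):
        if n > mean:
            updated.append(b - step)
        elif n < mean:
            updated.append(b + step)
        else:
            updated.append(b)
    return updated

scores = [0.6, 0.5, 0.4]  # router affinities for one token
bias = [0.0, 0.0, 0.3]    # expert 2 has been underloaded, so its bias has grown
print(select_experts(scores, bias, k=2))           # expert 2 now makes the cut
print(update_bias([0.0, 0.0, 0.0], [10, 2, 0], step=0.1))
```

Expert 2 is selected because of its bias, but its contribution is still weighted by its raw score of 0.4, so the balancing pressure changes which experts run without inflating their influence on the output.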

At serving time, vLLM supports synthetic routing strategies (e.g., VLLM_MOE_ROUTING_SIMULATION_STRATEGY=uniform_random) for testing under controlled imbalance, and SGLang’s MoE scheduler reorders the batch to even out per-expert workload before dispatching.

Problem 3: KV cache + expert affinity. For conversational workloads, you want to route a user’s subsequent requests to the worker that already has their KV cache. But in an EP deployment, “the worker” is split across multiple GPUs, each owning different experts. You can’t just send the whole request to one node.

The current best answer (the pattern used in large DeepSeek-V3 deployments built on vLLM and SGLang) is hybrid: keep KV cache locality at the attention layer (route the user’s request to the same attention-parallel group), but accept that the expert-parallel layer will all-to-all across the whole cluster. The cost is the extra all-to-all hop; the benefit is that KV cache stays where it needs to be.

The all-to-all latency is what makes the difference between a research-paper MoE and a production-quality one. DeepEP, the communication library open-sourced by DeepSeek and integrated into SGLang, and vLLM’s analogous backends are optimized for the specific pattern that MoE all-to-all produces: many small messages exchanged between expert-parallel ranks under strict latency requirements. As a rough guide, a naive PyTorch all-to-all can double the per-token latency; the optimized backends bring it back to within about 20-30% of the dense baseline.

What the benchmarks say

The numbers that matter from SGLang’s published DeepSeek-V3 deployment report:

52,300 input tokens/sec per node, 22,300 output tokens/sec per node with prefill-decode disaggregation on 96 H100s.
~29% higher throughput than vLLM on the same hardware for DeepSeek-V3 specifically (vLLM has narrowed this gap on subsequent releases).
Llama 4 Maverick (400B / 17B active) serves at roughly 60-80% of the throughput of Llama 3.1 70B dense on the same hardware. That is a worse ratio than DeepSeek-V3, plausibly because Llama 4 Maverick has 128 routed experts per MoE layer to DeepSeek’s 256, which leaves fewer experts over which to spread and balance load.

The trade-off that emerges: more, smaller experts (DeepSeek-style) are better for serving throughput because there is more opportunity for load balancing, but they are harder to train well. Fewer, larger experts (Mixtral-style) are easier to train but harder to serve efficiently. Llama 4 Maverick split the difference at 128 experts; DeepSeek went all-in at 256; Mixtral stayed at 8. The right answer is still open.

The 2026 consensus, such as it is: 64-128 experts is the safe production range. Below 16, you don’t get enough specialization benefit. Above 256, the routing overhead and the all-to-all communication start to dominate. Both DeepSeek-V3 (256 experts, slightly aggressive) and Llama 4 Maverick (128 experts, conservative) are within ~30% of each other on serving efficiency, so the choice within this range is more about training convenience than serving cost.

The variance problem nobody talks about

The latency story for MoE serving is bimodal in a way that dense serving isn’t. For a dense model, p50 and p99 latency differ by roughly 2-3x. For an MoE in production, the gap can be 5-10x, because on a bad routing draw a token hits an over-utilised expert and queues behind dozens of other tokens.

The published case studies suggest two mitigations: (1) token shuffling at the batch level, where the scheduler reorders incoming tokens to smooth per-expert load before dispatching; and (2) replicating popular experts across multiple GPUs, so the bottleneck expert isn’t actually one GPU. Both are implemented in SGLang’s MoE scheduler; vLLM’s implementation is catching up.

For a team deploying MoE in production for the first time, the practical advice is:

Watch p99 latency, not p50. Your average user sees the median; your power users see the tail. MoE serving makes the tail worse than you expect.
Replicate hot experts across at least 2 GPUs. The cost is a small amount of extra memory; the benefit is that no single GPU becomes a queue.
Stagger token dispatching by expert frequency. The straightforward “round-robin” scheduling is the worst case; affinity-aware scheduling is the production answer.
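The hot-expert replication advice can be illustrated with a greedy allocation: hand each spare replica to whichever expert currently has the highest load per replica. This is one plausible policy sketched under that assumption, not the algorithm either framework ships, and the load numbers are invented.

```python
def replicate_hot_experts(load, extra_replicas):
    """Greedily give each spare replica to the expert with the highest per-replica load."""
    replicas = [1] * len(load)
    for _ in range(extra_replicas):
        hottest = max(range(len(load)), key=lambda e: load[e] / replicas[e])
        replicas[hottest] += 1
    return replicas

def worst_queue(load, replicas):
    """Tokens waiting at the busiest replica, assuming an even split across replicas."""
    return max(n / r for n, r in zip(load, replicas))

load = [120, 30, 30, 20]  # hypothetical tokens per expert in one batch
replicas = replicate_hot_experts(load, extra_replicas=2)
print(replicas)
print(worst_queue(load, [1, 1, 1, 1]), "->", worst_queue(load, replicas))
```

Two extra copies of the hot expert cut the longest queue from 120 tokens to 40, at the memory cost of two additional expert weight sets.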

The memory math that bites every team

The economic claim of MoE (pay inference cost proportional to active parameters) is true for compute and false for memory. A 671B-total DeepSeek-V3 needs memory for all 671B parameters loaded across the cluster, even though only 37B participate in processing any given token. At 2 bytes per parameter (fp16/bf16), that’s about 1.34 TB of GPU memory just to hold the model: at least 17 H100s at 80 GB each, before you’ve allocated any KV cache or activations.
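The weight-memory arithmetic is easy to reproduce. The sketch below counts weights only (no KV cache, activations, or framework overhead) and assumes decimal gigabytes and 80 GB of usable memory per GPU; the function names are illustrative.

```python
import math

def weight_memory_gb(params_billion, bytes_per_param):
    """Weights only: (params_billion * 1e9 params) * bytes / (1e9 bytes per GB)."""
    return params_billion * bytes_per_param

def min_gpus(params_billion, bytes_per_param, gpu_memory_gb=80):
    """Smallest whole number of GPUs whose combined memory holds the weights."""
    return math.ceil(weight_memory_gb(params_billion, bytes_per_param) / gpu_memory_gb)

for label, params_b in [("total (671B)", 671), ("active (37B)", 37)]:
    gb = weight_memory_gb(params_b, bytes_per_param=2)  # fp16/bf16
    print(f"{label}: {gb} GB, at least {min_gpus(params_b, 2)} x 80 GB GPUs")
```

Rounding up to whole GPUs gives 17 for the full model at 2 bytes per parameter, against a single GPU for a dense 37B model. That gap is the difference between the compute budget and the memory budget.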

This is the cost framing that makes MoE economics work out in production: serving cost is roughly proportional to active params for compute but proportional to total params for memory, and training has to hold (and communicate across) all of the total params as well. The win is real but narrower than the simplest version of the pitch suggests. For a cluster operator, an MoE saves on FLOPs (and therefore on power and on per-query inference cost) but doesn’t save on the GPU count needed to host the weights.

The teams that hit this wall hardest are the ones who budgeted for “37B-equivalent inference cost” without realizing the GPU fleet would still be sized for 671B-equivalent memory. Plan for both.

What to take away

MoE is the production reality of 2026, not a research curiosity. The template is established:

Frontier models will be MoE. GPT-5, Claude 4, Llama 5: none of these has a published architecture, but leaks and analyst predictions consistently point to MoE with active-parameter ratios around 5-15%.
The serving infrastructure has caught up. vLLM and SGLang both have production-quality MoE support; the open question is whose load balancing wins at the cluster scale.
The economic logic is unambiguous. A 671B-quality model at 37B-inference cost is not a thing dense architectures can do. The training-cost premium (you still need GPUs for all those parameters during training) is paid back over the lifetime of inference.
Production MoE is not just “swap in the model.” Expert parallelism, routing imbalance, KV-cache affinity, p99 latency — these are real production concerns that didn’t exist for dense models. If your team is moving from dense to MoE serving for the first time, budget engineering time for them.

The Switch Transformer paper’s claim was that sparse activation would be the path to trillion-parameter models. Five years later it is, and the trillion-parameter models are open-weight. The technique that looked like a research curiosity in 2021 is the architecture that made open-weight frontier AI possible — and the production reality that everyone who serves models has had to figure out, fast.

Further reading: the original Switch Transformer paper, Mixtral of Experts, the DeepSeek-V3 technical report, and the Llama 4 release notes. For the serving side, see vLLM’s expert parallel deployment docs and SGLang’s writeup on MoE routing and expert parallelism.