Mixture of Experts
How do you get a trillion-parameter model that costs like a 13B one to run? Don't use all the parameters for every token. That's the MoE trick behind Mixtral and friends.
What you'll learn
- Why dense transformers hit a wall: adding parameters always adds compute
- How a router picks 2 of 8 experts per token, and why that skips 75% of the feed-forward compute (a 4x cut)
- The real costs of MoE: memory, load-balance failures, and training instability
Before you start
In late 2023, Mistral dropped a model called Mixtral 8x7B. Benchmarks showed it trading blows with Llama 2 70B, yet its inference cost was closer to that of a 13B model, one roughly a fifth of Llama's size. Reviewers were confused: the model card said 46.7B total parameters, but inference logs showed only ~13B active per token. Which number was real?
Both were. Mixtral used a technique called Mixture of Experts (MoE), and understanding it requires rethinking what “model size” even means.
The wall that dense models hit
A standard transformer feed-forward layer takes every single token through
the same set of weights on every single forward pass. If the layer has N
parameters, every token pays the compute cost of N parameters — no exceptions.
This creates a trap. You want more knowledge capacity (more parameters), but every parameter you add costs compute at inference. A 70B dense model is roughly 5x more expensive to run than a 13B model, with no opt-out.
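To put rough numbers on that trap, here is a minimal sketch. It assumes the common rule of thumb that a dense forward pass costs about 2 FLOPs per parameter per token (one multiply and one add), ignoring attention over the context; the function name is illustrative.

```python
def dense_flops_per_token(n_params: float) -> float:
    """Rule-of-thumb forward cost: about 2 FLOPs per parameter per token (assumption)."""
    return 2.0 * n_params

small = dense_flops_per_token(13e9)   # 13B dense model
large = dense_flops_per_token(70e9)   # 70B dense model
ratio = large / small

print(f"13B dense: {small:.1e} FLOPs per token")
print(f"70B dense: {large:.1e} FLOPs per token")
print(f"cost ratio: {ratio:.1f}x")  # 70 / 13, a bit over 5x
```

In a dense model the ratio of costs is simply the ratio of parameter counts, which is the coupling MoE tries to break.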
The question MoE answers: can you add parameters without adding proportional compute?
The MoE layer: experts plus a router
MoE replaces the single dense feed-forward network with two components:
- Experts: E separate feed-forward networks, each with its own weights. Mixtral uses E = 8.
- Router (also called a gating network): a small learned linear layer that looks at each token’s embedding and outputs a probability score for each expert. The top k experts by score are selected; the rest are skipped entirely. Mixtral uses k = 2.
For each token, the router picks 2 experts. The token passes through those 2 networks only. The outputs are weighted by the router scores and summed. All other 6 experts receive zero compute for that token.
Figure: the router sends each token to exactly 2 of 8 experts (highlighted in the diagram). The other 6 are skipped entirely, saving 75% of feed-forward compute per token.
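The routing step can be sketched in plain Python. This is a toy illustration under stated assumptions, not Mixtral's implementation: the experts are stand-in functions that just scale the token, the router weights are made up, and the two selected scores are renormalized so they sum to 1. All names are hypothetical.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def router_logits(token, router_weights):
    """The router is a single linear layer: one weight row (one score) per expert."""
    return [sum(w * x for w, x in zip(row, token)) for row in router_weights]

def top_k_route(token, router_weights, k=2):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts."""
    probs = softmax(router_logits(token, router_weights))
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept = sum(probs[i] for i in chosen)
    return [(i, probs[i] / kept) for i in chosen]  # renormalize over the chosen k

def moe_forward(token, router_weights, experts, k=2):
    """Run only the selected experts and sum their outputs, weighted by the router."""
    out = [0.0] * len(token)
    for index, weight in top_k_route(token, router_weights, k):
        expert_out = experts[index](token)  # unselected experts are never called
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out

calls = []  # records which experts actually ran

def make_expert(i):
    """Hypothetical toy expert: scales the token by (i + 1)."""
    def expert(token):
        calls.append(i)
        return [(i + 1) * x for x in token]
    return expert

experts = [make_expert(i) for i in range(8)]
# Made-up router weights: 8 experts, token dimension 2
router_weights = [[0.1, 0.0], [2.0, 0.0], [0.0, 0.5], [0.3, 0.0],
                  [0.5, -0.5], [0.0, 0.0], [0.5, 0.5], [0.2, 0.0]]
token = [1.0, -2.0]

routes = top_k_route(token, router_weights)
output = moe_forward(token, router_weights, experts)
print("selected experts and weights:", routes)
print("experts that ran:", calls)
```

Only two of the eight expert functions are ever invoked for this token, which is the whole point: the skipped experts cost nothing at compute time.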
The arithmetic behind Mixtral 8x7B
The name suggests 8 experts of 7B parameters each. Take that at face value, and assume for simplicity that the non-expert weights (attention, embeddings, norms), which are shared across all tokens, add about 7B more. Then:
Total params = 8 experts x 7B + 7B shared = 63B
Active params = 2 experts x 7B + 7B shared = 21B ... wait, why does Mistral report 46.7B total and ~13B active?
Because neither 7B figure is accurate. The “7B” in the name refers to the Mistral 7B architecture the model is built on, not to the size of one expert. Only the feed-forward blocks are replicated 8 times; attention, embeddings, and norms exist once. Summed across all 32 layers, one expert holds about 5.6B parameters and the shared weights come to about 1.6B. That gives 8 x 5.6B + 1.6B ≈ 46.7B in total and 2 x 5.6B + 1.6B ≈ 12.9B active per token. So the “~13B active” figure is a plain parameter count, and because per-token compute scales with the parameters a token actually touches, the cost lands near that of a 13B dense model. The feed-forward portion (the part you’re sparsifying) holds the large majority of the parameters; activating only 2 of 8 expert blocks skips 6/8 = 75% of it.
The playground prints:
- Total parameters: 63.0B (the simplified model; Mistral’s exact tally is 46.7B because both the per-expert blocks and the shared weights are smaller than a full 7B)
- Active per token: 21.0B
- Experts skipped: 6 of 8
- Feed-forward saved: 75%
The real Mixtral numbers are lower, at 46.7B total and ~13B active, but the ratio logic is identical.
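To see where Mistral's 46.7B and ~13B come from, the count can be rebuilt from layer shapes. The hyperparameters below are assumptions taken from Mixtral's published configuration, not from this lesson, and the breakdown (three matrices per gated feed-forward expert, grouped-query attention, untied input and output embeddings) is a sketch of the standard layout rather than an audited tally.

```python
# Assumed Mixtral 8x7B hyperparameters (from the published config, not this lesson)
DIM = 4096         # model (embedding) width
N_LAYERS = 32
FFN_DIM = 14336    # hidden width of each expert feed-forward block
N_EXPERTS = 8
TOP_K = 2
N_HEADS = 32
N_KV_HEADS = 8     # grouped-query attention: fewer key/value heads
HEAD_DIM = 128
VOCAB = 32000

# Each expert is a gated feed-forward block: three DIM x FFN_DIM matrices.
expert_per_layer = 3 * DIM * FFN_DIM
# Attention: query and output projections are full width; key and value are narrower.
attn_per_layer = 2 * DIM * (N_HEADS * HEAD_DIM) + 2 * DIM * (N_KV_HEADS * HEAD_DIM)
router_per_layer = DIM * N_EXPERTS   # the router really is tiny
norms_per_layer = 2 * DIM

# Shared weights: everything every token passes through.
shared = (N_LAYERS * (attn_per_layer + router_per_layer + norms_per_layer)
          + 2 * VOCAB * DIM   # input embedding and output projection
          + DIM)              # final norm
total = shared + N_LAYERS * N_EXPERTS * expert_per_layer
active = shared + N_LAYERS * TOP_K * expert_per_layer

print(f"one expert, all layers: {N_LAYERS * expert_per_layer / 1e9:.1f}B")  # 5.6B
print(f"shared weights:         {shared / 1e9:.1f}B")                       # 1.6B
print(f"total parameters:       {total / 1e9:.1f}B")                        # 46.7B
print(f"active per token:       {active / 1e9:.1f}B")                       # 12.9B
```

The shared weights turn out to be a small slice of the model, so nearly all of the 46.7B sits in the experts, and that is why skipping 6 of 8 experts removes so much of the per-token cost.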
Why the router works (specialization)
The router is trained end-to-end with the rest of the model. It learns, through gradient descent, which experts handle which kinds of tokens best. In practice, experts develop soft specializations that are emergent, not hand-designed. They are not necessarily the tidy topic split you might expect (one expert for code, another for prose): Mistral's own routing analysis of Mixtral found that expert choice tracked syntax more closely than subject matter. The model settles on whatever division of labor minimizes loss.
The real costs
MoE is not a free lunch.
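Memory. All experts must be loaded into GPU memory even though only 2 are active per token. At fp16, Mixtral's weights alone take about 87 GiB (46.7B parameters x 2 bytes), too large for a single A100 (80 GB) even before counting the KV cache. A 13B dense model, at about 24 GiB, fits comfortably. Unquantized inference therefore usually means splitting the model across two or more GPUs, typically with tensor parallelism.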
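The memory arithmetic is short enough to check directly. This sketch counts weights only (no KV cache, activations, or framework overhead), and the function name is illustrative.

```python
def weight_memory_gib(n_params: float, bytes_per_param: int = 2) -> float:
    """Memory for the weights alone, in GiB; fp16 uses 2 bytes per parameter."""
    return n_params * bytes_per_param / 2**30

mixtral_gib = weight_memory_gib(46.7e9)  # every expert must be resident
dense_gib = weight_memory_gib(13e9)      # a dense model with similar compute
gpu_gib = 80                             # one 80 GB A100

print(f"Mixtral 8x7B at fp16: {mixtral_gib:.0f} GiB")  # 87 GiB
print(f"13B dense at fp16:    {dense_gib:.0f} GiB")    # 24 GiB
print(f"Mixtral fits on one GPU: {mixtral_gib < gpu_gib}")
```

Memory is paid on total parameters while compute is paid on active parameters, so the two models cost about the same per token but differ by more than 3x in footprint.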
Load-balance collapse. If the router learns to always prefer one or two experts, those experts get overtrained and the rest become useless. Training uses an auxiliary load-balance loss that penalizes the router if expert utilization is skewed. Getting this right is non-trivial.
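One common form of that auxiliary loss comes from the Switch Transformer line of work: for each expert, multiply the fraction of tokens routed to it by the mean router probability it receives, sum over experts, and scale. The sketch below assumes that formulation; the lesson does not say which variant Mixtral uses, and the coefficient `alpha = 0.01` is only an illustrative value.

```python
def load_balance_loss(router_probs, assignments, n_experts, alpha=0.01):
    """Switch-Transformer-style auxiliary loss (assumed formulation).

    router_probs: per-token list of router probabilities over all experts
    assignments:  per-token list of the expert indices actually chosen
    """
    n_tokens = len(router_probs)
    n_slots = sum(len(chosen) for chosen in assignments)
    # f[i]: fraction of routing slots that went to expert i
    f = [0.0] * n_experts
    for chosen in assignments:
        for i in chosen:
            f[i] += 1.0 / n_slots
    # p[i]: mean router probability assigned to expert i
    p = [sum(tok[i] for tok in router_probs) / n_tokens for i in range(n_experts)]
    return alpha * n_experts * sum(fi * pi for fi, pi in zip(f, p))

# Toy batch: 4 tokens, 4 experts, top-2 routing
balanced = load_balance_loss(
    router_probs=[[0.25, 0.25, 0.25, 0.25]] * 4,
    assignments=[[0, 1], [2, 3], [0, 1], [2, 3]],
    n_experts=4,
)
collapsed = load_balance_loss(
    router_probs=[[0.7, 0.3, 0.0, 0.0]] * 4,
    assignments=[[0, 1]] * 4,
    n_experts=4,
)
print(f"balanced routing:  {balanced:.3f}")   # 0.010
print(f"collapsed routing: {collapsed:.3f}")  # 0.020
```

With perfectly even routing the loss equals `alpha`; routing that piles onto a few experts pushes it higher, which gives the optimizer a reason to spread tokens out.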
Training instability. Sparse routing introduces discrete routing decisions. The gradient path through the router is noisy. MoE models are harder to train than equivalent dense models and often require careful tuning of the load-balance loss coefficient.
Why frontier labs adopted MoE
DeepSeek-V3 uses a finer-grained MoE with 256 routed experts and k = 8, so each
token touches an even smaller fraction of the experts (DeepSeek-V2 used 160
routed experts with k = 6). Llama 4 (April 2025) ships
Scout (17B active of 109B total) and Maverick (17B active of 400B total), both
MoE. GPT-4 is widely believed to be a MoE model based on latency and cost
analysis, though OpenAI has not confirmed. The pattern is consistent: frontier
quality at sub-frontier inference cost.
The underlying bet is that knowledge capacity (more total parameters) and compute cost (active parameters) can be decoupled — and that the router can learn to route well enough to make the decoupling worth the memory and training complexity.
Next
Scaling laws — why bigger models trained on more tokens keep getting better, and where MoE fits in the compute-optimal frontier.