Mixture of Experts
Frontier open models are huge on paper but cheap to run, because most of their parameters sit idle on any given token. This lesson shows how sparse Mixture-of-Experts buys capacity at near-fixed compute.
What you'll learn
- How a router sends each token to its top-k experts instead of one dense FFN
- Why MoE scales parameters (capacity) without scaling per-token compute
- The load-balancing problem and why MoE training adds an auxiliary loss
Before you start
Here’s a riddle from the 2026 model landscape: a model can have hundreds of billions of parameters yet cost about the same to run as a model a fraction of its size. The trick is sparsity — on any given token, most of the model doesn’t fire. The mechanism is Mixture-of-Experts (MoE), and it’s why models like Mixtral, DeepSeek, and Qwen-MoE can be enormous on paper and still affordable to serve.
One big FFN → many small experts
Recall from the transformer block that the feed-forward (FFN) sublayer holds most of a transformer’s parameters and runs on every token. MoE replaces that single FFN with N smaller expert FFNs plus a tiny router. For each token, the router scores the experts and sends the token only to its top-k (usually k=1 or 2). The other experts are skipped entirely for that token.
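So the model stores all N experts, but any single token only computes k of them. Raise N with k fixed and the stored parameter count climbs while per-token compute stays flat; raise k with N fixed and per-token compute climbs while the parameter count stays put.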
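The routing step described above can be sketched in a few lines of plain Python. This is a toy illustration, not any particular model's implementation: the names `route_top_k` and `moe_layer` are hypothetical, the experts are stand-in functions rather than real FFNs, and renormalizing the gate weights over the selected experts is one common choice assumed here.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k):
    """Return [(expert_index, gate_weight), ...] for the k highest-scoring experts."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Assumption: gate weights are renormalized so the selected k sum to 1
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def moe_layer(token, router_logits, experts, k=2):
    """Weighted sum of the outputs of the selected experts; the rest never run."""
    out = [0.0] * len(token)
    for idx, gate in route_top_k(router_logits, k):
        expert_out = experts[idx](token)
        out = [o + gate * e for o, e in zip(out, expert_out)]
    return out

# Toy experts (hypothetical): expert i just scales its input by (i + 1)
experts = [lambda x, s=i + 1: [s * v for v in x] for i in range(4)]

# In a real model these scores come from a small learned linear layer
logits = [0.1, 2.0, -1.0, 2.0]
selected = route_top_k(logits, k=2)
y = moe_layer([1.0, 2.0], logits, experts, k=2)
print(selected)
print(y)
```

With these logits, experts 1 and 3 tie for the top score and each receives a gate weight of 0.5; experts 0 and 2 are never called for this token, which is where the compute saving comes from.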
The bargain: capacity scales with N, compute with k
That’s the whole point, stated as a tradeoff:
- Parameters (capacity, “knowledge”) grow with the number of experts N.
- Compute per token (FLOPs, cost) grows only with k, the experts actually used.
A model with 8 experts and top-2 routing has roughly the parameter count of 8 FFNs but the per-token compute of 2. You get a much larger, more capable model without paying to run all of it on every token. This is why “total parameters” and “active parameters” are now reported separately — a model might be “236B total, 21B active.”
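To make the total-versus-active split concrete, here is a back-of-the-envelope count for a single MoE FFN layer. It is a rough sketch under stated assumptions: each expert is a plain two-matrix FFN with biases ignored, the router is one linear layer, and the dimensions are illustrative values rather than any specific model's. Attention and embedding parameters are left out because every token uses them regardless of routing.

```python
def ffn_params(d_model, d_hidden):
    # Up-projection plus down-projection weight matrices; biases ignored
    return 2 * d_model * d_hidden

def moe_ffn_params(d_model, d_hidden, n_experts, top_k):
    """Return (total, active) parameter counts for one MoE FFN layer."""
    per_expert = ffn_params(d_model, d_hidden)
    router = d_model * n_experts  # one score per expert for each token
    total = n_experts * per_expert + router   # what must be stored
    active = top_k * per_expert + router      # what one token actually uses
    return total, active

# Illustrative dimensions (assumed, not taken from a particular model)
total, active = moe_ffn_params(d_model=4096, d_hidden=14336, n_experts=8, top_k=2)
print(f"total FFN params per layer:  {total:,}")
print(f"active FFN params per token: {active:,}")
print(f"total / active: {total / active:.2f}x")
```

With 8 experts and top-2 routing the ratio comes out at roughly 4x for the FFN layer alone. The whole-model ratio is smaller, since the shared attention and embedding weights count toward both totals.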
The catch: load balancing
A naive router has a failure mode: it learns to love a few experts and send everything to them, while other experts get no tokens, no gradient, and never learn. The collapse is self-reinforcing, because experts that receive more tokens improve faster, which makes the router favor them even more. That wastes most of the model’s capacity.
The standard fix is an auxiliary load-balancing loss added during training that penalizes imbalance, nudging the router to spread tokens evenly across experts.
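One way to put a number on the imbalance is a loss of the form N · Σ fᵢ · Pᵢ, where fᵢ is the fraction of tokens dispatched to expert i and Pᵢ is the mean router probability for expert i. This follows the style of the Switch Transformer auxiliary loss; other formulations exist, and the function name and toy inputs below are hypothetical. The sketch assumes top-1 routing and omits the small coefficient that usually scales this term before it is added to the training loss.

```python
def load_balance_loss(router_probs, assignments, n_experts):
    """
    router_probs: one probability list (length n_experts) per token
    assignments:  the expert index each token was sent to (top-1 routing assumed)
    Returns n_experts * sum_i(f_i * P_i); equals 1.0 when perfectly balanced.
    """
    n_tokens = len(assignments)
    # f_i: fraction of tokens dispatched to expert i
    f = [0.0] * n_experts
    for a in assignments:
        f[a] += 1.0 / n_tokens
    # P_i: mean router probability assigned to expert i
    p = [sum(tok[i] for tok in router_probs) / n_tokens for i in range(n_experts)]
    return n_experts * sum(fi * pi for fi, pi in zip(f, p))

# Balanced case: uniform router, one token per expert
balanced = load_balance_loss(
    router_probs=[[0.25, 0.25, 0.25, 0.25]] * 4,
    assignments=[0, 1, 2, 3],
    n_experts=4,
)

# Collapsed case: the router sends every token to expert 0
collapsed = load_balance_loss(
    router_probs=[[0.97, 0.01, 0.01, 0.01]] * 4,
    assignments=[0, 0, 0, 0],
    n_experts=4,
)

print(balanced, collapsed)
```

The balanced router scores 1.0, the minimum for this formulation, while the collapsed router scores about 3.88. Minimizing this term alongside the main loss is what pushes the router back toward an even spread.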
Next
That completes the Deep Learning track’s modern-architecture arc. From here, the Generative AI track takes these building blocks into production — serving, RAG, reasoning models, and evaluation.
Practice this in an interview
MoE replaces a dense feed-forward layer with many expert subnetworks plus a gating router that activates only a few experts per token. This grows total parameter count and capacity while keeping per-token compute roughly constant, since only a sparse subset of experts runs for any given token.
Work top-down: start at the model layer with quantization, distillation, or routing cheaper models for easy requests, since model choices drive every downstream cost. Then optimize the runtime with batching, caching, and techniques like prompt caching for LLMs, and finally match infrastructure to the load using autoscaling on queue depth and spot or batch capacity. Track cost per token or per prediction alongside latency percentiles and accuracy so optimizations never silently degrade quality.
Smaller models win on latency, inference cost, on-device deployment, and fine-tuning feasibility. When trained on high-quality, curated data and aligned for a narrow task, a 7B–13B model can match or exceed a general-purpose 70B+ model on that specific workload while using a fraction of the compute budget.
Cost scales with input plus output tokens; latency scales with output tokens and model size. The highest-leverage levers are: model routing (use a small model when the task is simple), prompt caching (reuse expensive prefix computation), output length control, and batching. Together these can cut spend 60–90% without quality regression.