What is Mixture of Experts (MoE) and how does it improve LLM scalability?

MoE replaces a dense feed-forward layer with many expert subnetworks plus a gating router that activates only a few experts per token. This grows total parameter count and capacity while keeping per-token compute roughly constant, since only a sparse subset of experts runs for any given token.

How would you reduce the cost of serving an ML or LLM model in production without hurting quality?

Work top-down: start at the model layer with quantization, distillation, or routing easy requests to cheaper models, since model choices drive every downstream cost. Then optimize the runtime with batching, caching, and techniques like prompt caching for LLMs, and finally match infrastructure to the load using autoscaling on queue depth and spot or batch capacity. Track cost per token or per prediction alongside latency percentiles and accuracy so optimizations never silently degrade quality.

What techniques reduce LLM cost and latency in production?

Cost scales with input plus output tokens; latency scales with output tokens and model size. The highest-leverage levers are: model routing (use a small model when the task is simple), prompt caching (reuse expensive prefix computation), output length control, and batching. Combined, these can cut spend substantially (60–90% in favorable workloads), provided quality is measured before and after each change.

Why are smaller language models (SLMs) sometimes preferable to larger ones?

Smaller models win on latency, inference cost, on-device deployment, and fine-tuning feasibility. When trained on high-quality, curated data and aligned for a narrow task, a 7B–13B model can match or exceed a general-purpose 70B+ model on that specific workload while using a fraction of the compute budget.

Mixture of Experts — Generative AI

In late 2023, Mistral dropped a model called Mixtral 8x7B. Benchmarks showed it trading blows with Llama 2 70B, a dense model that runs about five times as many parameters per token. The inference cost was closer to that of a 13B model. Reviewers were confused: the model card said 46.7B total parameters, but only ~13B were active per token. Which number was real?

Both were. Mixtral used a technique called Mixture of Experts (MoE), and understanding it requires rethinking what “model size” even means.

The wall that dense models hit

A standard transformer feed-forward layer takes every single token through the same set of weights on every single forward pass. If the layer has N parameters, every token pays the compute cost of N parameters — no exceptions.

This creates a trap. You want more knowledge capacity (more parameters), but every parameter you add costs compute at inference. A 70B dense model is roughly 5x more expensive to run than a 13B model, with no opt-out.
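To put a rough number on that trap, here is a minimal sketch. It assumes the common rule of thumb that a forward pass costs about 2 FLOPs per parameter per token; that rule and the helper name dense_flops_per_token are illustrative assumptions, and the cost of attention over long contexts is ignored.

```python
# Rough forward-pass cost of a dense model, assuming ~2 FLOPs per parameter
# per token (an approximation; ignores attention over long contexts).
def dense_flops_per_token(params_billion):
    return 2 * params_billion * 1e9

small = dense_flops_per_token(13)
large = dense_flops_per_token(70)

print("13B dense: " + format(small, ".2e") + " FLOPs/token")
print("70B dense: " + format(large, ".2e") + " FLOPs/token")
print("Ratio    : " + str(round(large / small, 1)) + "x")
```

Under this approximation the ratio comes out at about 5.4x, and in a dense model there is no way to pay for only part of it.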

The question MoE answers: can you add parameters without adding proportional compute?

The MoE layer: experts plus a router

MoE replaces the single dense feed-forward network with two components:

Experts — E separate feed-forward networks, each with its own weights. Mixtral uses E = 8.
Router (also called a gating network) — a small learned linear layer that looks at each token’s embedding and outputs a probability score for each expert. The top k experts by score are selected; the rest are skipped entirely. Mixtral uses k = 2.

For each token, the router picks 2 experts. The token passes through those 2 networks only. Their outputs are weighted by the router scores and summed. The other 6 experts receive zero compute for that token.

The router sends each token to exactly 2 of 8 experts (highlighted). The other 6 are skipped entirely, saving ~75% of feed-forward compute per token.
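The routing step described above can be sketched in a few lines of plain Python. This is a toy illustration, not Mixtral's implementation: the function names, the made-up router logits, and the trivial "experts" (each just scales its input) are all assumptions. The sketch also assumes the router weights are renormalized with a softmax over the selected experts only, so the k weights sum to 1.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_logits, k=2):
    """Return [(expert_index, weight), ...] for the top-k experts."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Renormalize over the selected experts only, so the k weights sum to 1.
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))

def moe_forward(token, router_logits, experts, k=2):
    """Weighted sum of the outputs of the k selected experts."""
    out = [0.0] * len(token)
    for idx, w in route_token(router_logits, k):
        y = experts[idx](token)  # only the selected experts ever run
        out = [o + w * yi for o, yi in zip(out, y)]
    return out

# Toy experts (hypothetical): expert i scales the token by (i + 1).
experts = [lambda x, s=i + 1: [s * v for v in x] for i in range(8)]
logits = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 0.2]  # made-up router scores

print(route_token(logits))                        # [(1, 0.5), (4, 0.5)]
print(moe_forward([1.0, -1.0], logits, experts))  # [3.5, -3.5]
```

Experts 1 and 4 tie for the highest score, so each gets weight 0.5 and the other six functions are never called, which is exactly where the compute saving comes from.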

The arithmetic behind Mixtral 8x7B

The name suggests eight experts of 7B parameters each, plus shared weights (attention, embeddings, norms) that every token uses. Take that at face value, with another 7B for the shared part, and you get:

Total params  = 8 experts x 7B + 7B shared  = 63B
Active params = 2 experts x 7B + 7B shared  = 21B  ... wait, why 13B?

Neither number matches what Mistral reports (46.7B total, ~13B active). The reason is that the “7B” in the name echoes the dense Mistral 7B architecture the model is built on, not the size of each expert. Only the feed-forward block is replicated eight times. Summed over all layers, one expert holds about 5.6B feed-forward parameters, and the shared weights come to only about 1.6B. The “~13B active” figure is therefore a plain parameter count: the weights a single token actually touches. Because forward-pass compute is roughly proportional to the parameters used, it is also a fair proxy for cost.

# Mixtral 8x7B: active vs total parameters (approximate sizes).
total_experts = 8
active_experts = 2
params_per_expert_B = 5.64  # billion, feed-forward only, summed over all layers
shared_B = 1.6              # attention + embeddings + router + norms

total_B = total_experts * params_per_expert_B + shared_B
active_B = active_experts * params_per_expert_B + shared_B

print("Total parameters  : " + str(round(total_B, 1)) + "B")
print("Active per token  : " + str(round(active_B, 1)) + "B")
print("Experts skipped   : " + str(total_experts - active_experts) + " of " + str(total_experts))
ff_saving_pct = (1 - active_experts / total_experts) * 100
print("Feed-forward saved: " + str(int(ff_saving_pct)) + "%")

Total parameters  : 46.7B
Active per token  : 12.9B
Experts skipped   : 6 of 8
Feed-forward saved: 75%

This arithmetic reproduces Mistral's figures: 8 × 5.64 + 1.6 ≈ 46.7B total and 2 × 5.64 + 1.6 ≈ 12.9B active. Notice how small the shared part is: the feed-forward experts account for about 97% of all parameters, so sparsifying them is where the savings are. Skip 6 of 8 experts and you drop 75% of the feed-forward compute, which is what lets a 47B-parameter model run at roughly a 13B price.
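The reported totals can also be reproduced from the model's shape. The sketch below assumes the values in Mixtral's public model configuration (hidden size 4096, feed-forward size 14336, 32 layers, 32 attention heads with 8 key/value heads, vocabulary 32000) and a SwiGLU feed-forward with three weight matrices per expert. Layer norms are left out as negligible, and the variable names are illustrative.

```python
# Mixtral 8x7B shape values, assumed from its public model configuration.
d_model = 4096        # hidden size
d_ff = 14336          # feed-forward inner size
n_layers = 32
n_experts = 8
top_k = 2
n_heads = 32
n_kv_heads = 8        # grouped-query attention
vocab = 32000

head_dim = d_model // n_heads

# SwiGLU feed-forward: three d_model x d_ff matrices per expert per layer.
expert = 3 * d_model * d_ff * n_layers
# Attention: Q and O are d_model x d_model; K and V are narrower under GQA.
attention = (2 * d_model * d_model + 2 * d_model * n_kv_heads * head_dim) * n_layers
router = d_model * n_experts * n_layers
embeddings = 2 * vocab * d_model           # input embedding + output head
shared = attention + router + embeddings   # norms omitted (negligible)

total = n_experts * expert + shared
active = top_k * expert + shared

print("Per expert: " + str(round(expert / 1e9, 2)) + "B")
print("Shared    : " + str(round(shared / 1e9, 2)) + "B")
print("Total     : " + str(round(total / 1e9, 1)) + "B")
print("Active    : " + str(round(active / 1e9, 1)) + "B")
```

This prints roughly 5.64B per expert, 1.61B shared, 46.7B total and 12.9B active, which shows that the published numbers are ordinary parameter counts and that nearly all of the weights live in the experts.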

Why the router works (specialization)

The router is trained end-to-end with the rest of the model. It learns, through gradient descent, which expert is better at handling which kinds of tokens. In practice, experts develop soft specializations: one might handle code tokens more than prose, another might be better at names or numbers. The split is often less tidy than that picture suggests; Mistral's own routing analysis of Mixtral found patterns that follow syntax more than topic. Either way, the specialization is emergent, not hand-designed. The model discovers whatever division of labor minimizes loss.

The real costs

MoE is not a free lunch.

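Memory. All experts must be loaded into GPU memory even though only 2 are active per token. At fp16, Mixtral's 46.7B weights alone take about 93 GB (roughly 87 GiB), too much for a single 80 GB A100 even before the KV cache is allocated. A 13B dense model (about 26 GB at fp16) fits comfortably. Unquantized inference usually requires splitting the model across two or more GPUs, for example with tensor parallelism.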
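A quick back-of-the-envelope check of the memory cost, as a sketch only. It counts weights alone (KV cache and activations come on top), uses Mistral's reported 46.7B total, and treats an "80 GB" card as 80 GiB, which is an assumption; the fp16 conclusion is the same if you read it as 80 decimal GB. The helper name weight_gib is hypothetical.

```python
def weight_gib(n_params, bytes_per_param):
    """Memory needed for the weights alone, in GiB (2**30 bytes)."""
    return n_params * bytes_per_param / 2**30

total_params = 46.7e9   # Mistral's reported total for Mixtral 8x7B
gpu_gib = 80            # one "80 GB" card, treated here as 80 GiB (assumption)

for name, width in [("fp16", 2.0), ("int8", 1.0), ("4-bit", 0.5)]:
    need = weight_gib(total_params, width)
    verdict = "fits" if need < gpu_gib else "does not fit"
    print(name + ": " + format(need, ".1f") + " GiB, " + verdict + " on one GPU")
```

At fp16 the weights come to about 87 GiB (93.4 GB in decimal units), which is where the multi-GPU requirement comes from; quantizing to 8 bits or fewer brings the weights under the single-card limit, at whatever quality cost the quantization carries.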

Load-balance collapse. If the router learns to always prefer one or two experts, those experts get overtrained and the rest become useless. Training uses an auxiliary load-balance loss that penalizes the router if expert utilization is skewed. Getting this right is non-trivial.
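One widely used form of that penalty is the auxiliary loss from the Switch Transformer line of work: the number of experts times the sum over experts of (fraction of tokens routed to the expert) × (mean router probability for the expert). The sketch below assumes that formulation; the lesson does not say which variant any particular model uses, and the function and variable names are illustrative.

```python
def load_balance_loss(router_probs, assignments, n_experts):
    """Switch-Transformer-style auxiliary loss: n_experts * sum_i f_i * p_i.

    router_probs: per-token list of probabilities over experts.
    assignments:  per-token list of selected expert indices (top-k).
    """
    n_tokens = len(router_probs)
    n_slots = sum(len(a) for a in assignments)
    f = [0.0] * n_experts   # fraction of routing slots given to each expert
    p = [0.0] * n_experts   # mean router probability for each expert
    for probs, chosen in zip(router_probs, assignments):
        for i in chosen:
            f[i] += 1.0 / n_slots
        for i, pr in enumerate(probs):
            p[i] += pr / n_tokens
    return n_experts * sum(fi * pi for fi, pi in zip(f, p))

# Balanced: 4 tokens spread evenly over 4 experts (top-1 for brevity).
uniform = [[0.25] * 4 for _ in range(4)]
balanced = load_balance_loss(uniform, [[0], [1], [2], [3]], 4)

# Collapsed: every token strongly prefers expert 0.
skewed = [[0.97, 0.01, 0.01, 0.01] for _ in range(4)]
collapsed = load_balance_loss(skewed, [[0], [0], [0], [0]], 4)

print(round(balanced, 2), round(collapsed, 2))  # 1.0 3.88
```

The loss reaches its minimum of 1.0 when routing is uniform and grows as traffic concentrates on a few experts. In training it is multiplied by a small coefficient and added to the language-modeling loss, and that coefficient is the knob that needs careful tuning.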

Training instability. Sparse routing introduces discrete routing decisions. The gradient path through the router is noisy. MoE models are harder to train than equivalent dense models and often require careful tuning of the load-balance loss coefficient.

Why frontier labs adopted MoE

DeepSeek-V2 and V3 use a finer-grained MoE built from many small experts: V2 routes each token to 6 of 160 routed experts, and V3 to 8 of 256, with a small number of shared experts that always run. Each token therefore touches an even smaller fraction of the expert weights. Llama 4 (April 2025) ships Scout (17B active of 109B total) and Maverick (17B active of 400B total), both MoE. GPT-4 is widely believed to be an MoE model, though OpenAI has not confirmed this. The pattern is consistent: frontier quality at sub-frontier inference cost.

The underlying bet is that knowledge capacity (more total parameters) and compute cost (active parameters) can be decoupled — and that the router can learn to route well enough to make the decoupling worth the memory and training complexity.
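The degree of decoupling is easy to compare across models. This small sketch uses only the total and active parameter counts quoted in this lesson; the helper name active_fraction is illustrative.

```python
# (total B, active B) as quoted in this lesson.
models = {
    "Mixtral 8x7B": (46.7, 13.0),
    "Llama 4 Scout": (109.0, 17.0),
    "Llama 4 Maverick": (400.0, 17.0),
}

def active_fraction(total_b, active_b):
    """Share of the model's weights that a single token actually uses."""
    return active_b / total_b

for name, (total_b, active_b) in models.items():
    pct = 100 * active_fraction(total_b, active_b)
    print(name + ": " + format(pct, ".0f") + "% of weights active per token")
```

The trend runs toward smaller active fractions: Mixtral uses a bit over a quarter of its weights per token, while Maverick uses only a few percent.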

In one breath

A dense layer runs every token through all its parameters; MoE breaks that link.
An MoE layer is E experts plus a learned router that picks the top-k per token (Mixtral: 2 of 8).
Skipping E−k experts cuts feed-forward compute by (E−k)/E, so 47B params can run near a 13B price.
The costs are real: all experts sit in memory, the router can collapse (needs a load-balance loss), and training is less stable.
The bet, now standard at the frontier (DeepSeek, Llama 4, likely GPT-4): decouple capacity (total params) from cost (active params).

Quick check

Q1. Mixtral 8x7B routes each token to 2 of 8 experts. Roughly what fraction of the feed-forward compute is saved compared to activating all 8 experts?

Q2. A production team notices their MoE model's router is sending 90% of tokens to the same two experts. What training fix addresses this?

Q3. A new model is announced: 600B total parameters, 30B active per token, 20 experts. A competitor claims this 'runs like a 30B model in every way.' What is the most important caveat that claim ignores?

Next, multimodal LLMs — how the same token machinery extends to images and audio, by turning a picture into a sequence of tokens the model reads exactly like text.
