datarekha

Mixture of Experts

How do you get a trillion-parameter model that costs like a 13B one to run? Don't use all the parameters for every token. That's the MoE trick behind Mixtral and friends.

8 min read Advanced Generative AI Lesson 5 of 24

What you'll learn

  • Why dense transformers hit a wall: adding parameters always adds compute
  • How a router picks 2 of 8 experts per token, and why that cuts active feed-forward compute by 4x
  • The real costs of MoE: memory, load-balance failures, and training instability

In late 2023, Mistral dropped a model called Mixtral 8x7B. Benchmarks showed it trading blows with Llama 2 70B, a dense model that runs about five times as many parameters per token. The inference cost was closer to a 13B model. Reviewers were confused: the model card said 46.7B total parameters, but only ~13B were active per token. Which number was real?

Both were. Mixtral used a technique called Mixture of Experts (MoE), and understanding it requires rethinking what “model size” even means.

The wall that dense models hit

A standard transformer feed-forward layer takes every single token through the same set of weights on every single forward pass. If the layer has N parameters, every token pays the compute cost of N parameters — no exceptions.

This creates a trap. You want more knowledge capacity (more parameters), but every parameter you add costs compute at inference. A 70B dense model is roughly 5x more expensive to run than a 13B model, with no opt-out.
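A back-of-the-envelope version of that comparison, as a minimal sketch. It assumes the common rule of thumb of about 2 forward-pass FLOPs per parameter per token, which is not from this lesson and ignores attention's dependence on sequence length. The function name is hypothetical.

```python
def forward_flops_per_token(n_params: float) -> float:
    """Rule-of-thumb forward cost: roughly 2 FLOPs per parameter per token."""
    return 2.0 * n_params


dense_13b = forward_flops_per_token(13e9)
dense_70b = forward_flops_per_token(70e9)

# In a dense model every parameter is used for every token,
# so the cost ratio is just the parameter ratio.
cost_ratio = dense_70b / dense_13b
print(f"70B vs 13B per-token cost: {cost_ratio:.1f}x")
```

Under this approximation the ratio is simply 70/13, which is where the "roughly 5x" comes from.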

The question MoE answers: can you add parameters without adding proportional compute?

The MoE layer: experts plus a router

MoE replaces the single dense feed-forward network with two components:

  1. Experts: E separate feed-forward networks, each with its own weights. Mixtral uses E = 8.
  2. Router (also called a gating network): a small learned linear layer that looks at each token’s embedding and outputs a probability score for each expert. The top k experts by score are selected; the rest are skipped entirely. Mixtral uses k = 2.

For each token, the router picks 2 experts. The token passes through those 2 networks only. The outputs are weighted by the router scores and summed. All other 6 experts receive zero compute for that token.
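The routing step described above can be sketched in plain Python. This is a toy illustration under stated assumptions, not Mixtral's implementation: the class name `ToyMoELayer`, the tiny dimensions, the random weights, and the two-layer ReLU experts are all hypothetical (Mixtral's experts are gated feed-forward blocks). Renormalizing the router scores with a softmax over only the selected experts is one common choice.

```python
import math
import random


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def make_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


class ToyMoELayer:
    def __init__(self, d_model=4, d_hidden=8, num_experts=8, top_k=2, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        # Router: one row of weights per expert.
        self.router = make_matrix(num_experts, d_model, rng)
        # Each expert is its own small feed-forward network.
        self.experts = [
            (make_matrix(d_hidden, d_model, rng), make_matrix(d_model, d_hidden, rng))
            for _ in range(num_experts)
        ]

    def expert_forward(self, idx, token):
        w_in, w_out = self.experts[idx]
        hidden = [max(0.0, h) for h in matvec(w_in, token)]  # ReLU
        return matvec(w_out, hidden)

    def forward(self, token):
        logits = matvec(self.router, token)
        # Keep the top-k experts by router score; the rest are never evaluated.
        order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        chosen = order[: self.top_k]
        weights = softmax([logits[i] for i in chosen])
        out = [0.0] * len(token)
        for w, idx in zip(weights, chosen):
            expert_out = self.expert_forward(idx, token)
            out = [o + w * e for o, e in zip(out, expert_out)]
        return out, chosen, weights


layer = ToyMoELayer()
token = [0.5, -1.0, 0.25, 2.0]
output, chosen, weights = layer.forward(token)
print("experts used:", chosen)
print("router weights:", [round(w, 3) for w in weights])
```

Only `expert_forward` calls for the two chosen indices ever run, which is the whole point: the other six experts hold parameters but cost nothing for this token.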

[Diagram: Token → Router (top-2) → experts E1–E8 → weighted sum; 47B total params, ~13B active]

The router sends each token to exactly 2 of 8 experts (highlighted). The other 6 are skipped entirely, saving ~75% of feed-forward compute per token.

The arithmetic behind Mixtral 8x7B

The name invites a naive reading: 8 experts of roughly 7B parameters each, plus about 7B of non-expert weights (attention, embeddings, norms) that every token shares. Taken literally:

Total params  = 8 experts x 7B + 7B shared  = 63B
Active params = 2 experts x 7B + 7B shared  = 21B  ... wait, why 13B?

Neither line matches what Mistral reports (46.7B total, ~13B active), because both blocks are smaller than 7B. Summed over all layers, each expert holds about 5.64B feed-forward parameters, and the shared weights come to about 1.6B. That gives 8 x 5.64B + 1.6B ≈ 46.7B in total and 2 x 5.64B + 1.6B ≈ 12.9B active. The “~13B active” figure is therefore a plain parameter count, and because per-token compute scales with active parameters, it is also a fair proxy for compute. The feed-forward portion (the part you’re sparsifying) holds the large majority of a transformer’s parameters, so activating only 2 of 8 expert blocks skips 6/8 = 75% of it and brings the per-token budget down to that of a 13B dense model.
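As a cross-check, the reported totals can be rebuilt from layer dimensions. The dimensions below are assumed from Mixtral 8x7B's published configuration rather than taken from this lesson, and the count skips normalization weights, which are negligible at this scale.

```python
# Assumed from Mixtral 8x7B's published configuration (not from this lesson).
hidden, ffn_hidden, layers, vocab = 4096, 14336, 32, 32000
heads, kv_heads, experts, top_k = 32, 8, 8, 2
head_dim = hidden // heads

# One expert per layer: a gated feed-forward block with three weight matrices.
expert_per_layer = 3 * hidden * ffn_hidden
# Attention per layer: Q and O are hidden x hidden; K and V are narrower
# because several query heads share each key/value head.
attn_per_layer = 2 * hidden * hidden + 2 * hidden * kv_heads * head_dim
router_per_layer = hidden * experts
embeddings = 2 * vocab * hidden  # input embedding plus output projection

shared = layers * (attn_per_layer + router_per_layer) + embeddings
per_expert = layers * expert_per_layer

total = shared + experts * per_expert
active = shared + top_k * per_expert

print(f"per expert: {per_expert / 1e9:.2f}B")
print(f"shared:     {shared / 1e9:.2f}B")
print(f"total:      {total / 1e9:.1f}B")
print(f"active:     {active / 1e9:.1f}B")
```

With these dimensions the total rounds to 46.7B and the active count to 12.9B, so the ~13B figure falls out of counting parameters directly.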

With the simplified 7B figures (8 experts of 7B each, plus 7B shared), the tally is:

  • Total parameters: 63.0B (the simplified model; Mistral’s exact tally is 46.7B because their per-expert blocks are smaller than a full 7B)
  • Active per token: 21.0B
  • Experts skipped: 6 of 8
  • Feed-forward saved: 75%

The real Mixtral numbers are lower because both the per-expert feed-forward blocks and the shared weights are smaller than 7B, but the ratio logic is identical.
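The simplified figures listed above can be reproduced in a few lines. This is a sketch of the ratio logic only; `moe_tally` is a hypothetical helper, and the 7B inputs are the lesson's simplification rather than Mixtral's real block sizes.

```python
def moe_tally(num_experts, top_k, expert_b, shared_b):
    """Return (total, active, experts skipped, fraction of feed-forward skipped).

    expert_b and shared_b are parameter counts in billions.
    """
    total = num_experts * expert_b + shared_b
    active = top_k * expert_b + shared_b
    skipped = num_experts - top_k
    ffn_saved = skipped / num_experts
    return total, active, skipped, ffn_saved


total_b, active_b, skipped, ffn_saved = moe_tally(8, 2, expert_b=7.0, shared_b=7.0)
print(f"Total parameters: {total_b:.1f}B")
print(f"Active per token: {active_b:.1f}B")
print(f"Experts skipped: {skipped} of 8")
print(f"Feed-forward saved: {ffn_saved:.0%}")
```

Changing `expert_b` and `shared_b` moves the totals, but `ffn_saved` depends only on `top_k` and `num_experts`.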

Why the router works (specialization)

The router is trained end-to-end with the rest of the model. It learns, through gradient descent, which expert is better at handling which kinds of tokens. In practice, experts develop soft specializations — one might handle code tokens more than prose, another might be better at names or numbers. This is emergent, not hand-designed. The model discovers the specialization that minimizes loss.

The real costs

MoE is not a free lunch.

Memory. All experts must be loaded into GPU memory even though only 2 are active per token. At fp16, Mixtral's 46.7B parameters take about 93 GB (roughly 87 GiB) for the weights alone, too large for a single 80 GB A100. A 13B dense model needs about 26 GB and fits comfortably. Unquantized inference usually requires splitting the model across two or more GPUs, for example with tensor parallelism.
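The memory arithmetic is simple enough to sketch. This counts weights only and assumes 2 bytes per parameter at fp16; activations and the KV cache add more on top. `weight_memory_gb` is a hypothetical helper name.

```python
def weight_memory_gb(n_params: float, bytes_per_param: int) -> float:
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


mixtral_fp16_gb = weight_memory_gb(46.7e9, 2)
dense_13b_fp16_gb = weight_memory_gb(13e9, 2)
# The same Mixtral footprint in binary gibibytes (1 GiB = 2**30 bytes).
mixtral_fp16_gib = 46.7e9 * 2 / 2**30

print(f"Mixtral weights:   {mixtral_fp16_gb:.1f} GB ({mixtral_fp16_gib:.1f} GiB)")
print(f"13B dense weights: {dense_13b_fp16_gb:.1f} GB")
```

The two units explain why both "~87" and "~93" appear in discussions of Mixtral's footprint: they describe the same bytes, counted in GiB and GB respectively.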

Load-balance collapse. If the router learns to always prefer one or two experts, those experts absorb nearly all of the training signal while the rest stay undertrained and effectively unused. Training therefore adds an auxiliary load-balance loss that penalizes the router when expert utilization is skewed. Getting its weight right is non-trivial.
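One widely used form of this penalty, sketched below, multiplies each expert's share of routed tokens by its mean router probability and sums over experts. The lesson does not say which formulation Mixtral uses, so treat this as an illustrative assumption; `load_balance_loss` and the example inputs are hypothetical.

```python
def load_balance_loss(router_probs, assignments, num_experts):
    """Auxiliary loss: num_experts * sum_i(frac_i * mean_prob_i).

    router_probs: one probability list (over all experts) per token.
    assignments:  one list of chosen expert indices per token.
    Equals 1.0 when routing is perfectly uniform and grows as it skews.
    """
    n_tokens = len(router_probs)
    n_slots = sum(len(chosen) for chosen in assignments)

    # Fraction of routing slots that went to each expert.
    frac = [0.0] * num_experts
    for chosen in assignments:
        for idx in chosen:
            frac[idx] += 1.0 / n_slots

    # Mean router probability assigned to each expert.
    mean_prob = [
        sum(p[i] for p in router_probs) / n_tokens for i in range(num_experts)
    ]
    return num_experts * sum(f * p for f, p in zip(frac, mean_prob))


uniform = [[0.25, 0.25, 0.25, 0.25]] * 4
balanced = load_balance_loss(uniform, [[0, 1], [2, 3], [0, 1], [2, 3]], 4)

skewed = [[0.7, 0.3, 0.0, 0.0]] * 4
collapsed = load_balance_loss(skewed, [[0, 1]] * 4, 4)

print(f"balanced routing:  {balanced:.2f}")
print(f"collapsed routing: {collapsed:.2f}")
```

In training this term would be scaled by a small coefficient and added to the main loss; that coefficient is the knob that needs careful tuning.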

Training instability. Top-k selection is a discrete decision, so gradients reach the router only through the weights of the experts it happened to pick, which makes the training signal noisy. MoE models are harder to train than equivalent dense models and often require careful tuning of the load-balance loss coefficient.

Why frontier labs adopted MoE

DeepSeek-V3 uses a finer-grained MoE with 256 routed experts and k = 8, plus a shared expert that always runs, so each token activates an even smaller fraction of the experts; DeepSeek-V2 used 160 routed experts with k = 6. Llama 4 (April 2025) ships Scout (17B active of 109B total) and Maverick (17B active of 400B total), both MoE. GPT-4 is widely believed to be a MoE model based on latency and cost analysis, though OpenAI has not confirmed. The pattern is consistent: frontier quality at sub-frontier inference cost.
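To make the pattern concrete, here is a small sketch that computes the active fraction for the models whose sizes are quoted in this lesson. The figures are the lesson's own; the dictionary layout is just for illustration.

```python
# (active, total) parameters in billions, as quoted in this lesson.
models = {
    "Mixtral 8x7B": (13.0, 46.7),
    "Llama 4 Scout": (17.0, 109.0),
    "Llama 4 Maverick": (17.0, 400.0),
}

active_fraction = {name: active / total for name, (active, total) in models.items()}

for name, frac in active_fraction.items():
    print(f"{name}: {frac:.1%} of parameters active per token")
```

The newer models sit at a much lower active fraction than Mixtral, which is the decoupling of capacity from per-token compute pushed further.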

The underlying bet is that knowledge capacity (more total parameters) and compute cost (active parameters) can be decoupled — and that the router can learn to route well enough to make the decoupling worth the memory and training complexity.


Next

Scaling laws — why bigger models trained on more tokens keep getting better, and where MoE fits in the compute-optimal frontier.

Practice this in an interview

What techniques reduce LLM cost and latency in production?

Cost scales with input plus output tokens; latency scales with output tokens and model size. The highest-leverage levers are: model routing (use a small model when the task is simple), prompt caching (reuse expensive prefix computation), output length control, and batching. Together these can cut spend 60–90% without quality regression.

Why are smaller language models (SLMs) sometimes preferable to larger ones?

Smaller models win on latency, inference cost, on-device deployment, and fine-tuning feasibility. When trained on high-quality, curated data and aligned for a narrow task, a 7B–13B model can match or exceed a general-purpose 70B+ model on that specific workload while using a fraction of the compute budget.

What is mixed precision training and why does it matter?

Mixed precision training stores weights and activations in float16 (or bfloat16) for forward/backward passes while keeping a float32 master copy of weights for the update step. This roughly halves activation memory and delivers 2–4x throughput on modern tensor cores, with negligible accuracy loss when float16 is paired with loss scaling.

How do you optimise GPU utilization for model serving, and what role does dynamic batching play?

GPUs execute tensor operations efficiently only when the batch dimension is large enough to saturate all CUDA cores. Dynamic batching collects individual requests arriving within a short window and fuses them into a single GPU call, dramatically improving throughput and cost efficiency without sacrificing per-request latency beyond the configured wait threshold.
