What is Mixture of Experts (MoE) and how does it improve LLM scalability?

MoE replaces a dense feed-forward layer with many expert subnetworks plus a gating router that activates only a few experts per token. This grows total parameter count and capacity while keeping per-token compute roughly constant, since only a sparse subset of experts runs for any given token.

How would you reduce the cost of serving an ML or LLM model in production without hurting quality?

Work top-down: start at the model layer with quantization, distillation, or routing easy requests to cheaper models, since model choices drive every downstream cost. Then optimize the runtime with batching, caching, and techniques like prompt caching for LLMs, and finally match infrastructure to the load using autoscaling on queue depth and spot or batch capacity. Track cost per token or per prediction alongside latency percentiles and accuracy so optimizations never silently degrade quality.

Why are smaller language models (SLMs) sometimes preferable to larger ones?

Smaller models win on latency, inference cost, on-device deployment, and fine-tuning feasibility. When trained on high-quality, curated data and aligned for a narrow task, a 7B–13B model can match or exceed a general-purpose 70B+ model on that specific workload while using a fraction of the compute budget.

What techniques reduce LLM cost and latency in production?

Cost scales with input plus output tokens; latency scales with output tokens and model size. The highest-leverage levers are: model routing (use a small model when the task is simple), prompt caching (reuse expensive prefix computation), output length control, and batching. Together these can cut spend by as much as 60–90% on favorable workloads, but the saving depends on the traffic mix, and quality should be measured after each change rather than assumed.
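To make the routing lever concrete, here is a minimal cost sketch. The prices, the token counts, and the 70% share of requests sent to the small model are all made-up placeholders, not figures from any real provider; the point is only how the arithmetic works.

```python
def request_cost(in_tokens, out_tokens, price_in, price_out):
    """Dollar cost of one request; prices are per million tokens."""
    return (in_tokens * price_in + out_tokens * price_out) / 1_000_000

# Hypothetical price sheets (dollars per million tokens), for illustration only.
LARGE = {"price_in": 3.00, "price_out": 15.00}
SMALL = {"price_in": 0.25, "price_out": 1.25}

def blended_cost(in_tokens, out_tokens, small_share):
    """Average cost per request when `small_share` of traffic goes to the small model."""
    large = request_cost(in_tokens, out_tokens, **LARGE)
    small = request_cost(in_tokens, out_tokens, **SMALL)
    return small_share * small + (1.0 - small_share) * large

baseline = blended_cost(2000, 500, small_share=0.0)   # everything on the large model
routed = blended_cost(2000, 500, small_share=0.7)     # assumed: 70% of requests are "easy"
saving = 1.0 - routed / baseline
print(f"baseline ${baseline:.4f}, routed ${routed:.4f}, saving {saving:.0%}")
```

The saving here comes entirely from the assumed share of easy requests, which is why it pays to measure that share on real traffic before trusting any headline percentage.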

Mixture of Experts — Deep Learning

Here’s a riddle from the 2026 model landscape: a model can have hundreds of billions of parameters yet cost about the same to run as a model a fraction of its size. The trick is sparsity — on any given token, most of the model doesn’t fire. The mechanism is Mixture-of-Experts (MoE), and it’s why models like Mixtral, DeepSeek, and Qwen-MoE can be enormous on paper and still affordable to serve.

One big FFN → many small experts

Recall from the transformer block that the feed-forward (FFN) sublayer holds most of a transformer’s parameters and runs on every token. MoE replaces that single FFN with N smaller expert FFNs plus a tiny router. For each token, the router scores the experts and sends the token only to its top-k (usually k=1 or 2). The other experts are skipped entirely for that token.
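The routing step can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not how any production model implements it: the experts are tiny random ReLU FFNs, the names `moe_forward` and `make_expert` are made up for this example, and the gate weights come from a softmax over just the chosen top-k scores, which is one common choice among several.

```python
import math
import random

def matvec(w, x):
    """Multiply matrix w (a list of rows) by vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def make_expert(rng, d, hidden):
    """A tiny two-layer ReLU FFN standing in for one expert (hypothetical sizes)."""
    w1 = [[rng.gauss(0, 0.1) for _ in range(d)] for _ in range(hidden)]
    w2 = [[rng.gauss(0, 0.1) for _ in range(hidden)] for _ in range(d)]
    def expert(x):
        h = [max(0.0, v) for v in matvec(w1, x)]
        return matvec(w2, h)
    return expert

def moe_forward(x, router_w, experts, k=2):
    """Route one token vector x to its top-k experts and mix their outputs."""
    logits = matvec(router_w, x)                    # one router score per expert
    ranked = sorted(range(len(experts)), key=lambda e: logits[e], reverse=True)
    top = ranked[:k]                                # indices of the k best-scoring experts
    gates = softmax([logits[e] for e in top])       # assumption: renormalize over the chosen k
    out = [0.0] * len(x)
    for gate, e in zip(gates, top):                 # the other N - k experts never run
        y = experts[e](x)
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out, top

rng = random.Random(0)
d, hidden, n_experts, k = 8, 16, 8, 2
experts = [make_expert(rng, d, hidden) for _ in range(n_experts)]
router_w = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n_experts)]
token = [rng.gauss(0, 1) for _ in range(d)]
output, chosen = moe_forward(token, router_w, experts, k=k)
print(f"experts used for this token: {chosen} (out of {n_experts})")
```

Real implementations batch tokens per expert instead of looping token by token, but the logic is the same: score, pick top-k, run only those, and take a weighted sum.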

So the model stores all N experts, but any single token only computes k of them. Raising N grows the stored parameter count, while per-token compute stays tied to k.

The bargain: capacity scales with N, compute with k

That’s the whole point, stated as a tradeoff:

Parameters (capacity, “knowledge”) grow with the number of experts N.
Compute per token (FLOPs, cost) grows only with k, the experts actually used.

A model with 8 experts and top-2 routing has roughly the parameter count of 8 FFNs but the per-token compute of 2. You get a much larger, more capable model without paying to run all of it on every token. This is why “total parameters” and “active parameters” are now reported separately — a model might be “236B total, 21B active.”
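Here is that arithmetic as a quick sketch for a single MoE layer. The layer sizes are illustrative placeholders, and the expert is assumed to be a plain two-matrix FFN with biases ignored; real models often use gated FFNs with a third matrix, which changes the constants but not the shape of the result.

```python
def ffn_params(d_model, d_hidden):
    """Weights in a two-matrix FFN (up- and down-projection), biases ignored."""
    return 2 * d_model * d_hidden

def moe_layer_params(d_model, d_hidden, n_experts, k):
    """Return (total, active) parameter counts for one MoE layer."""
    per_expert = ffn_params(d_model, d_hidden)
    router = d_model * n_experts            # one score column per expert
    total = n_experts * per_expert + router  # everything that must be stored
    active = k * per_expert + router         # what a single token actually touches
    return total, active

# Illustrative sizes only, not taken from any specific model card.
total, active = moe_layer_params(d_model=4096, d_hidden=14336, n_experts=8, k=2)
print(f"total:  {total / 1e6:.0f}M parameters stored")
print(f"active: {active / 1e6:.0f}M parameters used per token")
print(f"ratio:  {total / active:.2f}x")
```

With 8 experts and top-2 routing the ratio lands just under 4x, because the router is tiny next to the experts. Note that whole-model "total vs. active" figures are less extreme than the per-layer ratio, since attention and embedding weights run for every token either way.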

The catch: load balancing

A naive router has a failure mode: it learns to love a few experts and send everything to them, while other experts get no tokens, no gradient, and never learn. That wastes most of the model’s capacity.

The standard fix is an auxiliary load-balancing loss added during training that penalizes imbalance, nudging the router to spread tokens evenly across experts.
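One widely used form of this loss comes from the Switch Transformer: multiply, for each expert, the fraction of tokens routed to it by the average router probability it received, sum over experts, and scale by N. The lesson does not say which formulation it has in mind, so treat the sketch below as one example under that assumption. It uses top-1 assignments, toy hand-written router outputs, and leaves out the small coefficient that normally scales the term before it is added to the main loss.

```python
def load_balance_loss(router_probs, assignments, n_experts):
    """Switch-Transformer-style auxiliary loss (assumed formulation).

    router_probs: one list of N softmax probabilities per token.
    assignments:  the expert index each token was actually sent to (top-1).
    """
    n_tokens = len(assignments)
    # f_e: fraction of tokens dispatched to expert e
    frac_tokens = [0.0] * n_experts
    for e in assignments:
        frac_tokens[e] += 1.0 / n_tokens
    # P_e: mean router probability given to expert e across the batch
    mean_prob = [sum(p[e] for p in router_probs) / n_tokens for e in range(n_experts)]
    return n_experts * sum(f * p for f, p in zip(frac_tokens, mean_prob))

n_experts = 4

# Toy batch 1: the router spreads 8 tokens evenly over 4 experts.
balanced = load_balance_loss(
    [[0.25, 0.25, 0.25, 0.25] for _ in range(8)],
    [0, 1, 2, 3, 0, 1, 2, 3],
    n_experts,
)

# Toy batch 2: the router has collapsed onto expert 0.
collapsed = load_balance_loss(
    [[0.97, 0.01, 0.01, 0.01] for _ in range(8)],
    [0] * 8,
    n_experts,
)

print(f"balanced router:  {balanced:.2f}")
print(f"collapsed router: {collapsed:.2f}")
```

The loss is at its minimum of 1.0 when both the token counts and the probabilities are uniform, and it climbs toward N as the router collapses onto a single expert, which is the pressure that keeps every expert receiving tokens and gradient.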

Quick check

Q1. In a Mixture-of-Experts layer, what does the router do?

Q2. What is the core tradeoff MoE exploits?

Q3. Why do MoE models add an auxiliary load-balancing loss during training?

That completes the Deep Learning track’s modern-architecture arc. From here, the Generative AI track takes these building blocks into production — serving, RAG, reasoning models, and evaluation.
