Deep Learning Hard
What is Mixture of Experts (MoE) and how does it improve LLM scalability?
The short answer
MoE replaces a dense feed-forward layer with many expert subnetworks plus a gating router that activates only a few experts per token. This grows total parameter count and capacity while keeping per-token compute roughly constant, since only a sparse subset of experts runs for any given token.
How to think about it
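Picture a dense Transformer in which each feed-forward block is swapped for N separately parameterized expert networks, with a small router that scores the experts for every token and sends the token to only the top k. With 8 experts and k = 2, for example, the layer stores 8 experts' worth of feed-forward weights, but each token pays for only 2 expert passes plus the cheap router. Total capacity therefore scales with the number of experts, while per-token compute scales with k.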
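To make the routing step concrete, here is a minimal sketch of a single MoE layer processing one token, written in plain Python so it runs without dependencies. Everything in it is an illustrative assumption rather than any particular model's implementation: the class name ToyMoELayer, the sizes (8 experts, top_k = 2), the two-matrix ReLU experts, and the choice to renormalize gate weights with a softmax over only the selected experts, which is one common variant among several.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def make_matrix(rows, cols, rng):
    """Small random weights; stands in for trained parameters."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


class ToyMoELayer:
    """Hypothetical top-k MoE layer: a linear router plus ReLU experts."""

    def __init__(self, d_model, d_hidden, num_experts, top_k, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        # Router: one score per expert for each token.
        self.router = make_matrix(num_experts, d_model, rng)
        # Each expert is its own feed-forward network (W_in, W_out).
        self.experts = [
            (make_matrix(d_hidden, d_model, rng),
             make_matrix(d_model, d_hidden, rng))
            for _ in range(num_experts)
        ]

    def forward(self, token):
        logits = matvec(self.router, token)
        # Keep only the top_k highest-scoring experts for this token.
        ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        chosen = ranked[: self.top_k]
        # Assumption: gate weights are a softmax over the chosen logits only.
        gates = softmax([logits[i] for i in chosen])
        out = [0.0] * len(token)
        for gate, idx in zip(gates, chosen):
            w_in, w_out = self.experts[idx]
            hidden = [max(0.0, h) for h in matvec(w_in, token)]  # ReLU
            expert_out = matvec(w_out, hidden)
            out = [o + gate * e for o, e in zip(out, expert_out)]
        # Experts that were not chosen are never evaluated.
        return out, chosen


d_model, d_hidden, num_experts, top_k = 4, 8, 8, 2
layer = ToyMoELayer(d_model, d_hidden, num_experts, top_k)
output, chosen = layer.forward([0.5, -1.0, 0.25, 2.0])

params_per_expert = 2 * d_model * d_hidden          # W_in plus W_out
total_expert_params = num_experts * params_per_expert
active_expert_params = top_k * params_per_expert

print("experts used for this token:", chosen)
print("stored expert parameters:", total_expert_params)
print("expert parameters touched per token:", active_expert_params)
```

In this toy configuration the layer stores 512 expert parameters but touches only 128 of them for any one token, which is the capacity-versus-compute gap described above. Real systems add pieces this sketch leaves out, such as batching tokens per expert and load-balancing the router.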