datarekha

What is Mixture of Experts (MoE) and how does it improve LLM scalability?

The short answer

MoE replaces a dense feed-forward layer with many expert subnetworks plus a gating router that activates only a few experts per token. This grows total parameter count and capacity while keeping per-token compute roughly constant, since only a sparse subset of experts runs for any given token.
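To put rough numbers on that claim, here is a small arithmetic sketch. The layer sizes, the expert count, and the helper names `ffn_params` and `moe_params` are all hypothetical and chosen only for illustration. It assumes each expert is a standard two-matrix feed-forward block and ignores biases.

```python
def ffn_params(d_model: int, d_ff: int) -> int:
    """Weights in a two-matrix feed-forward block (biases ignored)."""
    return 2 * d_model * d_ff


def moe_params(d_model: int, d_ff: int, num_experts: int, top_k: int):
    """Return (total, active) weight counts for one MoE layer.

    total  = weights stored for the whole layer
    active = weights actually used for a single token
    """
    expert = ffn_params(d_model, d_ff)
    router = d_model * num_experts          # one score per expert
    total = num_experts * expert + router
    active = top_k * expert + router
    return total, active


# Hypothetical sizes, not taken from any particular model.
d_model, d_ff = 1024, 4096
dense = ffn_params(d_model, d_ff)
total, active = moe_params(d_model, d_ff, num_experts=8, top_k=2)

print(f"dense FFN weights : {dense:,}")
print(f"MoE total weights : {total:,}  ({total / dense:.2f}x dense)")
print(f"MoE active weights: {active:,}  ({active / dense:.2f}x dense)")
```

Under these assumptions the layer stores about 8x the weights of the dense block, while each token touches about 2x. Raising `num_experts` further increases the stored weights but leaves the active count almost unchanged, because only the small router grows.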

How to think about it

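Picture each MoE layer as a panel of specialists with a dispatcher in front. For every token, the router scores all of the experts, keeps the few with the highest scores (often one or two), and combines their outputs using weights derived from those scores. Experts that were not selected do no work for that token, so adding more experts raises the parameter count without adding much per-token compute. The cost moves elsewhere: every expert's weights must still be held in memory, and training usually needs a load-balancing term so the router does not send most tokens to the same few experts.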
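The routing step described above can be sketched in plain Python. This is a toy illustration, not a real implementation: the class name `ToyMoE`, the random weights, and the sizes are hypothetical, each expert is reduced to a single linear map for brevity, and the softmax is taken over only the selected experts' scores, which is one common choice among several.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


class ToyMoE:
    """Toy top-k MoE layer for a single token (illustrative only)."""

    def __init__(self, d_model, num_experts, top_k, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        # Router: one row of weights per expert, producing one score each.
        self.router = [[rng.gauss(0, 1) for _ in range(d_model)]
                       for _ in range(num_experts)]
        # Assumption: each expert is a single d_model x d_model linear map.
        self.experts = [[[rng.gauss(0, 1) for _ in range(d_model)]
                         for _ in range(d_model)]
                        for _ in range(num_experts)]

    def forward(self, x):
        logits = matvec(self.router, x)
        # Keep the indices of the top_k highest-scoring experts.
        chosen = sorted(range(len(logits)),
                        key=lambda i: logits[i], reverse=True)[:self.top_k]
        # Normalise the selected scores into mixing weights.
        weights = softmax([logits[i] for i in chosen])
        out = [0.0] * len(x)
        for w, i in zip(weights, chosen):
            y = matvec(self.experts[i], x)   # only chosen experts run
            out = [o + w * yi for o, yi in zip(out, y)]
        return out, chosen


moe = ToyMoE(d_model=4, num_experts=8, top_k=2)
output, chosen = moe.forward([1.0, -0.5, 0.25, 2.0])
print("experts used:", chosen)
print("output:", output)
```

Only the two selected experts are evaluated in `forward`; the other six are skipped entirely for this token, which is where the compute saving comes from.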
