What is RMSNorm and why do modern LLMs like Llama use it instead of LayerNorm?

For: ML Engineer, Research Engineer, AI / LLM Engineer

The short answer

RMSNorm normalizes activations by their root-mean-square alone, dropping the mean-centering step and the bias term that LayerNorm uses. It is roughly 10 to 20 percent cheaper to compute, with no measurable quality loss on language modeling, which is why Llama, Mistral, and most modern open-weight LLMs have adopted it.
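The difference is easiest to see side by side. The following is a minimal sketch in plain Python, not the implementation from any particular model: the function names `layer_norm` and `rms_norm`, the `eps` default, and the toy input vector are all assumptions made for illustration. It normalizes a single activation vector both ways, so you can see that RMSNorm skips the mean computation, the subtraction, and the bias add.

```python
import math

def layer_norm(x, gain, bias, eps=1e-6):
    """LayerNorm on one vector: center, divide by std, then apply gain and bias."""
    n = len(x)
    mean = sum(x) / n                               # first statistic: mean
    var = sum((v - mean) ** 2 for v in x) / n       # second statistic: variance
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (v - mean) * inv_std + b for v, g, b in zip(x, gain, bias)]

def rms_norm(x, gain, eps=1e-6):
    """RMSNorm on one vector: divide by root-mean-square, then apply gain only."""
    n = len(x)
    mean_sq = sum(v * v for v in x) / n             # the only statistic needed
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [g * v * inv_rms for v, g in zip(x, gain)]

# Toy activation vector (hypothetical values) with identity gain and zero bias.
x = [1.0, 2.0, 3.0, 4.0]
ones = [1.0] * len(x)
zeros = [0.0] * len(x)

ln_out = layer_norm(x, ones, zeros)   # output has zero mean and unit variance
rms_out = rms_norm(x, ones)           # output has unit mean-square; mean is not removed

# When the input already has zero mean, the two normalizations coincide.
centered = [-1.5, -0.5, 0.5, 1.5]
ln_centered = layer_norm(centered, ones, zeros)
rms_centered = rms_norm(centered, ones)
```

Note that the two functions agree on `centered` because its mean is already zero, so its variance equals its mean square. On general inputs they differ only by the centering step and the bias.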

How to think about it
