datarekha

What is RMSNorm and why do modern LLMs like Llama use it instead of LayerNorm?

The short answer

RMSNorm normalizes each activation vector by its root-mean-square alone, dropping the mean-centering step and the bias term that LayerNorm uses. It is roughly 10 to 20 percent cheaper to compute than LayerNorm, with no measurable quality loss on language modeling, which is why Llama, Mistral, and most modern open-weight LLMs adopt it.

How to think about it

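LayerNorm does two things to each token's activation vector: it re-centers it by subtracting the mean, and it re-scales it by dividing by the standard deviation. It then applies a learned gain and a learned bias. RMSNorm keeps only the re-scaling: it divides the vector by its root-mean-square and applies a learned gain, with no mean subtraction and no bias. The bet behind this design is that the re-scaling, not the re-centering, is what keeps training stable. On language modeling that bet holds up: quality is not measurably worse, and each normalization call does less work, hence the roughly 10 to 20 percent saving and the adoption by Llama, Mistral, and most modern open-weight LLMs.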
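To make the difference concrete, here is a minimal sketch of both normalizations on a single activation vector, written in plain Python rather than a tensor library. The function names, the toy input, the `eps` value, and its placement inside the square root are assumptions for illustration, not taken from any particular model's code.

```python
import math

def layer_norm(x, gain, bias, eps=1e-6):
    """LayerNorm on one vector: re-center, re-scale, then apply gain and bias."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std * g + b for v, g, b in zip(x, gain, bias)]

def rms_norm(x, gain, eps=1e-6):
    """RMSNorm on one vector: re-scale by the root-mean-square, then apply gain only."""
    n = len(x)
    mean_sq = sum(v * v for v in x) / n
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * g for v, g in zip(x, gain)]

# Toy activation vector (hidden size 4) with a clearly non-zero mean.
x = [2.0, 4.0, 6.0, 8.0]
gain = [1.0] * len(x)  # learned in practice; fixed at 1 for this sketch
bias = [0.0] * len(x)  # used by LayerNorm only

y_ln = layer_norm(x, gain, bias)
y_rms = rms_norm(x, gain)

print("LayerNorm output mean:", sum(y_ln) / len(y_ln))
print("RMSNorm output mean:", sum(y_rms) / len(y_rms))
print("RMSNorm output mean square:", sum(v * v for v in y_rms) / len(y_rms))

# On an input whose mean is already zero, the two agree (with bias = 0).
x_mean = sum(x) / len(x)
x_centered = [v - x_mean for v in x]
ln_centered = layer_norm(x_centered, gain, bias)
rms_centered = rms_norm(x_centered, gain)
```

LayerNorm's output has its mean removed, while RMSNorm's output keeps whatever offset the input had and only fixes its scale. The last three lines show the special case where the two coincide: once the input mean is zero, the standard deviation and the root-mean-square are the same number.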

