Deep Learning Medium
What is RMSNorm and why do modern LLMs like Llama use it instead of LayerNorm?
The short answer
RMSNorm normalizes each activation vector by its root-mean-square only, dropping the mean-centering and bias terms that LayerNorm uses. The normalization step is roughly 10 to 20 percent cheaper to compute, with no measurable quality loss on language modeling, which is why Llama, Mistral, and most modern open-weight LLMs have adopted it.
How to think about it
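LayerNorm computes two statistics per token, the mean and the variance. It subtracts the mean, divides by the standard deviation, then applies a learned gain and a learned bias. RMSNorm computes a single statistic, the root-mean-square of the activations, divides by it, and applies only a learned gain. The underlying bet is that re-scaling, not re-centering, is what stabilizes training, so the mean subtraction and the bias can be dropped. That removes one reduction and a few elementwise operations from every normalization layer, which is where the savings cited above come from.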
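To make the difference concrete, here is a minimal NumPy sketch of the two operations. It is illustrative only, not the implementation used in Llama or any particular library. The function names, the `eps` value, and the toy input shape are assumptions chosen for the example.

```python
import numpy as np

def layer_norm(x, gain, bias, eps=1e-6):
    # Two statistics per row: mean and (population) variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    # Center, scale, then apply learned gain and bias.
    return (x - mean) / np.sqrt(var + eps) * gain + bias

def rms_norm(x, gain, eps=1e-6):
    # One statistic per row: the mean of squares. No centering, no bias.
    mean_sq = np.mean(x * x, axis=-1, keepdims=True)
    return x / np.sqrt(mean_sq + eps) * gain

# Toy activations (assumed shape: 4 tokens, hidden size 8) with a non-zero mean.
rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(4, 8))
gain = np.ones(8)
bias = np.zeros(8)

ln_out = layer_norm(x, gain, bias)
rms_out = rms_norm(x, gain)

# If the input already has zero mean per row, the two coincide
# (with unit gain and zero bias), because variance equals mean of squares.
x_centered = x - x.mean(axis=-1, keepdims=True)
same_when_centered = np.allclose(
    rms_norm(x_centered, gain), layer_norm(x_centered, gain, bias)
)
```

With unit gain and zero bias, each row of `ln_out` has zero mean, while each row of `rms_out` keeps its offset but has a root-mean-square close to 1. On already-centered input the two functions agree, which shows that RMSNorm differs from LayerNorm only in skipping the centering and the bias.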