What is the difference between batch normalization and layer normalization, and why do transformers use layer norm?

The short answer

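Batch norm normalizes each feature across the samples in a batch, so it depends on batch statistics and behaves differently in training (statistics of the current batch) versus inference (running averages collected during training). Layer norm normalizes across the features of a single example (in a transformer, of each token), independent of batch size, and behaves the same in both modes. Transformers use layer norm because sequence models have variable lengths and small or varying batches, which make batch statistics noisy; per-example normalization is more stable under those conditions.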
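The difference in normalization axis can be made concrete with a small sketch. It is a simplified illustration in plain Python, not how any framework implements these layers: the helper names (`_normalize`, `batch_norm`, `layer_norm`), the toy data, and the `EPS` value are all assumptions made for the example, and the learnable scale and shift parameters and batch norm's running averages are deliberately left out.

```python
import math

EPS = 1e-5  # small constant for numerical stability (illustrative value)


def _normalize(values):
    """Shift and scale a list of floats to zero mean and (near) unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)  # biased variance
    return [(v - mean) / math.sqrt(var + EPS) for v in values]


def batch_norm(x):
    """Normalize each feature (column) across the samples in the batch."""
    columns = [_normalize(list(col)) for col in zip(*x)]
    return [list(row) for row in zip(*columns)]  # transpose back to rows


def layer_norm(x):
    """Normalize each sample (row) across its own features."""
    return [_normalize(row) for row in x]


# Toy batch: 3 samples (rows) with 4 features (columns) each.
batch = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [0.0, 10.0, 20.0, 30.0],
]
first_alone = [batch[0]]  # the same first sample, as a batch of one

# Layer norm: the first sample's output ignores the rest of the batch.
ln_in_batch = layer_norm(batch)[0]
ln_alone = layer_norm(first_alone)[0]

# Batch norm: the first sample's output changes with its batch-mates.
# With a batch of one, every feature equals its own mean and collapses to 0.
bn_in_batch = batch_norm(batch)[0]
bn_alone = batch_norm(first_alone)[0]

print("layer norm identical:", ln_in_batch == ln_alone)
print("batch norm identical:", bn_in_batch == bn_alone)
```

Here the layer norm output for the first sample is the same whether it is processed alone or alongside other samples, while the batch norm output is not. That dependence on batch composition is what makes batch norm awkward for small or variable batches.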

