datarekha

What is the difference between batch normalization and layer normalization, and why do transformers use layer norm?

The short answer

Batch norm normalizes each feature across the samples in a batch, so it depends on batch statistics and behaves differently in training (statistics of the current batch) and inference (running averages stored during training). Layer norm normalizes across the features of a single example, so it is independent of batch size and computes the same thing in both modes. Transformers use layer norm because sequence models have variable lengths and small or varying batches, where per-example normalization is more stable.

How to think about it

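Picture the activations as a table with one row per example and one column per feature. Batch norm works down each column: it computes a mean and variance for every feature over all the rows in the batch, so an example's output depends on which other examples happen to share its batch. At inference it replaces those batch statistics with running averages collected during training, which is why the two modes differ. Layer norm works along each row: it computes a mean and variance over the features of one example (in a transformer, one token's hidden vector) and uses nothing from the rest of the batch. That is the situation transformers are in: sequences vary in length and batches are often small or uneven, so batch statistics would be noisy while per-example statistics stay stable.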
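The difference in normalization axis can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not how any particular framework implements it: the helper names (`_normalize`, `batch_norm`, `layer_norm`), the `EPS` value, and the toy inputs are all made up for the example, and the learnable scale and shift parameters as well as batch norm's running averages are left out.

```python
from statistics import fmean

EPS = 1e-5  # small stability constant; the exact value is an assumption


def _normalize(values):
    """Shift and scale a list of floats to roughly zero mean, unit variance."""
    mean = fmean(values)
    var = fmean([(v - mean) ** 2 for v in values])  # population variance
    return [(v - mean) / (var + EPS) ** 0.5 for v in values]


def batch_norm(batch):
    """Training-mode batch norm: normalize each feature (column) over the batch."""
    columns = [_normalize(list(col)) for col in zip(*batch)]
    return [list(row) for row in zip(*columns)]


def layer_norm(batch):
    """Layer norm: normalize each example (row) over its own features."""
    return [_normalize(row) for row in batch]


example = [1.0, 2.0, 3.0]
batch_a = [example, [2.0, 4.0, 6.0], [4.0, 0.0, 5.0]]
batch_b = [example, [10.0, 20.0, 30.0], [-5.0, 0.0, 7.0]]

# Layer norm: the example's output ignores whatever else is in the batch.
layer_alone = layer_norm([example])[0]
layer_in_batch = layer_norm(batch_a)[0]

# Batch norm: the same example gets different outputs in different batches.
bn_a = batch_norm(batch_a)[0]
bn_b = batch_norm(batch_b)[0]

print("layer norm, alone vs in batch:", layer_alone, layer_in_batch)
print("batch norm, batch A vs batch B:", bn_a, bn_b)
```

Running it shows the layer norm output for `example` is identical whether it is processed alone or inside a batch, while its batch norm output changes when the other rows change.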

