What is the difference between batch normalization and layer normalization, and why do transformers use layer norm?
Batch norm normalizes each feature across the samples in a batch, so it depends on batch statistics and behaves differently in training versus inference; layer norm normalizes across the features of a single example, independent of batch size. Transformers use layer norm because sequence models have variable lengths and small or varying batches, where per-example normalization is more stable.
How to think about it
Batch norm normalizes each feature across the samples in a batch, so it depends on batch statistics and behaves differently in training versus inference; layer norm normalizes across the features of a single example, independent of batch size. Transformers use layer norm because sequence models have variable lengths and small or varying batches, where per-example normalization is more stable.