Describe the components of a transformer block and the difference between pre-norm and post-norm.

For: Research Engineer, ML Engineer, AI / LLM Engineer

The short answer

A transformer block has two sublayers: multi-head self-attention and a position-wise feed-forward network. Each is wrapped in a residual connection and layer normalization. Post-norm (the original transformer) normalizes after the residual add: x = LayerNorm(x + Sublayer(x)). Pre-norm normalizes only the sublayer's input, inside the residual branch: x = x + Sublayer(LayerNorm(x)). Because pre-norm leaves the residual path as an unnormalized identity, gradients stay more stable as depth grows, which is why it is standard in modern deep LLMs.
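The two orderings can be sketched in a few lines of plain Python. This is a minimal illustration under simplifying assumptions, not a real implementation: the layer norm has no learned scale or shift, each sublayer is a callable on a single token vector (real attention sees the whole sequence), and the names `post_norm_step`, `pre_norm_step`, `block`, and `zero_sublayer` are hypothetical helpers introduced only for this example.

```python
import math
from typing import Callable, List

Vector = List[float]
Sublayer = Callable[[Vector], Vector]


def layer_norm(x: Vector, eps: float = 1e-5) -> Vector:
    """Normalize one token vector to zero mean and unit variance (no learned parameters)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def add(a: Vector, b: Vector) -> Vector:
    """Element-wise sum, used for the residual connection."""
    return [u + v for u, v in zip(a, b)]


def post_norm_step(x: Vector, sublayer: Sublayer) -> Vector:
    # Post-norm: normalize AFTER the residual add.
    return layer_norm(add(x, sublayer(x)))


def pre_norm_step(x: Vector, sublayer: Sublayer) -> Vector:
    # Pre-norm: normalize only the sublayer input; the residual path carries x unchanged.
    return add(x, sublayer(layer_norm(x)))


def block(x: Vector, attention: Sublayer, feed_forward: Sublayer, step) -> Vector:
    """One block: attention sublayer, then feed-forward sublayer, both wrapped by `step`."""
    x = step(x, attention)
    return step(x, feed_forward)


def zero_sublayer(v: Vector) -> Vector:
    # Hypothetical stand-in that contributes nothing, to isolate the residual path.
    return [0.0] * len(v)


x = [1.0, 2.0, 3.0, 6.0]
pre_out = block(x, zero_sublayer, zero_sublayer, pre_norm_step)
post_out = block(x, zero_sublayer, zero_sublayer, post_norm_step)
```

With sublayers that output zeros, the pre-norm block returns its input unchanged, while the post-norm block still rescales it. That untouched identity path is the property usually credited for pre-norm's more stable gradients.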

Keep practising

Walk me through the transformer encoder architecture block by block.
What roles do residual connections and layer normalisation play in transformer training?
What is the difference between batch normalization and layer normalization, and why do transformers use layer norm?
What is the difference between encoder-only, decoder-only, and encoder-decoder transformer architectures?
Why does a transformer need positional encoding?
