datarekha

Describe the components of a transformer block and the difference between pre-norm and post-norm.

The short answer

A transformer block has two sublayers: multi-head self-attention and a position-wise feed-forward network, each wrapped in a residual connection and a normalization (usually layer normalization). Post-norm, used in the original transformer, normalizes after the residual add. Pre-norm normalizes inside the residual branch, before the sublayer, so the skip path stays an unnormalized identity. That gives more stable gradients in deep stacks, which is why pre-norm is the usual choice in modern LLMs.

How to think about it

Write each sublayer as a function F (attention or feed-forward) applied to its input x. Post-norm computes Norm(x + F(x)): the normalization sits on the main path, so the residual stream is rescaled at every layer and gradients must pass through a norm each time, which is why deep post-norm stacks typically need learning-rate warmup to train stably. Pre-norm computes x + F(Norm(x)): the skip connection stays a clean identity, so gradients reach early layers directly, and a final normalization is usually applied after the last block.
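The two orderings are easier to compare side by side in code. The following is a minimal NumPy sketch, not a reference implementation: it assumes a single attention head instead of multi-head, a layer norm without learned scale and shift, and small random weights. All names (`layer_norm`, `self_attention`, `feed_forward`, `post_norm_block`, `pre_norm_block`) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 8, 32, 4

# Random weights stand in for trained parameters (assumption for illustration).
Wq, Wk, Wv, Wo = (rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(4))
W1 = rng.normal(scale=0.1, size=(d_model, d_ff))
W2 = rng.normal(scale=0.1, size=(d_ff, d_model))

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def self_attention(x):
    # Single-head scaled dot-product attention (simplification of multi-head).
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d_model)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return (weights @ v) @ Wo

def feed_forward(x):
    # Position-wise feed-forward: applied to every token independently.
    return np.maximum(x @ W1, 0.0) @ W2

def post_norm_block(x, attn, ffn):
    # Normalize AFTER the residual add: Norm(x + F(x)).
    x = layer_norm(x + attn(x))
    x = layer_norm(x + ffn(x))
    return x

def pre_norm_block(x, attn, ffn):
    # Normalize INSIDE the residual branch: x + F(Norm(x)).
    x = x + attn(layer_norm(x))
    x = x + ffn(layer_norm(x))
    return x

x = rng.normal(size=(seq_len, d_model))
post = post_norm_block(x, self_attention, feed_forward)
pre = pre_norm_block(x, self_attention, feed_forward)
print(post.shape, pre.shape)
```

In the post-norm block the output is always normalized, so the input's own scale is lost at every layer. In the pre-norm block the input `x` is carried through unchanged and only the sublayer outputs are added to it.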

