datarekha
Deep Learning | Medium | Asked at: Google, OpenAI, Meta, Anthropic

Walk me through the transformer encoder architecture block by block.

The short answer

Each encoder layer applies multi-head self-attention followed by a position-wise feed-forward network, with a residual connection and layer normalisation wrapped around each sub-layer. Stacking N such layers lets the network build progressively more abstract contextualised representations.

How to think about it

One encoder layer has two sub-layers, each wrapped with a residual connection and layer normalisation:

Sub-layer 1 — Multi-head self-attention:

X' = LayerNorm(X + MultiHeadAttn(X, X, X))
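As a rough illustration of sub-layer 1, the sketch below implements scaled dot-product multi-head self-attention in NumPy and wraps it with the residual connection and LayerNorm. The projection names (w_q, w_k, w_v, w_o), the toy sizes, and the random weights are assumptions made for this example, and LayerNorm's learnable gain and bias are omitted to keep it short.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalise each token's feature vector (learnable gain/bias omitted).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def split(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    # Self-attention: queries, keys and values are all projections of the same X.
    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (num_heads, seq_len, seq_len)
    weights = softmax(scores, axis=-1)                   # each row sums to 1
    heads = weights @ v                                  # (num_heads, seq_len, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o, weights

# Toy sizes and random weights, purely illustrative.
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 16, 4
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (
    rng.normal(scale=d_model ** -0.5, size=(d_model, d_model)) for _ in range(4)
)

attn_out, weights = multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads)
x_prime = layer_norm(x + attn_out)  # X' = LayerNorm(X + MultiHeadAttn(X, X, X))
```

The output keeps the input shape (seq_len, d_model), which is what allows the residual addition and lets identical layers be stacked.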

Sub-layer 2 — Position-wise FFN:

X'' = LayerNorm(X' + FFN(X'))

where FFN(x) = max(0, x W_1 + b_1) W_2 + b_2, that is, two linear layers with a ReLU in between, applied identically and independently to each position. The inner dimension d_ff is typically 4× d_model (e.g. d_ff=2048 for d_model=512).
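A minimal NumPy sketch of the FFN formula, using the d_model=512 and inner dimension 2048 quoted above. The weight initialisation and the variable names are assumptions for illustration. The last lines check the "position-wise" property: running a single position through the FFN gives the same result as that position's row in the full output.

```python
import numpy as np

def ffn(x, w_1, b_1, w_2, b_2):
    # FFN(x) = max(0, x W_1 + b_1) W_2 + b_2, applied to every row (position) of x.
    hidden = np.maximum(0.0, x @ w_1 + b_1)  # ReLU over the wide inner layer
    return hidden @ w_2 + b_2

rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 512, 2048, 3

# Illustrative random weights; real models learn these.
w_1 = rng.normal(scale=0.02, size=(d_model, d_ff))
b_1 = np.zeros(d_ff)
w_2 = rng.normal(scale=0.02, size=(d_ff, d_model))
b_2 = np.zeros(d_model)

x = rng.normal(size=(seq_len, d_model))
out = ffn(x, w_1, b_1, w_2, b_2)          # all positions at once
row = ffn(x[1:2], w_1, b_1, w_2, b_2)     # position 1 on its own

same = np.allclose(out[1], row[0])        # no information flows between positions
```

Because positions never interact inside the FFN, all mixing of information across the sequence happens in the attention sub-layer.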

Why residual connections? They give gradients a direct path from the loss back to earlier layers, which mitigates vanishing gradients and the degradation in trainability that otherwise appears as depth increases. This is the same motivation as in ResNets.
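The gradient shortcut can be made concrete with one line of calculus. The derivation below treats a single residual branch y = x + F(x), where F stands for either sub-layer, and ignores the LayerNorm that follows it (in the post-norm form used here, LayerNorm contributes its own Jacobian on top). The symbols y, F and the loss L are introduced only for this illustration.

```latex
y = x + F(x)
\quad\Longrightarrow\quad
\frac{\partial \mathcal{L}}{\partial x}
= \frac{\partial \mathcal{L}}{\partial y}\left(I + \frac{\partial F}{\partial x}\right)
= \underbrace{\frac{\partial \mathcal{L}}{\partial y}}_{\text{identity path}}
+ \underbrace{\frac{\partial \mathcal{L}}{\partial y}\,\frac{\partial F}{\partial x}}_{\text{path through the sub-layer}}
```

The first term passes the upstream gradient through unchanged, so even if the Jacobian of F is small, the gradient reaching x does not shrink to zero.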

Why layer normalisation? Unlike batch normalisation, LayerNorm normalises across the feature dimension for each token independently, so its statistics depend on neither the batch size nor the sequence length. This matters at inference time, for example in autoregressive generation, where the batch size can be 1.
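The contrast can be checked numerically. The sketch below compares LayerNorm with a simplified training-mode batch normalisation (batch statistics only, no running averages or learnable parameters). Both helper functions and the toy shapes are assumptions for this example. A token normalised with LayerNorm gets the same output whether it is processed alone or with others, while batch statistics collapse when the batch contains a single item.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Statistics over the feature axis, computed separately for each token.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def batch_norm_train(x, eps=1e-5):
    # Simplified training-mode batch norm: statistics over the batch axis, per feature.
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
batch = rng.normal(size=(4, 8))          # 4 tokens, 8 features each
gamma, beta = np.ones(8), np.zeros(8)

ln_full = layer_norm(batch, gamma, beta)
ln_single = layer_norm(batch[:1], gamma, beta)   # the first token on its own

bn_full = batch_norm_train(batch)
bn_single = batch_norm_train(batch[:1])          # batch of 1: mean equals the sample

ln_is_batch_independent = np.allclose(ln_full[0], ln_single[0])
bn_collapses_at_batch_1 = np.allclose(bn_single, 0.0)
```

With a batch of one, each feature's batch mean is the sample itself, so the batch-normalised output is all zeros and the token's content is lost. Production batch-norm layers avoid this at inference by using running statistics, but LayerNorm needs no such workaround.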

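The full encoder stacks N layers that are identical in structure but have their own weights (N=6 in the original paper; modern LLMs use 12 to 96 or more). Each layer re-contextualises the output of the layer below it, which is how depth yields progressively more abstract representations.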

Input embeddings + Positional Encoding
        ↓
[MultiHeadAttn → Add&Norm → FFN → Add&Norm]  ×  N
        ↓
Encoder output (used as K, V in cross-attention by decoder)
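Putting the pieces together, the following is a compact NumPy sketch of the whole pipeline: embeddings plus positional encoding, then N post-norm encoder layers. It assumes the sinusoidal positional encoding from the original paper, random untrained weights, toy dimensions, and hypothetical helper names (init_layer, encoder_layer, encoder). LayerNorm gain/bias, dropout, masking, and embedding scaling are left out for brevity.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def self_attention(x, p, num_heads):
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def split(t):  # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ p["w_q"]), split(x @ p["w_k"]), split(x @ p["w_v"])
    weights = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_head))
    heads = (weights @ v).transpose(1, 0, 2).reshape(seq_len, d_model)
    return heads @ p["w_o"]

def ffn(x, p):
    return np.maximum(0.0, x @ p["w_1"] + p["b_1"]) @ p["w_2"] + p["b_2"]

def encoder_layer(x, p, num_heads):
    x = layer_norm(x + self_attention(x, p, num_heads))  # MultiHeadAttn -> Add&Norm
    return layer_norm(x + ffn(x, p))                     # FFN -> Add&Norm

def init_layer(rng, d_model, d_ff):
    def w(rows, cols):
        return rng.normal(scale=rows ** -0.5, size=(rows, cols))
    return {
        "w_q": w(d_model, d_model), "w_k": w(d_model, d_model),
        "w_v": w(d_model, d_model), "w_o": w(d_model, d_model),
        "w_1": w(d_model, d_ff), "b_1": np.zeros(d_ff),
        "w_2": w(d_ff, d_model), "b_2": np.zeros(d_model),
    }

def positional_encoding(seq_len, d_model):
    # Sinusoidal encoding; assumes d_model is even.
    pos = np.arange(seq_len)[:, None]
    two_i = np.arange(0, d_model, 2)[None, :]
    angles = pos / (10000 ** (two_i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def encoder(token_ids, embedding, layers, num_heads):
    x = embedding[token_ids] + positional_encoding(len(token_ids), embedding.shape[1])
    for p in layers:  # same structure in every layer, separate weights
        x = encoder_layer(x, p, num_heads)
    return x

rng = np.random.default_rng(0)
vocab_size, d_model, d_ff, num_heads, n_layers = 100, 32, 128, 4, 6
embedding = rng.normal(size=(vocab_size, d_model))
layers = [init_layer(rng, d_model, d_ff) for _ in range(n_layers)]

token_ids = np.array([5, 17, 42, 8])
encoder_output = encoder(token_ids, embedding, layers, num_heads)  # (seq_len, d_model)
```

The result has one d_model-sized vector per input token. In an encoder-decoder model, these vectors are what the decoder's cross-attention projects into keys and values.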