Walk me through the transformer encoder architecture block by block.
Each encoder layer applies multi-head self-attention followed by a position-wise feed-forward network, with a residual connection and layer normalisation wrapped around each sub-layer. Stacking N such layers lets the network build progressively more abstract contextualised representations.
How to think about it
One encoder layer has two sub-layers, each wrapped with a residual connection and layer normalisation:
Sub-layer 1 — Multi-head self-attention:
X' = LayerNorm(X + MultiHeadAttn(X, X, X))
Sub-layer 2 — Position-wise FFN:
X'' = LayerNorm(X' + FFN(X'))
where FFN(x) = max(0, x W_1 + b_1) W_2 + b_2, i.e. two linear layers with a ReLU in between, applied identically and independently to each position. The inner dimension d_ff is typically 4 × d_model (e.g. d_ff=2048 for d_model=512).
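The two sub-layer equations above can be sketched in NumPy for a single unbatched sequence. This is a minimal illustration, not a reference implementation: the parameter names (Wq, Wk, Wv, Wo, W1, W2, b1, b2), the init_params helper, and the tiny sizes are assumptions made for the example, and the LayerNorm here omits the learnable gain and bias.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalise each token's feature vector (last axis); no learnable gain/bias here.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attn(x, p, n_heads):
    seq_len, d_model = x.shape
    d_head = d_model // n_heads

    def split(t):
        # (seq_len, d_model) -> (n_heads, seq_len, d_head)
        return t.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    # Self-attention: queries, keys and values are all projections of the same x.
    q, k, v = split(x @ p["Wq"]), split(x @ p["Wk"]), split(x @ p["Wv"])
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (n_heads, seq_len, seq_len)
    heads = softmax(scores) @ v                          # (n_heads, seq_len, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ p["Wo"]

def ffn(x, p):
    # FFN(x) = max(0, x W_1 + b_1) W_2 + b_2, applied row by row (per position).
    return np.maximum(0.0, x @ p["W1"] + p["b1"]) @ p["W2"] + p["b2"]

def encoder_layer(x, p, n_heads):
    x1 = layer_norm(x + multi_head_attn(x, p, n_heads))  # X'  = LayerNorm(X + MultiHeadAttn(X, X, X))
    return layer_norm(x1 + ffn(x1, p))                   # X'' = LayerNorm(X' + FFN(X'))

def init_params(d_model, d_ff, rng, scale=0.02):
    # Hypothetical helper: small random weights, zero biases.
    shapes = {"Wq": (d_model, d_model), "Wk": (d_model, d_model),
              "Wv": (d_model, d_model), "Wo": (d_model, d_model),
              "W1": (d_model, d_ff), "W2": (d_ff, d_model)}
    p = {name: scale * rng.normal(size=shape) for name, shape in shapes.items()}
    p["b1"] = np.zeros(d_ff)
    p["b2"] = np.zeros(d_model)
    return p

rng = np.random.default_rng(0)
params = init_params(d_model=8, d_ff=32, rng=rng)  # d_ff = 4 * d_model
x = rng.normal(size=(5, 8))                        # 5 tokens, d_model = 8
out = encoder_layer(x, params, n_heads=2)
print(out.shape)  # (5, 8)
```

The output has the same shape as the input, which is what makes the layer stackable. Because ffn only uses row-wise operations, calling it on a single row of x gives the same result as the matching row of the full call.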
Why residual connections? They give gradients a direct identity path from the loss back to earlier layers, which counters vanishing gradients and the degradation in trainability that plain networks show as depth increases. This is the same motivation as in ResNets.
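A toy numerical check can make the gradient-shortcut argument concrete. The sketch below is an assumption-laden simplification: each "layer" is just a small random linear map W_i (no attention, no nonlinearity, no normalisation), and it compares the input-to-output Jacobian of a plain stack h -> W_i h with that of a residual stack h -> h + W_i h.

```python
import numpy as np

rng = np.random.default_rng(0)
d, depth = 16, 20
# Hypothetical toy setup: 20 layers, each a small random linear map.
weights = [0.05 * rng.normal(size=(d, d)) for _ in range(depth)]

jac_plain = np.eye(d)
jac_resid = np.eye(d)
for w in weights:
    jac_plain = w @ jac_plain                # plain layer:    h -> W h
    jac_resid = (np.eye(d) + w) @ jac_resid  # residual layer: h -> h + W h

# The Frobenius norm of the end-to-end Jacobian measures how much gradient
# signal can travel between the last layer and the first.
plain_norm = np.linalg.norm(jac_plain)
resid_norm = np.linalg.norm(jac_resid)
print(f"plain: {plain_norm:.2e}  residual: {resid_norm:.2e}")
```

In this setup the plain product shrinks towards zero with depth, while the residual product stays at a usable scale because every factor contains the identity.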
Why layer normalisation? Unlike batch normalisation, LayerNorm computes its statistics over the feature dimension of each token independently, so the result does not depend on sequence length or on the other examples in the batch. That keeps it stable with variable-length inputs and small batches, including inference with batch size 1.
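The difference in normalisation axes can be checked directly. The sketch below is illustrative only: both functions omit learnable parameters, and batch_norm_train is a simplified training-mode batch norm (statistics over batch and positions, no running averages), written here just for the comparison.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # One mean/variance per token, computed over the feature axis.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def batch_norm_train(x, eps=1e-5):
    # One mean/variance per feature, computed over batch and sequence axes.
    mean = x.mean(axis=(0, 1), keepdims=True)
    var = x.var(axis=(0, 1), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
batch = rng.normal(size=(4, 6, 8))  # (batch, seq_len, d_model)
single = batch[:1]                  # the same first sequence, batch size 1

# Does the first sequence get the same output regardless of its batch-mates?
ln_same = np.allclose(layer_norm(batch)[0], layer_norm(single)[0])
bn_same = np.allclose(batch_norm_train(batch)[0], batch_norm_train(single)[0])
print(ln_same, bn_same)
```

LayerNorm gives the first sequence the same output whether it is processed alone or in a batch of four; the batch-statistics version does not, because its mean and variance change with the batch contents.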
The full encoder stacks N layers that are identical in structure but have their own weights (N=6 in the original paper; modern LLMs use 12–96+). Each layer re-contextualises the output of the layer below it, which is how the stack builds progressively more abstract representations.
Input embeddings + Positional Encoding
↓
[MultiHeadAttn → Add&Norm → FFN → Add&Norm] × N
↓
Encoder output (used as K, V in cross-attention by decoder)
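The data flow in the diagram can be sketched as a short loop. Several things here are assumptions for illustration: the positional encoding is the sinusoidal scheme from the original paper (with d_model assumed even), the embedding table is random, and make_placeholder_layer is a hypothetical stand-in that keeps only the Add&Norm wrapper around a single linear map instead of real attention and FFN sub-layers.

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    # PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same angle).
    pos = np.arange(seq_len)[:, None]
    two_i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def make_placeholder_layer(d_model, rng):
    # Stand-in for a real encoder layer: residual + LayerNorm around one
    # random linear map. Each call creates a layer with its own weights.
    w = 0.02 * rng.normal(size=(d_model, d_model))
    return lambda x: layer_norm(x + x @ w)

def encode(token_ids, embedding, layers):
    x = embedding[token_ids]                                        # input embeddings
    x = x + sinusoidal_positions(len(token_ids), embedding.shape[1])  # + positional encoding
    for layer in layers:                                            # [...] x N
        x = layer(x)
    return x                                                        # encoder output

rng = np.random.default_rng(0)
vocab_size, d_model, n_layers = 100, 8, 6
embedding = rng.normal(size=(vocab_size, d_model))
layers = [make_placeholder_layer(d_model, rng) for _ in range(n_layers)]
token_ids = np.array([3, 14, 15, 92, 65])
out = encode(token_ids, embedding, layers)
print(out.shape)  # (5, 8)
```

Every layer maps a (seq_len, d_model) array to another of the same shape, so any number of layers can be chained, and the final array is what a decoder would read as keys and values in cross-attention.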