Deep Learning · Medium · Asked at Google, OpenAI, Meta, Anthropic

Walk me through the transformer encoder architecture block by block.

For: ML Engineer, AI / LLM Engineer, Data Scientist

The short answer

Each encoder layer applies multi-head self-attention followed by a position-wise feed-forward network, with a residual connection and layer normalisation wrapped around each sub-layer. Stacking N such layers lets the network build progressively more abstract contextualised representations.

How to think about it

One encoder layer has two sub-layers, each wrapped with a residual connection and layer normalisation:

Sub-layer 1 — Multi-head self-attention:

X' = LayerNorm(X + MultiHeadAttn(X, X, X))
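
To make sub-layer 1 concrete, here is a minimal NumPy sketch of multi-head self-attention followed by Add & Norm. The function names, the toy sizes (5 tokens, d_model=16, 4 heads) and the random weights are assumptions made for illustration, and the learnable gain and bias of LayerNorm are left out.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalise each token's feature vector (learnable gain/bias omitted).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, n_heads):
    seq_len, d_model = x.shape
    d_head = d_model // n_heads

    def split_heads(t):
        # (seq_len, d_model) -> (n_heads, seq_len, d_head)
        return t.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    # Queries, keys and values are all projections of the same input X.
    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (n_heads, seq_len, seq_len)
    weights = softmax(scores, axis=-1)                     # each row sums to 1
    heads = weights @ v                                    # (n_heads, seq_len, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o, weights

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 5, 16, 4          # toy sizes, chosen for illustration
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (
    rng.normal(scale=d_model ** -0.5, size=(d_model, d_model)) for _ in range(4)
)

attn_out, attn_weights = multi_head_self_attention(x, w_q, w_k, w_v, w_o, n_heads)
x_prime = layer_norm(x + attn_out)            # X' = LayerNorm(X + MultiHeadAttn(X, X, X))
```

The output x_prime has the same shape as x, which is what allows the residual addition and lets layers be stacked.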

Sub-layer 2 — Position-wise FFN:

X'' = LayerNorm(X' + FFN(X'))

where FFN(x) = max(0, x W_1 + b_1) W_2 + b_2 — two linear layers with a ReLU in between, applied identically and independently to each position. The inner dimension is typically 4× wider than d_model (e.g. 2048 for d_model=512).
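
The phrase "applied identically and independently to each position" can be checked directly. The sketch below implements the FFN formula in NumPy and compares running it on the whole sequence against running it one position at a time. The sizes and random weights are illustrative assumptions (d_ff is set to 4 × d_model to match the ratio above).

```python
import numpy as np

def ffn(x, w1, b1, w2, b2):
    # FFN(x) = max(0, x W_1 + b_1) W_2 + b_2
    return np.maximum(0.0, x @ w1 + b1) @ w2 + b2

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
d_ff = 4 * d_model                            # inner dimension, 4x wider than d_model

w1 = rng.normal(size=(d_model, d_ff))
b1 = rng.normal(size=d_ff)
w2 = rng.normal(size=(d_ff, d_model))
b2 = rng.normal(size=d_model)
x = rng.normal(size=(seq_len, d_model))

# Whole sequence at once versus one position at a time.
out_all = ffn(x, w1, b1, w2, b2)
out_per_position = np.stack([ffn(x[i], w1, b1, w2, b2) for i in range(seq_len)])
same_result = np.allclose(out_all, out_per_position)

# Reordering the positions simply reorders the outputs: no mixing across tokens.
perm = np.array([2, 0, 3, 1])
permutation_equivariant = np.allclose(ffn(x[perm], w1, b1, w2, b2), out_all[perm])
```

Both checks hold because the FFN never looks across positions. All mixing between tokens happens in the attention sub-layer.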

Why residual connections? They give gradients a direct shortcut from the loss back to earlier layers, preventing degradation as depth increases — the same motivation as ResNets.
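
The shortcut can be seen in one line of calculus. As a simplified sketch, ignore the LayerNorm and write a residual sub-layer as y = x + F(x), where F stands for either the attention or the FFN sub-layer and \mathcal{L} is the loss (notation introduced here for illustration):

```latex
y = x + F(x)
\quad\Longrightarrow\quad
\frac{\partial \mathcal{L}}{\partial x}
  = \frac{\partial \mathcal{L}}{\partial y}\left(I + \frac{\partial F}{\partial x}\right)
  = \frac{\partial \mathcal{L}}{\partial y}
  + \frac{\partial \mathcal{L}}{\partial y}\,\frac{\partial F}{\partial x}
```

The first term passes the upstream gradient through unchanged, so some signal reaches earlier layers even when the Jacobian of F is small. In the post-norm arrangement used above, the LayerNorm Jacobian also multiplies this expression, so the identity path is not perfectly clean.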

Why layer normalisation? Unlike batch normalisation, LayerNorm normalises across the feature dimension for each token independently. Its statistics therefore do not depend on the sequence length or on the other examples in the batch, so it behaves the same in training and at inference, including when the batch size is 1.
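
The difference shows up in a small NumPy experiment: normalise a batch of 8 sequences, then normalise the first sequence on its own, and compare. This is a sketch under assumptions: batch_norm_train is a hypothetical helper that mimics training-mode batch statistics (per feature, across batch and positions), and the learnable scale and shift of both norms are omitted.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Statistics per token, over the feature dimension only.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def batch_norm_train(x, eps=1e-5):
    # Statistics per feature, over the batch and sequence dimensions.
    mean = x.mean(axis=(0, 1), keepdims=True)
    var = x.var(axis=(0, 1), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
batch = rng.normal(size=(8, 5, 16))           # (batch, seq_len, d_model)
single = batch[:1]                            # the first sequence as a batch of 1

# LayerNorm gives the first sequence the same output either way.
ln_same = np.allclose(layer_norm(batch)[:1], layer_norm(single))

# Batch-statistic normalisation depends on what else is in the batch.
bn_same = np.allclose(batch_norm_train(batch)[:1], batch_norm_train(single))
```

Here ln_same is True and bn_same is False: the LayerNorm output for a token is a function of that token alone.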

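The full encoder stacks N layers that are identical in structure but have separate weights (N=6 in the original paper; later models range from 12 layers to 96 or more). Each layer refines the output of the one below it, which is how depth produces progressively more abstract, context-dependent representations.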

Input embeddings + Positional Encoding
    ↓
[MultiHeadAttn → Add&Norm → FFN → Add&Norm]  ×  N
    ↓
Encoder output (used as K, V in cross-attention by decoder)
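
Putting the diagram together, the following is a minimal end-to-end NumPy sketch of the encoder's forward pass. Several details are assumptions made for illustration: sinusoidal positional encodings (the diagram does not say which kind), random untrained weights, toy sizes, no padding mask, no dropout, and LayerNorm without learnable parameters. The helper names (init_layer, encoder_layer, encoder) are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def sinusoidal_positions(seq_len, d_model):
    # Assumed encoding: sin on even feature indices, cos on odd ones (d_model even).
    pos = np.arange(seq_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / (10000 ** (i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def init_layer(rng, d_model, d_ff):
    s = d_model ** -0.5
    return {
        "w_q": rng.normal(scale=s, size=(d_model, d_model)),
        "w_k": rng.normal(scale=s, size=(d_model, d_model)),
        "w_v": rng.normal(scale=s, size=(d_model, d_model)),
        "w_o": rng.normal(scale=s, size=(d_model, d_model)),
        "w1": rng.normal(scale=s, size=(d_model, d_ff)),
        "b1": np.zeros(d_ff),
        "w2": rng.normal(scale=d_ff ** -0.5, size=(d_ff, d_model)),
        "b2": np.zeros(d_model),
    }

def self_attention(x, p, n_heads):
    seq_len, d_model = x.shape
    d_head = d_model // n_heads
    split = lambda t: t.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    q, k, v = split(x @ p["w_q"]), split(x @ p["w_k"]), split(x @ p["w_v"])
    weights = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_head))
    heads = (weights @ v).transpose(1, 0, 2).reshape(seq_len, d_model)
    return heads @ p["w_o"]

def encoder_layer(x, p, n_heads):
    # MultiHeadAttn -> Add&Norm -> FFN -> Add&Norm
    x = layer_norm(x + self_attention(x, p, n_heads))
    ffn_out = np.maximum(0.0, x @ p["w1"] + p["b1"]) @ p["w2"] + p["b2"]
    return layer_norm(x + ffn_out)

def encoder(token_ids, embedding, layers, n_heads):
    # Input embeddings + positional encoding, then N layers in sequence.
    x = embedding[token_ids] + sinusoidal_positions(len(token_ids), embedding.shape[1])
    for p in layers:
        x = encoder_layer(x, p, n_heads)
    return x

rng = np.random.default_rng(0)
vocab, d_model, d_ff, n_heads, n_layers = 50, 16, 64, 4, 6   # toy sizes, N=6
embedding = rng.normal(size=(vocab, d_model))
layers = [init_layer(rng, d_model, d_ff) for _ in range(n_layers)]   # separate weights per layer
token_ids = np.array([3, 14, 15, 9, 2])

encoder_output = encoder(token_ids, embedding, layers, n_heads)     # shape (5, 16)
```

Every layer maps a (seq_len, d_model) array to another array of the same shape, so the stack is just a loop. In an encoder-decoder model, encoder_output is what the decoder's cross-attention would project into keys and values.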
