Inside the transformer block
Attention is only half a transformer block. The feed-forward sublayer, residual connections, and normalization — plus why pre-norm and RMSNorm became the modern default.
What you'll learn
- The two sublayers of a block — attention and the feed-forward (MLP) network
- Why residual connections and normalization make deep transformers trainable
- Pre-norm vs post-norm, and why RMSNorm is the modern default
Before you start
People say “transformers are attention,” and that’s only half true. Self-attention lets tokens share information, but a transformer block has two sublayers, not one — and the unglamorous other half, plus the wiring around both, is what actually makes a 100-layer stack trainable. This lesson is about everything that surrounds attention.
A block is two sublayers
Every transformer block is:
- A multi-head attention sublayer — tokens look at each other and mix information across the sequence.
- A feed-forward (FFN/MLP) sublayer — a per-token, two-layer network (typically expand to ~4× the width, apply a nonlinearity, project back). This is where most of a transformer’s parameters live, and where much of its “knowledge” is stored. It processes each position independently.
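To put rough numbers on the claim that most parameters sit in the feed-forward sublayer, here is a back-of-the-envelope count. Treat it as a sketch under stated assumptions: standard multi-head attention with four dim-by-dim projections (query, key, value, output), an FFN with the 4× expansion described above, and biases and norm weights ignored. The function names and the width of 768 are illustrative choices, not something from this lesson.

```python
def attention_params(dim):
    # Assumption: Q, K, V and output projections, each a dim x dim matrix (no biases).
    return 4 * dim * dim


def ffn_params(dim, mlp_ratio=4):
    # Up-projection (dim -> mlp_ratio * dim) plus down-projection back (no biases).
    return 2 * dim * (mlp_ratio * dim)


dim = 768  # illustrative width
attn = attention_params(dim)
ffn = ffn_params(dim)
ffn_share = ffn / (attn + ffn)

print(f"attention: {attn:,}  ffn: {ffn:,}  ffn share: {ffn_share:.1%}")
```

Under these assumptions the FFN holds 8·dim² weights against attention's 4·dim², so about two thirds of each block. Variants that change the projection shapes or the expansion ratio shift that split.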
Attention moves information between tokens; the FFN does heavy computation on each token. You need both. And each sublayer is wrapped in the same two-part harness — a residual connection and a normalization — which is the part worth slowing down on.
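The split between "moving information" and "per-token computation" can be seen in a toy example. This is a minimal sketch in plain Python, not real attention: `mean_mix` is a hypothetical stand-in that just averages all tokens, and the weights are made-up numbers. It changes only the first token of a sequence and checks which outputs move.

```python
import math


def gelu(v):
    # Exact GELU via the error function.
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def ffn(token, w_up, w_down):
    # Expand, apply the nonlinearity, project back (biases omitted for brevity).
    hidden = [gelu(sum(w * t for w, t in zip(row, token))) for row in w_up]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w_down]


def mean_mix(tokens):
    # Crude stand-in for attention: every position receives the average token.
    dim = len(tokens[0])
    avg = [sum(tok[i] for tok in tokens) / len(tokens) for i in range(dim)]
    return [avg[:] for _ in tokens]


# Hypothetical toy weights: width 2, hidden width 8 (the 4x expansion).
w_up = [[0.1 * (i + 1), -0.05 * (i + 1)] for i in range(8)]
w_down = [[0.02 * (j + 1) * (1 if i == 0 else -1) for j in range(8)] for i in range(2)]

seq_a = [[1.0, 0.0], [0.5, -1.0], [3.0, 0.0]]
seq_b = [[9.0, -4.0], [0.5, -1.0], [3.0, 0.0]]  # only position 0 differs

ffn_a = [ffn(tok, w_up, w_down) for tok in seq_a]
ffn_b = [ffn(tok, w_up, w_down) for tok in seq_b]
mix_a = mean_mix(seq_a)
mix_b = mean_mix(seq_b)

print("FFN output at position 1 unchanged:", ffn_a[1] == ffn_b[1])
print("Mixing output at position 1 unchanged:", mix_a[1] == mix_b[1])
```

The FFN output at positions 1 and 2 is identical in both runs because the FFN never sees the other tokens, while the mixing step changes every position as soon as one token changes.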
Residuals: the gradient highway
Each sublayer is wrapped in a residual connection: instead of x = Sublayer(x),
it’s x = x + Sublayer(x). The sublayer only has to learn a correction to its
input, and — crucially — the + x gives gradients a direct path backward that
skips the sublayer’s multiplications. As we saw in the lesson on
vanishing gradients, that identity path is what
lets gradients survive dozens or hundreds of blocks. Without residuals, deep
transformers simply don’t train.
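The gradient argument can be checked on a scalar toy model. This is a sketch under strong simplifying assumptions: each "sublayer" is the hypothetical function 0.5 · tanh(x) rather than attention or an MLP, and the gradient is a product of scalar derivatives rather than Jacobian matrices. It compares the gradient of the output with respect to the input for a plain stack and a residual stack.

```python
import math


def sublayer(x, w=0.5):
    # Toy stand-in for a sublayer; its derivative is never larger than w.
    return w * math.tanh(x)


def sublayer_grad(x, w=0.5):
    # d/dx of w * tanh(x)
    return w * (1.0 - math.tanh(x) ** 2)


def input_gradient(x, depth, residual):
    # Chain rule: multiply the local derivative of every layer,
    # each evaluated at that layer's own input.
    grad = 1.0
    for _ in range(depth):
        local = sublayer_grad(x)
        if residual:
            grad *= 1.0 + local  # derivative of x + sublayer(x)
            x = x + sublayer(x)
        else:
            grad *= local  # derivative of sublayer(x) alone
            x = sublayer(x)
    return grad


plain_grad = input_gradient(1.0, depth=50, residual=False)
residual_grad = input_gradient(1.0, depth=50, residual=True)

print(f"plain stack:    {plain_grad:.3e}")
print(f"residual stack: {residual_grad:.3e}")
```

In the plain stack every factor is at most 0.5, so fifty layers shrink the gradient to practically nothing. In the residual stack every factor is 1 plus something, so the identity term keeps the product from collapsing.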
Normalization: keep the scale in check
The other wrapper is normalization, which rescales activations so they don’t drift to extreme values layer after layer. The placement matters more than beginners expect:
- Post-norm (the original 2017 Transformer): Norm(x + Sublayer(x)). The norm sits on the residual stream itself. It works, but deep stacks are unstable and need careful learning-rate warmup.
- Pre-norm (GPT-2 onward, now standard): x + Sublayer(Norm(x)). The norm is inside the residual branch, so the residual highway stays clean. Far more stable at depth.
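One way to see what "the residual highway stays clean" means is to plug in a sublayer that contributes nothing. The sketch below is plain Python on a single token vector, with a hypothetical `zero_sublayer` and a layer norm without learned scale or shift. It is meant only to show where the norm sits in each variant, not to model training stability.

```python
import math


def layer_norm(x, eps=1e-6):
    # Normalize one token vector to zero mean and unit variance (no learned scale/shift).
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def post_norm_block(x, sublayer):
    # Norm(x + Sublayer(x)): the residual stream itself gets normalized.
    summed = [a + b for a, b in zip(x, sublayer(x))]
    return layer_norm(summed)


def pre_norm_block(x, sublayer):
    # x + Sublayer(Norm(x)): only the branch input is normalized.
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]


def zero_sublayer(h):
    # Hypothetical sublayer that contributes nothing.
    return [0.0 for _ in h]


x = [1.0, 2.0, 4.0, 9.0]
pre_out = pre_norm_block(x, zero_sublayer)
post_out = post_norm_block(x, zero_sublayer)

print("pre-norm passes x through unchanged:", pre_out == x)
print("post-norm passes x through unchanged:", post_out == x)
```

With pre-norm, the input reaches the output untouched, so the identity path through a deep stack is never rescaled. With post-norm, even a do-nothing sublayer leaves the stream re-centered and rescaled at every block.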
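import torch
from torch import nn

# RMSNorm: what most 2026 LLMs use (no mean subtraction, no bias)
class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))  # learned per-feature gain
        self.eps = eps  # keeps the reciprocal square root finite for all-zero inputs

    def forward(self, x):
        # Reciprocal of the root-mean-square over the feature (last) dimension
        inv_rms = x.pow(2).mean(-1, keepdim=True).add(self.eps).rsqrt()
        return x * inv_rms * self.weight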
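To make "no mean subtraction" concrete, here is a small comparison in plain Python. It is a sketch with the learned gain fixed at 1 and no learned shift, so it leaves out the trainable parts of both norms. The helper names are illustrative.

```python
import math


def layer_norm(x, eps=1e-6):
    # Subtract the mean, then divide by the standard deviation.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def rms_norm(x, eps=1e-6):
    # Divide by the root-mean-square only; the mean is left where it is.
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale for v in x]


def rms(x):
    return math.sqrt(sum(v * v for v in x) / len(x))


x = [1.0, 2.0, 4.0, 9.0]
rms_out = rms_norm(x)
ln_out = layer_norm(x)
out_mean = sum(rms_out) / len(rms_out)

# On an input that already has zero mean, the two norms coincide.
centered = [v - sum(x) / len(x) for v in x]
max_diff = max(abs(a - b) for a, b in zip(rms_norm(centered), layer_norm(centered)))

print(f"RMS of rms_norm output:  {rms(rms_out):.6f}")
print(f"mean of rms_norm output: {out_mean:.6f}")
print(f"max difference on centered input: {max_diff:.2e}")
```

RMSNorm fixes the scale of each token vector but leaves its mean alone, which saves the centering step. When the input is already centered, the two give the same result.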
The whole block, in code
Here is one complete pre-norm block — the unit that gets stacked N times to make a transformer:
# Assumes `from torch import nn`, the RMSNorm class above, and a
# MultiHeadAttention module defined elsewhere (it is not shown in this lesson).
class Block(nn.Module):
    def __init__(self, dim, heads, mlp_ratio=4):
        super().__init__()
        self.norm1 = RMSNorm(dim)
        self.attn = MultiHeadAttention(dim, heads)
        self.norm2 = RMSNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim),  # expand to mlp_ratio x the width
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),  # project back to the model width
        )

    def forward(self, x):
        x = x + self.attn(self.norm1(x))  # pre-norm attention sublayer
        x = x + self.mlp(self.norm2(x))   # pre-norm feed-forward sublayer
        return x
See it? Two sublayers, each x = x + sublayer(norm(x)). Stack a dozen or a
hundred of these, add token embeddings at the bottom and an output head on
top, and you have a GPT-style model.
Quick check
Next
Stack these blocks and you have the full Transformer architecture. From there: the encoder/decoder families and how Mixture-of-Experts makes the feed-forward sublayer sparse to scale capacity cheaply.
Practice this in an interview
A transformer block has a multi-head self-attention sublayer and a position-wise feed-forward sublayer, each wrapped in a residual connection and normalization. Post-norm (the original transformer) applies normalization after the residual add, while pre-norm applies it inside the residual branch before the sublayer; pre-norm gives more stable gradients and is standard in modern deep LLMs.
Residual connections give gradients a direct path from the loss to every layer, preventing degradation with depth. Layer normalization stabilizes activations within each token's representation independently of batch size and sequence length, enabling stable training at large depth and with the variable-length sequences typical in NLP.
Each encoder layer applies multi-head self-attention followed by a position-wise feed-forward network, with a residual connection and layer normalization wrapped around each sublayer. Stacking N such layers lets the network build progressively more abstract contextualized representations.
Batch norm normalizes each feature across the samples in a batch, so it depends on batch statistics and behaves differently in training versus inference; layer norm normalizes across the features of a single example, independent of batch size. Transformers use layer norm because sequence models have variable lengths and small or varying batches, where per-example normalization is more stable.