datarekha

Inside the transformer block

Attention is only half a transformer block. The feed-forward sublayer, residual connections, and normalization — plus why pre-norm and RMSNorm became the modern default.

9 min read Advanced Deep Learning Lesson 21 of 27

What you'll learn

  • The two sublayers of a block — attention and the feed-forward (MLP) network
  • Why residual connections and normalization make deep transformers trainable
  • Pre-norm vs post-norm, and why RMSNorm is the modern default

People say “transformers are attention,” and that’s only half true. Self-attention lets tokens share information, but a transformer block has two sublayers, not one — and the unglamorous other half, plus the wiring around both, is what actually makes a 100-layer stack trainable. This lesson is about everything that surrounds attention.

A block is two sublayers

Every transformer block is:

  1. A multi-head attention sublayer — tokens look at each other and mix information across the sequence.
  2. A feed-forward (FFN/MLP) sublayer: a per-token, two-layer network (typically expand to ~4× the width, apply a nonlinearity, project back). It processes each position independently, using the same weights at every position. In the standard 4× configuration it holds roughly two-thirds of a block’s weights, and it is widely thought to be where much of a transformer’s “knowledge” is stored.
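
To see what “per-token” means in practice, here is a minimal sketch in plain Python (no PyTorch, so it runs anywhere). The helper names `gelu`, `ffn_token`, and `ffn`, the tiny weight matrices, and the omission of biases are all illustrative assumptions, not part of any library.

```python
import math

def gelu(v):
    """Exact GELU via the Gaussian error function."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))

def ffn_token(token, w_in, w_out):
    """Two-layer MLP on ONE token vector: expand, apply GELU, project back."""
    hidden_act = [gelu(sum(t * w for t, w in zip(token, row))) for row in w_in]
    return [sum(h * w for h, w in zip(hidden_act, row)) for row in w_out]

def ffn(sequence, w_in, w_out):
    """Apply the same MLP to every position independently."""
    return [ffn_token(tok, w_in, w_out) for tok in sequence]

# Hypothetical tiny weights: width 2, hidden width 8 (a 4x expansion), no biases.
dim, hidden = 2, 8
w_in = [[0.1 * (i + j + 1) for i in range(dim)] for j in range(hidden)]
w_out = [[0.05 * (i - j) for i in range(hidden)] for j in range(dim)]

seq = [[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]]
out = ffn(seq, w_in, w_out)

# Replace token 0 only: the outputs at positions 1 and 2 do not change.
seq2 = [[9.0, 9.0]] + seq[1:]
out2 = ffn(seq2, w_in, w_out)
print(out[1:] == out2[1:])  # True
```

Nothing in `ffn` lets one position see another, which is exactly why the block needs attention as well.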

Attention moves information between tokens; the FFN does heavy computation on each token. You need both. And each sublayer is wrapped in the same two-part harness — a residual connection and a normalization — which is the part worth slowing down on.
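
A rough weight count shows how the two sublayers compare in size. This is a back-of-the-envelope sketch: the function name `block_param_counts` and the example width of 768 are illustrative assumptions, biases and norm weights are ignored, and the attention sublayer is assumed to use four dim × dim projections (query, key, value, output).

```python
def block_param_counts(dim, mlp_ratio=4):
    """Approximate weight counts for one transformer block (biases ignored)."""
    # Attention: query, key, value, and output projections, each dim x dim.
    attn = 4 * dim * dim
    # Feed-forward: expand dim -> mlp_ratio * dim, then project back to dim.
    ffn = 2 * mlp_ratio * dim * dim
    return attn, ffn

attn, ffn = block_param_counts(dim=768)
share = ffn / (attn + ffn)
print(f"attention: {attn:,}  ffn: {ffn:,}  ffn share: {share:.2f}")
# attention: 2,359,296  ffn: 4,718,592  ffn share: 0.67
```

Under these assumptions the feed-forward sublayer holds about two-thirds of a block's weights, whatever the width.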

Residuals: the gradient highway

Each sublayer is wrapped in a residual connection: instead of x = Sublayer(x), it’s x = x + Sublayer(x). The sublayer only has to learn a correction to its input, and, crucially, the + x term gives gradients a direct path backward that skips the sublayer’s multiplications. As we saw in the vanishing gradients lesson, that identity path is what lets gradients survive dozens or hundreds of blocks. Without residuals, deep transformers are extremely hard to train.
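
A scalar toy makes the effect visible. The sketch below is an illustration under strong assumptions, not a real network: each “sublayer” is just multiplication by a small weight w, and the hypothetical helper `stacked_gradient` multiplies the per-layer derivatives together the way the chain rule would.

```python
def stacked_gradient(num_layers, w, residual):
    """d(output)/d(input) for a stack of scalar layers f(x) = w * x."""
    grad = 1.0
    for _ in range(num_layers):
        # Chain rule: a plain layer contributes a factor w,
        # a residual layer x + w * x contributes a factor 1 + w.
        grad *= (1.0 + w) if residual else w
    return grad

plain = stacked_gradient(100, 0.01, residual=False)
with_skip = stacked_gradient(100, 0.01, residual=True)
print(f"plain: {plain:.3e}  residual: {with_skip:.3f}")
```

With w = 0.01 and 100 layers, the plain stack's gradient collapses to roughly 1e-200, while the residual stack's stays near 2.7, because every factor is 1 + w instead of w. Real sublayers have Jacobian matrices rather than a single scalar, but the identity term plays the same role.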

Normalization: keep the scale in check

The other wrapper is normalization, which rescales activations so they don’t drift to extreme values layer after layer. The placement matters more than beginners expect:

  • Post-norm (the original 2017 Transformer): Norm(x + Sublayer(x)). The norm sits on the residual stream itself. It works, but deep stacks tend to be unstable and need careful learning-rate warmup.
  • Pre-norm (GPT-2 onward, now standard): x + Sublayer(Norm(x)). The norm is inside the residual branch, so the residual highway stays clean. Far more stable at depth.
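
The two wirings can be compared on a single vector. This is a minimal plain-Python sketch under stated assumptions: `rms_normalize` is a simple stand-in normalization that rescales to unit root-mean-square, and `toy_sublayer` is a hypothetical placeholder for attention or the FFN.

```python
import math

def rms_normalize(x, eps=1e-6):
    """Rescale a vector to (approximately) unit root-mean-square."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def toy_sublayer(x):
    """Hypothetical stand-in for attention or the FFN: a small update."""
    return [0.1 * v for v in x]

def post_norm_step(x):
    # Norm(x + Sublayer(x)): the residual stream itself is renormalized.
    summed = [a + b for a, b in zip(x, toy_sublayer(x))]
    return rms_normalize(summed)

def pre_norm_step(x):
    # x + Sublayer(Norm(x)): x passes through the addition untouched.
    update = toy_sublayer(rms_normalize(x))
    return [a + b for a, b in zip(x, update)]

x = [3.0, -4.0]
post = post_norm_step(x)
pre = pre_norm_step(x)
print("post-norm:", post)
print("pre-norm: ", pre)
```

In the post-norm output the original scale of x is gone, because the whole stream was rescaled. In the pre-norm output, x is still there with a small update added on top, which is the “clean highway” the bullet describes.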
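
The norm itself has changed too. LayerNorm subtracts the mean and divides by the standard deviation; RMSNorm skips the mean subtraction and the bias and simply divides by the root mean square, which is cheaper to compute and works about as well in practice.

import torch
import torch.nn as nn

# RMSNorm: what most 2026 LLMs use (no mean subtraction, no bias)
class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))  # learned per-feature gain
        self.eps = eps

    def forward(self, x):
        # 1 / sqrt(mean(x^2) + eps), computed over the feature dimension
        inv_rms = x.pow(2).mean(-1, keepdim=True).add(self.eps).rsqrt()
        return x * inv_rms * self.weight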
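
To make “no mean subtraction” concrete, here is a small plain-Python comparison on one vector. It is a sketch under simplifying assumptions: both functions leave out the learned gain (and LayerNorm's bias), and the names `layer_norm` and `rms_norm` are illustrative helpers, not the PyTorch modules.

```python
import math

def layer_norm(x, eps=1e-6):
    """LayerNorm without learned gain/bias: subtract mean, divide by std."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def rms_norm(x, eps=1e-6):
    """RMSNorm without learned gain: divide by the root mean square only."""
    mean_sq = sum(v * v for v in x) / len(x)
    return [v / math.sqrt(mean_sq + eps) for v in x]

x = [1.0, 2.0, 3.0, 4.0]
ln = layer_norm(x)
rn = rms_norm(x)
print("LayerNorm output mean:", sum(ln) / len(ln))
print("RMSNorm output mean:  ", sum(rn) / len(rn))
print("RMSNorm output mean square:", sum(v * v for v in rn) / len(rn))
```

LayerNorm recentres the vector to zero mean, while RMSNorm only rescales it: the mean stays positive here, and the mean square comes out at about 1.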

The whole block, in code

Here is one complete pre-norm block — the unit that gets stacked N times to make a transformer:

# Assumes `import torch.nn as nn`, the RMSNorm class above, and a
# MultiHeadAttention(dim, heads) module defined elsewhere (not shown here).
class Block(nn.Module):
    def __init__(self, dim, heads, mlp_ratio=4):
        super().__init__()
        self.norm1 = RMSNorm(dim)
        self.attn  = MultiHeadAttention(dim, heads)
        self.norm2 = RMSNorm(dim)
        self.mlp   = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim),   # expand to mlp_ratio x the width
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),   # project back to the model width
        )

    def forward(self, x):
        x = x + self.attn(self.norm1(x))   # pre-norm attention sublayer
        x = x + self.mlp(self.norm2(x))    # pre-norm feed-forward sublayer
        return x

See it? Two sublayers, each x = x + sublayer(norm(x)). Stack a dozen or a hundred of these, add token embeddings at the bottom and an output head at the top, and you have a GPT-style model.

Quick check

Q1. What are the two sublayers of a transformer block, and what does each do?
Q2. Why are residual connections (x + Sublayer(x)) essential in deep transformers?
Q3. What's the difference between pre-norm and post-norm, and which do modern LLMs use?

Next

Stack these blocks and you have the full Transformer architecture. From there: the encoder/decoder families and how Mixture-of-Experts makes the feed-forward sublayer sparse to scale capacity cheaply.

Practice this in an interview

Describe the components of a transformer block and the difference between pre-norm and post-norm.

A transformer block has a multi-head self-attention sublayer and a position-wise feed-forward sublayer, each wrapped in a residual connection and normalization. Post-norm (the original transformer) applies normalization after the residual add, while pre-norm applies it inside the residual branch before the sublayer; pre-norm gives more stable gradients and is standard in modern deep LLMs.

What roles do residual connections and layer normalisation play in transformer training?

Residual connections give gradients a direct path from the loss to every layer, preventing degradation with depth. Layer normalisation stabilises activations within each token's representation independently of batch size and sequence length, enabling stable training at large depth and with the variable-length sequences typical in NLP.

Walk me through the transformer encoder architecture block by block.

Each encoder layer applies multi-head self-attention followed by a position-wise feed-forward network, with a residual connection and layer normalisation wrapped around each sub-layer. Stacking N such layers lets the network build progressively more abstract contextualised representations.

What is the difference between batch normalization and layer normalization, and why do transformers use layer norm?

Batch norm normalizes each feature across the samples in a batch, so it depends on batch statistics and behaves differently in training versus inference; layer norm normalizes across the features of a single example, independent of batch size. Transformers use layer norm because sequence models have variable lengths and small or varying batches, where per-example normalization is more stable.
