datarekha

The Transformer Architecture

The full encoder-decoder Transformer from 'Attention Is All You Need', end to end — every block, why it exists, and how the same lego pieces become BERT, GPT, and a translator.

11 min read Advanced Deep Learning Lesson 13 of 17

What you'll learn

  • The exact encoder and decoder layer recipe, in canonical order
  • Why each block exists — embeddings, positional encoding, attention, residuals, FFN
  • The load-bearing difference between masked self-attention and cross-attention
  • How encoder-only, decoder-only, and encoder-decoder are the same blocks rearranged
  • Why modern LLMs are decoder-only

Before you start

This is the overview that ties the chapter together. You have already met the pieces in detail — self-attention (the core mechanism), multi-head attention, and positional encodings. Here we assemble them into the whole machine and see the shape of the thing.

The headline from the paper: the Transformer is “based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.” No RNNs, no convolutions — just attention, a couple of MLPs, and a stack of residual connections. That simplicity is exactly why it scaled.

The two halves

The model has a left half and a right half. The encoder reads the entire input at once and turns it into a stack of context-rich vectors. The decoder generates the output one token at a time, looking both at what it has produced so far and — through cross-attention — at the encoder’s output.

[Figure: two stacks side by side. Encoder ×N: reads the whole input with bidirectional self-attention (self-attn → FFN, no masking, every token sees all) and turns the input into contextual vectors. Decoder ×N: generates the output one token at a time (masked self-attn → cross-attn → FFN) and cannot peek at future tokens. An arrow labelled K, V runs from the encoder to the decoder.]
Encoder reads; decoder writes. The K, V arrow is cross-attention — the bridge between them.

Both halves stack N identical layers — the paper used N = 6. “Identical” means identical in structure; each layer has its own independent learned weights. Every sub-layer and embedding outputs vectors of the same width d_model (512 in the paper) so that residual connections line up.

The encoder layer, exactly

Each encoder layer is two sub-layers, in this order:

  1. Multi-Head Self-Attention then Add & Norm
  2. Position-wise Feed-Forward then Add & Norm

That is the whole recipe. Attention mixes information across tokens; the feed-forward network then transforms each position on its own. Each sub-layer is independently wrapped in a residual connection followed by layer normalization — the paper’s formula is LayerNorm(x + Sublayer(x)).
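The two-sub-layer recipe can be sketched in a few lines of NumPy. This is a minimal illustration rather than the paper's implementation: it assumes a single attention head, drops the learned gain and bias of layer normalization, and uses toy sizes. The names `add_and_norm`, `encoder_layer`, and `params` are made up for this sketch.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each position's vector to zero mean and unit variance.
    # (Learned gain and bias are omitted to keep the sketch short.)
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def add_and_norm(x, sublayer):
    # The post-norm wrapper from the text: LayerNorm(x + Sublayer(x)).
    return layer_norm(x + sublayer(x))

def self_attention(x, Wq, Wk, Wv):
    # Single-head scaled dot-product attention (multi-head omitted).
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def feed_forward(x, W1, b1, W2, b2):
    # Position-wise MLP: every row (position) is transformed independently.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

def encoder_layer(x, p):
    # Sub-layer 1: self-attention, then Add & Norm.
    x = add_and_norm(x, lambda h: self_attention(h, p["Wq"], p["Wk"], p["Wv"]))
    # Sub-layer 2: feed-forward, then Add & Norm.
    x = add_and_norm(x, lambda h: feed_forward(h, p["W1"], p["b1"], p["W2"], p["b2"]))
    return x

rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 8, 32, 5   # toy sizes; the paper uses d_model = 512
params = {
    "Wq": rng.normal(size=(d_model, d_model)) * 0.1,
    "Wk": rng.normal(size=(d_model, d_model)) * 0.1,
    "Wv": rng.normal(size=(d_model, d_model)) * 0.1,
    "W1": rng.normal(size=(d_model, d_ff)) * 0.1,
    "b1": np.zeros(d_ff),
    "W2": rng.normal(size=(d_ff, d_model)) * 0.1,
    "b2": np.zeros(d_model),
}
x = rng.normal(size=(seq_len, d_model))
out = encoder_layer(x, params)
print(out.shape)  # same shape as the input, so layers can be stacked
```

Because the output has the same shape as the input, calling `encoder_layer` N times with N separate parameter sets gives the stacked encoder.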

The decoder layer adds one more sub-layer in the middle:

  1. Masked Multi-Head Self-Attention then Add & Norm
  2. Multi-Head Cross-Attention then Add & Norm
  3. Position-wise Feed-Forward then Add & Norm

The order matters, and a lot of mental models get it wrong. So here is the canonical stack, drawn bottom-up the way data actually flows:

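[Figure: the two layer stacks, listed from input (bottom) to output (top). Encoder layer: in → Multi-Head Self-Attention → Add & Norm → Feed-Forward → Add & Norm → out (to next layer). Decoder layer: in (shifted right) → Masked Self-Attention → Add & Norm → Cross-Attention (K, V from encoder) → Add & Norm → Feed-Forward → Add & Norm.]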
Read each stack bottom-up. The decoder is the encoder layer plus a cross-attention sub-layer wedged in the middle.
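To make the decoder order concrete, here is a hedged NumPy sketch of one decoder layer. It assumes single-head attention, no biases, and no learned layer-norm parameters; `decoder_layer`, `attention`, and the parameter dictionary `p` are illustrative names, not anything defined by the paper.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def attention(q_src, kv_src, W, causal=False):
    # Queries come from q_src; keys and values come from kv_src.
    q, k, v = q_src @ W["q"], kv_src @ W["k"], kv_src @ W["v"]
    scores = q @ k.T / np.sqrt(k.shape[-1])
    if causal:
        # Block position i from seeing any position j > i.
        future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ v

def decoder_layer(y, enc_out, p):
    # 1. Masked self-attention: Q, K, V all come from the decoder stream.
    y = layer_norm(y + attention(y, y, p["self"], causal=True))
    # 2. Cross-attention: Q from the decoder, K and V from the encoder output.
    y = layer_norm(y + attention(y, enc_out, p["cross"]))
    # 3. Position-wise feed-forward.
    y = layer_norm(y + np.maximum(0, y @ p["W1"]) @ p["W2"])
    return y

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32   # toy sizes

def attn_weights():
    return {name: rng.normal(size=(d_model, d_model)) * 0.1 for name in "qkv"}

p = {"self": attn_weights(), "cross": attn_weights(),
     "W1": rng.normal(size=(d_model, d_ff)) * 0.1,
     "W2": rng.normal(size=(d_ff, d_model)) * 0.1}

enc_out = rng.normal(size=(7, d_model))   # 7 source positions
y = rng.normal(size=(4, d_model))         # 4 target positions so far
out = decoder_layer(y, enc_out, p)
print(out.shape)  # one d_model-wide vector per target position
```

The only structural differences from an encoder layer are the `causal=True` flag in step 1 and the extra cross-attention call in step 2.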

Why each block is there

Strip away the names and every block earns its place by solving one concrete problem:

  • Embedding — attention does linear algebra on vectors, not on symbols. The embedding maps each token id to a learned d_model-vector so the vocabulary becomes geometry.
  • Positional Encoding — self-attention on its own is permutation-equivariant: shuffle the input tokens and the outputs are shuffled the same way, with nothing else changed. Without position information the model sees a bag of words. So a position-dependent signal is added to each embedding (added, not concatenated, which is why it shares the width d_model).
  • Multi-Head Self-Attention — lets every token pull in information from every other token in one shot. Several heads run in parallel so the model can attend in different representation subspaces at once.
  • Add & Norm — the residual connection lets gradients flow straight through a deep stack; layer normalization keeps activations well-scaled. Together they are what make stacking N layers actually trainable.
  • Feed-Forward — attention mixes tokens together; the position-wise MLP (two linear layers with a ReLU between) then transforms each position non-linearly on its own.
  • Masked Self-Attention — in the decoder, future positions are blocked so a token can only depend on earlier ones. Without it the model would cheat during training by reading the answer.
  • Cross-Attention — the decoder’s queries read the encoder’s keys and values, so the output can condition on the encoded input. This is the only place the two halves talk.
  • Linear then Softmax — at the very top, a Linear layer projects the final decoder vector to one score per vocabulary word, and Softmax turns those scores into a probability distribution over the next token.
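The positional-encoding point in the list above is easy to check numerically. The sketch below assumes the fixed sinusoidal encoding from the paper, a random stand-in for the learned embedding table, and attention with identity projections (learned projections would not change the conclusion). The helper names are made up for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x):
    # Simplification: identity Q/K/V projections instead of learned ones.
    scores = x @ x.T / np.sqrt(x.shape[-1])
    return softmax(scores) @ x

def sinusoidal_positions(seq_len, d_model):
    # Sinusoidal encoding: sin on even dimensions, cos on odd dimensions.
    # Assumes d_model is even.
    pos = np.arange(seq_len)[:, None]
    even_dims = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

rng = np.random.default_rng(0)
vocab_size, d_model = 10, 8
embedding = rng.normal(size=(vocab_size, d_model))  # learned in a real model

tokens = np.array([3, 1, 4, 1, 5])
perm = np.array([4, 2, 0, 3, 1])   # an arbitrary reordering of the positions

# Without positions: permuting the input only permutes the output rows.
x = embedding[tokens]
no_pos_equivariant = np.allclose(self_attention(x)[perm], self_attention(x[perm]))

# With positions added (same width d_model, so plain addition works),
# the same tokens in a different order give genuinely different outputs.
pe = sinusoidal_positions(len(tokens), d_model)
with_pos = self_attention(embedding[tokens] + pe)
with_pos_shuffled = self_attention(embedding[tokens[perm]] + pe)
order_matters = not np.allclose(with_pos[perm], with_pos_shuffled)

print(no_pos_equivariant, order_matters)
```

The first flag shows that attention alone cannot tell word order apart; the second shows that adding a position signal to the embeddings is enough to break that symmetry.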

Masked self-attention is not cross-attention

This is the distinction the diagram is built to drill in, because it is the one people get wrong most often.

Sub-layer             | Where           | Q from  | K, V from                        | Job
Masked self-attention | decoder, first  | decoder | the decoder’s own previous layer | mix in earlier output tokens, future blocked
Cross-attention       | decoder, second | decoder | the encoder output               | condition the output on the encoded input

Same attention math, completely different sources for the keys and values. The masked self-attention layer has a causal mask so a token never sees the future. The cross-attention layer has no causal mask at all — the decoder is allowed to look at the entire input sequence, because the input is already fully known.
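The difference shows up directly in the attention weight matrices. The following sketch assumes toy random vectors and skips the learned Q/K/V projections; `attention_weights` is a hypothetical helper written for this illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention_weights(queries, keys, causal=False):
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])
    if causal:
        # True above the diagonal marks "future" positions to block.
        future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    return softmax(scores)

rng = np.random.default_rng(0)
d_model = 8
dec = rng.normal(size=(4, d_model))  # 4 target tokens produced so far
enc = rng.normal(size=(6, d_model))  # 6 source tokens, fully known

# Masked self-attention: queries and keys both come from the decoder.
self_w = attention_weights(dec, dec, causal=True)   # shape (4, 4)
# Cross-attention: queries from the decoder, keys from the encoder, no mask.
cross_w = attention_weights(dec, enc)               # shape (4, 6)

print(np.round(self_w, 2))   # zeros above the diagonal
print(np.round(cross_w, 2))  # every row spreads over all source tokens
```

The self-attention matrix is square and lower-triangular; the cross-attention matrix is rectangular (target length by source length) and dense.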

The output layer: Linear, then Softmax

Order matters here too. The decoder’s final hidden vector is d_model-wide. A Linear layer projects it up to the size of the vocabulary, producing one raw score — a logit — per possible next token. Softmax then exponentiates and normalizes those logits into a probability distribution. Linear first, softmax second. (More on the output layer in the softmax lesson.)

Softmax itself is three lines of NumPy — exponentiate, then divide by the sum, with a max-subtraction for numerical stability:
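A minimal sketch of that softmax, followed by the Linear-then-Softmax order on toy data. The sizes, the random weights, and the names `hidden` and `W_out` are assumptions for illustration, and the Linear layer's bias is omitted.

```python
import numpy as np

def softmax(logits):
    shifted = logits - np.max(logits, axis=-1, keepdims=True)  # stability
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=-1, keepdims=True)

# Hypothetical toy sizes: d_model = 4, vocabulary of 6 tokens.
rng = np.random.default_rng(0)
hidden = rng.normal(size=4)           # final decoder vector, d_model wide
W_out = rng.normal(size=(4, 6))       # the Linear layer
logits = hidden @ W_out               # Linear first: one raw score per token
probs = softmax(logits)               # Softmax second: a distribution
next_token = int(np.argmax(probs))    # greedy pick
print(probs, next_token)
```

The max-subtraction does not change the result, since softmax is unaffected by adding a constant to every logit; it only keeps `np.exp` from overflowing on large scores.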

Every output lies in (0, 1) and they sum to 1, so you can sample from the distribution or just take the argmax for a greedy pick.

The three families — same blocks, rearranged

Here is the key insight: the encoder, the decoder, and the output head are lego pieces. Snap them together differently and you get three different kinds of model. The full lesson on choosing between them is BERT, GPT, T5; here is the structural summary.

Family                            | What it keeps                          | Attention              | Built for
Encoder-only (BERT)               | encoder stack only                     | bidirectional, no mask | understanding: classification, NER, embeddings
Decoder-only (GPT, Llama, Claude) | decoder stack, cross-attention removed | causal/masked          | generation: the dominant LLM design
Encoder-decoder (T5, the original) | both halves                           | masked + cross         | sequence-to-sequence: translation, summarization

The decoder-only column is worth dwelling on. A decoder-only model drops the encoder and the cross-attention sub-layer entirely — there is no encoder output to attend to. Each block is just masked self-attention plus a feed-forward network (with their norms). That is GPT, Llama, and Claude.
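One way to see the "same blocks, rearranged" idea is a single block function with two switches: whether self-attention is causal, and whether there is an encoder output to cross-attend to. This is a sketch under simplifying assumptions (one layer, identity Q/K/V projections, toy random FFN weights); `block` and `attend` are hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 8, 16
W1 = rng.normal(size=(d_model, d_ff)) * 0.1   # toy FFN weights
W2 = rng.normal(size=(d_ff, d_model)) * 0.1

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def attend(q_src, kv_src, causal=False):
    # Simplification: identity Q/K/V projections instead of learned ones.
    scores = q_src @ kv_src.T / np.sqrt(d_model)
    if causal:
        future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ kv_src

def block(x, causal, memory=None):
    x = layer_norm(x + attend(x, x, causal=causal))      # self-attention
    if memory is not None:
        x = layer_norm(x + attend(x, memory))            # cross-attention
    return layer_norm(x + np.maximum(0, x @ W1) @ W2)    # feed-forward

src = rng.normal(size=(6, d_model))   # stand-in for embedded input tokens
tgt = rng.normal(size=(4, d_model))   # stand-in for embedded output tokens

encoder_only = block(src, causal=False)                      # BERT-style
decoder_only = block(tgt, causal=True)                       # GPT-style
memory = block(src, causal=False)                            # encoder half
encoder_decoder = block(tgt, causal=True, memory=memory)     # original layout
print(encoder_only.shape, decoder_only.shape, encoder_decoder.shape)
```

In the decoder-only call there is no `memory` argument at all, which is the code-level version of "no encoder output to attend to".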

Why modern LLMs are decoder-only

If the original was an encoder-decoder, why did the field consolidate on decoder-only? A few reasons:

  • One objective that scales. Predicting the next token is a single, simple training signal that works on essentially any text, and it scales beautifully with data and parameters.
  • The prompt replaces the encoder. Anything you would have fed the encoder, you can just put in the prompt as context the model attends to. So the encoder is not strictly necessary.
  • Architectural simplicity. No separate encoder, no cross-attention — fewer moving parts to scale and serve.
  • It worked. The generative quality of GPT-3, ChatGPT, and GPT-4 settled the argument empirically and made decoder-only the default for general-purpose LLMs.
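The "one objective" in the list above is short enough to write out. The sketch below assumes the model has already produced one row of logits per position; random numbers stand in for real model output, and `next_token_loss` is a name invented for this example.

```python
import numpy as np

def log_softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def next_token_loss(logits, token_ids):
    # logits[t] is the prediction made at position t; its target is the
    # token that actually comes next, token_ids[t + 1].
    preds = logits[:-1]
    targets = token_ids[1:]
    log_probs = log_softmax(preds)
    # Average negative log-likelihood of the true next tokens.
    return -log_probs[np.arange(len(targets)), targets].mean()

rng = np.random.default_rng(0)
vocab_size = 10
token_ids = np.array([2, 7, 1, 8, 2, 8])
logits = rng.normal(size=(len(token_ids), vocab_size))  # stand-in for model output
loss = next_token_loss(logits, token_ids)

# Sanity check: a model that predicts uniformly scores exactly log(vocab_size).
uniform_loss = next_token_loss(np.zeros((len(token_ids), vocab_size)), token_ids)
print(loss, uniform_loss, np.log(vocab_size))
```

Any stretch of text yields inputs and targets this way with no labelling step, which is why the objective applies to essentially any corpus.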

Encoder-only models (BERT and its descendants) are still widely used for classification, retrieval, and embeddings; encoder-decoder models (T5, BART) remain a strong choice for explicit sequence-to-sequence tasks. But for the chat-style models this site is mostly about (see what is an LLM), decoder-only is the shape.

Quick check

Q1. In the decoder, what is the difference between the masked self-attention sub-layer and the cross-attention sub-layer?
Q2. At the very top of the decoder, in what order do the final operations happen, and what do they produce?
Q3. You are building a model to classify the sentiment of a customer review (positive/negative). You do not need to generate any text. Which family fits best, and what does it omit compared to the full Transformer?

Next

You now have the whole machine in view. To go deeper on the pieces, revisit self-attention for the Q·K·V math, multi-head attention for the parallel subspaces, and positional encodings for how order gets injected. To go up a level into how these models are used in practice, head to what is an LLM.

Practice this in an interview

Walk me through the transformer encoder architecture block by block.

Each encoder layer applies multi-head self-attention followed by a position-wise feed-forward network, with a residual connection and layer normalisation wrapped around each sub-layer. Stacking N such layers lets the network build progressively more abstract contextualised representations.

What is the difference between encoder-only, decoder-only, and encoder-decoder transformer architectures?

Encoder-only models build bidirectional context representations suited for classification and embedding tasks. Decoder-only models generate text autoregressively using causal (masked) self-attention and dominate language modelling. Encoder-decoder models use a full encoder to encode the source and a decoder with cross-attention to generate the target, fitting sequence-to-sequence tasks like translation and summarisation.

What is the difference between encoder models like BERT and decoder models like GPT?

BERT is an encoder-only transformer that reads all tokens bidirectionally and is trained with masked language modelling — ideal for tasks requiring a rich contextual representation of an entire sequence, like classification or NER. GPT is a decoder-only transformer that attends only to previous tokens via a causal mask and is trained with next-token prediction — ideal for text generation. Encoder-decoder models like T5 combine both for tasks that map one sequence to another.

Why does a transformer need positional encoding?

Self-attention computes a weighted sum over value vectors where the weights depend only on dot products between queries and keys — there is no notion of position in this operation. Without an explicit positional signal injected into the token embeddings, the model cannot distinguish 'the dog bit the man' from 'the man bit the dog'.
