Deep Learning | Medium | Asked at Google, OpenAI, Meta, Anthropic

What is the difference between encoder-only, decoder-only, and encoder-decoder transformer architectures?

For: ML Engineer, AI / LLM Engineer, Data Scientist

The short answer

Encoder-only models build bidirectional context representations suited to classification and embedding tasks. Decoder-only models generate text autoregressively using causal (masked) self-attention and dominate language modelling. Encoder-decoder models pair a bidirectional encoder that reads the source with a decoder that uses cross-attention to generate the target, which fits sequence-to-sequence tasks like translation and summarisation.

How to think about it

The three families differ in which attention patterns are permitted and in their pretraining objectives:

Encoder-only (e.g. BERT, RoBERTa)

Every token attends to every other token in both directions — full bidirectional self-attention.
Pretraining: masked language modelling (predict masked tokens from full context).
Best for: sentence classification, named entity recognition, extractive QA, producing dense embeddings.
Cannot generate text natively — there is no autoregressive decoding mechanism.
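
As a rough illustration of the two ideas above, the sketch below builds a full (bidirectional) attention mask and a masked-language-modelling training example. The helper names bidirectional_mask and mask_tokens are hypothetical, and the masked positions are passed in explicitly to keep the example deterministic; real MLM pretraining samples them at random.

```python
def bidirectional_mask(n):
    """Entry [i][j] is True when position i may attend to position j.

    In an encoder every position sees every other position, so all entries are True.
    """
    return [[True] * n for _ in range(n)]


def mask_tokens(tokens, positions, mask_token="[MASK]"):
    """Replace the tokens at `positions` with `mask_token`.

    Returns the corrupted input and a dict {position: original token}
    that the model would be trained to predict.
    """
    corrupted = list(tokens)
    targets = {}
    for p in positions:
        targets[p] = tokens[p]
        corrupted[p] = mask_token
    return corrupted, targets


tokens = ["the", "cat", "sat", "on", "the", "mat"]
corrupted, targets = mask_tokens(tokens, [1, 4])
mask = bidirectional_mask(len(tokens))

print(corrupted)  # masked input the encoder reads
print(targets)    # tokens to recover, using context on both sides
```

Because the mask is all True, the prediction for position 1 can draw on "sat on the mat" to its right as well as "the" to its left.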

Decoder-only (e.g. GPT, LLaMA, Claude)

Causal masking — each token attends only to itself and tokens to its left. Future tokens are masked out so the model cannot “cheat” during training.
Pretraining: next-token prediction (standard language modelling).
Best for: open-ended text generation, instruction following, code completion, in-context learning.
Can do classification by prompting. The final hidden state can also serve as an embedding, but without embedding-specific fine-tuning its quality is generally lower than that of a bidirectional encoder.
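
The causal mask described above can be sketched in a few lines. This is a minimal single-head illustration in plain Python, not how any particular model implements it; causal_mask and masked_softmax are hypothetical helper names, and the uniform scores are made up for the example.

```python
import math


def causal_mask(n):
    """Entry [i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def masked_softmax(scores, mask):
    """Row-wise softmax that gives zero weight to masked-out positions."""
    weights = []
    for row, allowed in zip(scores, mask):
        # Subtract the largest allowed score for numerical stability.
        top = max(s for s, ok in zip(row, allowed) if ok)
        exps = [math.exp(s - top) if ok else 0.0 for s, ok in zip(row, allowed)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Uniform raw scores, so any difference in the weights comes from the mask alone.
scores = [[0.0, 0.0, 0.0] for _ in range(3)]
weights = masked_softmax(scores, causal_mask(3))

for row in weights:
    print(row)
```

The first position can only attend to itself, the second splits its attention over two positions, and only the last position sees the whole sequence. Training on next-token prediction then pairs each position with the token that follows it.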

Encoder-decoder (e.g. T5, BART, mT5)

Encoder: full bidirectional attention over the source sequence.
Decoder: causal self-attention over generated tokens + cross-attention where each decoder position attends to the full encoder output.
Pretraining: seq2seq denoising objectives (span corruption in T5; reconstructing corrupted text, e.g. text infilling and sentence permutation, in BART).
Best for: translation, summarisation, question answering with generated (abstractive) answers.
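
Cross-attention differs from self-attention only in where the keys and values come from: the queries are decoder states, while the keys and values are encoder outputs. The sketch below shows this with single-head scaled dot-product attention in plain Python. It omits the learned query/key/value projections a real model would apply, and the function name attention and the toy vectors are assumptions for illustration.

```python
import math


def attention(queries, keys, values):
    """Single-head scaled dot-product attention without learned projections."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the value vectors, one output vector per query.
        outputs.append(
            [sum(w * v[c] for w, v in zip(weights, values)) for c in range(len(values[0]))]
        )
    return outputs


encoder_out = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 source positions
decoder_states = [[0.0, 0.0], [0.0, 0.0]]           # 2 target positions

# Cross-attention: queries from the decoder, keys and values from the encoder.
# No causal mask is applied here, so each target position sees the full source.
context = attention(decoder_states, encoder_out, encoder_out)
print(context)
```

The result has one context vector per target position, each a mixture over all source positions. The decoder's own self-attention, by contrast, stays causal.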

Encoder-only:   [CLS] token1 token2 ... tokenN → classification head
Decoder-only:   token1 → token2 → ... → tokenN (left-to-right only)
Encoder-decoder: source → Encoder → context vectors
                              ↓ (cross-attention)
                 target → Decoder → output
