datarekha
Deep Learning | Medium | Asked at Google, OpenAI, Meta, Anthropic

What is the difference between encoder-only, decoder-only, and encoder-decoder transformer architectures?

The short answer

Encoder-only models build bidirectional context representations suited for classification and embedding tasks. Decoder-only models generate text autoregressively using causal (masked) self-attention and dominate language modelling. Encoder-decoder models read the source with a bidirectional encoder and generate the target with a causal decoder that cross-attends to the encoder output, fitting sequence-to-sequence tasks like translation and summarisation.

How to think about it

The three families differ in which attention patterns are permitted and in their pretraining objectives:

Encoder-only (e.g. BERT, RoBERTa)

  • Every token attends to every other token in both directions — full bidirectional self-attention.
  • Pretraining: masked language modelling (predict masked tokens from full context).
  • Best for: sentence classification, named entity recognition, extractive QA, producing dense embeddings.
  • Cannot generate text natively — there is no autoregressive decoding mechanism.
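
As a rough illustration of the encoder-only bullets above, the sketch below builds a full (unrestricted) attention mask and prepares one masked-language-modelling example. The helper names `full_attention_mask` and `mask_for_mlm` are hypothetical, and the masked positions are fixed by hand so the result is reproducible; BERT-style pretraining picks them at random.

```python
def full_attention_mask(n):
    """allowed[i][j] is True when position i may attend to position j.

    In an encoder-only model every entry is True: no direction is blocked.
    """
    return [[True] * n for _ in range(n)]


def mask_for_mlm(tokens, positions, mask_token="[MASK]"):
    """Hide the chosen positions and return (inputs, labels).

    labels keeps the original token at hidden positions and None elsewhere,
    so a loss would be computed only where a token was masked.
    """
    chosen = set(positions)
    inputs = [mask_token if i in chosen else tok for i, tok in enumerate(tokens)]
    labels = [tok if i in chosen else None for i, tok in enumerate(tokens)]
    return inputs, labels


tokens = ["the", "cat", "sat", "on", "the", "mat"]
inputs, labels = mask_for_mlm(tokens, positions=[1, 4])
mask = full_attention_mask(len(tokens))

print(inputs)  # ['the', '[MASK]', 'sat', 'on', '[MASK]', 'mat']
print(labels)  # [None, 'cat', None, None, 'the', None]
print(all(all(row) for row in mask))  # True
```

Because the mask is all True, the prediction for position 1 can draw on "sat on ... mat" to its right as well as "the" to its left.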

Decoder-only (e.g. GPT, LLaMA, Claude)

  • Causal masking — each token attends only to itself and tokens to its left. Future tokens are masked out so the model cannot “cheat” during training.
  • Pretraining: next-token prediction (standard language modelling).
  • Best for: open-ended text generation, instruction following, code completion, in-context learning.
  • Can do classification by prompting; can also produce embeddings from the final hidden state, though without further fine-tuning these are typically weaker than a bidirectional encoder's, because each token has seen only its left context.
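
The causal masking described above can be sketched in a few lines. This is a minimal illustration, not any particular library's implementation: `causal_mask` and `masked_softmax` are hypothetical helpers, and the attention scores are set to zero so the effect of the mask alone is visible.

```python
import math


def causal_mask(n):
    """allowed[i][j] is True only when j <= i (self and tokens to the left)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def masked_softmax(scores, allowed):
    """Softmax over one row of scores; disallowed positions get weight 0."""
    kept = [s for s, ok in zip(scores, allowed) if ok]
    peak = max(kept)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]


n = 4
mask = causal_mask(n)
scores = [[0.0] * n for _ in range(n)]  # uniform scores, assumed for illustration
weights = [masked_softmax(scores[i], mask[i]) for i in range(n)]

for row in weights:
    print([round(w, 2) for w in row])
# [1.0, 0.0, 0.0, 0.0]
# [0.5, 0.5, 0.0, 0.0]
# [0.33, 0.33, 0.33, 0.0]
# [0.25, 0.25, 0.25, 0.25]
```

Every entry above the diagonal is zero, so position i receives no information from positions after it. Swapping `causal_mask` for an all-True mask would turn the same attention computation into the bidirectional pattern of an encoder.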

Encoder-decoder (e.g. T5, BART, mT5)

  • Encoder: full bidirectional attention over the source sequence.
  • Decoder: causal self-attention over generated tokens + cross-attention where each decoder position attends to the full encoder output.
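  • Pretraining: seq2seq denoising objectives (span corruption in T5; text infilling and sentence shuffling in BART), where the encoder reads a corrupted input and the decoder reconstructs the missing or original text.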
  • Best for: translation, summarisation, abstractive question answering (the answer is generated rather than extracted as a span).

Encoder-only:    [CLS] token1 token2 ... tokenN → classification head
Decoder-only:    token1 → token2 → ... → tokenN (left-to-right only)
Encoder-decoder: source → Encoder → context vectors
                              ↓ (cross-attention)
                 target → Decoder → output
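
The cross-attention step that links the two halves of an encoder-decoder model can be sketched as follows. This is a simplified illustration under stated assumptions: `cross_attention` is a hypothetical helper, the vectors are toy values, and the learned query/key/value projections and multiple heads of a real transformer are omitted.

```python
import math


def softmax(xs):
    peak = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(decoder_states, encoder_outputs):
    """Scaled dot-product attention: queries come from the decoder,
    keys and values come from the encoder output."""
    d = len(encoder_outputs[0])
    contexts, all_weights = [], []
    for query in decoder_states:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
            for key in encoder_outputs
        ]
        weights = softmax(scores)  # one weight per source position
        context = [
            sum(w * value[c] for w, value in zip(weights, encoder_outputs))
            for c in range(d)
        ]
        contexts.append(context)
        all_weights.append(weights)
    return contexts, all_weights


encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 source positions (toy values)
decoder_states = [[1.0, 0.0], [0.0, 1.0]]               # 2 target positions (toy values)
contexts, weights = cross_attention(decoder_states, encoder_outputs)

print(len(weights), len(weights[0]))  # 2 3
```

The weight matrix has one row per target position and one column per source position, and no mask is applied: each decoder position may look at the whole source. The causal restriction applies only to the decoder's self-attention over previously generated target tokens.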