datarekha
NLP & LLMs · Medium · Asked at Google, OpenAI, Meta

What is the difference between encoder models like BERT and decoder models like GPT?

The short answer

BERT is an encoder-only transformer that reads all tokens bidirectionally and is trained with masked language modelling — ideal for tasks requiring a rich contextual representation of an entire sequence, like classification or NER. GPT is a decoder-only transformer that attends only to previous tokens via a causal mask and is trained with next-token prediction — ideal for text generation. Encoder-decoder models like T5 combine both for tasks that map one sequence to another.

How to think about it

The original transformer architecture has two components: an encoder that builds contextual representations of the input, and a decoder that generates an output sequence token by token. Later model families keep one stack, the other, or both.

Encoder-only: BERT family

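The encoder stack applies full (bidirectional) self-attention: every token can attend to every other token in the sequence. BERT is pretrained on two objectives. The first is masked language modelling (MLM), where about 15% of tokens are selected for prediction; of those, 80% are replaced with [MASK], 10% are replaced with a random token, and 10% are left unchanged, and the model predicts the original token at each selected position. The second is next-sentence prediction (NSP), where the model classifies whether the second of two segments actually follows the first in the source text.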
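The MLM corruption step can be sketched in a few lines. This is a minimal illustration, not BERT's actual preprocessing code: the function name mlm_corrupt, the whitespace tokenisation and the toy vocabulary are assumptions made for the example, and it ignores details such as never masking special tokens. It follows the commonly cited recipe of selecting about 15% of positions and then using [MASK] 80% of the time, a random token 10% of the time, and the unchanged token 10% of the time.

```python
import random

MASK = "[MASK]"

def mlm_corrupt(tokens, vocab, select_prob=0.15, seed=0):
    """Return (corrupted, labels). labels[i] is the original token at
    positions the model must predict, and None everywhere else."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() >= select_prob:
            # Not selected: token passes through, no prediction target.
            corrupted.append(tok)
            labels.append(None)
            continue
        labels.append(tok)  # selected: the model must recover this token
        roll = rng.random()
        if roll < 0.8:
            corrupted.append(MASK)               # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
        else:
            corrupted.append(tok)                # 10%: keep the original token
    return corrupted, labels

tokens = "the cat sat on the mat because it was tired".split()
corrupted, labels = mlm_corrupt(tokens, vocab=sorted(set(tokens)))
print(corrupted)
print(labels)
```

The loss is computed only at positions where the label is not None; because attention is bidirectional, the model can use context on both sides of each selected position.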

The output is a rich, context-aware embedding for each token. This makes encoder models excellent at:

  • Text classification (attach a linear head to the [CLS] embedding)
  • Named entity recognition (token-level classification)
  • Semantic similarity and dense retrieval (bi-encoder setup)
  • Question answering span extraction
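As a rough sketch of the first item above, a classification head is just a linear layer plus softmax applied to the [CLS] embedding. The 4-dimensional embedding, the weights and the function names below are hypothetical toy values chosen for readability; a real BERT-base embedding has 768 dimensions and the head is learned during fine-tuning.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def classify(cls_embedding, weights, bias):
    """Linear head: logits[k] = dot(weights[k], cls_embedding) + bias[k]."""
    logits = [
        sum(w * x for w, x in zip(row, cls_embedding)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)

# Hypothetical [CLS] embedding and a 2-class head (toy numbers, not real model output).
cls_embedding = [0.5, -1.0, 0.25, 2.0]
weights = [
    [1.0, 0.0, 0.0, 0.5],   # class 0
    [0.0, 1.0, 0.0, -0.5],  # class 1
]
bias = [0.0, 0.0]

probs = classify(cls_embedding, weights, bias)
print(probs)
```

Token-level tasks such as NER use the same idea, but apply the head to every token's embedding rather than only to [CLS].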

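Encoder models do not naturally generate text. They are not trained to predict the next token, and without a causal mask each position's representation depends on tokens to its right, so there is no left-to-right distribution to sample from.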

Decoder-only: GPT family

The decoder stack applies causal (left-to-right) self-attention: each token can attend only to itself and to tokens that appear before it in the sequence. This is enforced by setting the attention scores for future positions to negative infinity before the softmax, so those positions receive zero weight. The model is trained purely with next-token prediction.
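The effect of the causal mask is easiest to see on a small example. The sketch below is a simplified illustration in plain Python (the function name attention_weights and the all-zero score matrix are assumptions for the example); it applies a row-wise softmax with and without the mask. Real implementations do the same thing on tensors inside each attention head.

```python
import math

def attention_weights(scores, causal):
    """Row-wise softmax over a square score matrix.
    With causal=True, position i gets -inf for every j > i before the softmax."""
    n = len(scores)
    weights = []
    for i in range(n):
        row = [
            scores[i][j] if (not causal or j <= i) else -math.inf
            for j in range(n)
        ]
        m = max(row)  # the diagonal is never masked, so m is finite
        exps = [math.exp(s - m) for s in row]  # exp(-inf) == 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Equal raw scores everywhere, so any difference comes from the mask alone.
scores = [[0.0] * 4 for _ in range(4)]

full_w = attention_weights(scores, causal=False)
causal_w = attention_weights(scores, causal=True)

for row in causal_w:
    print(row)
# [1.0, 0.0, 0.0, 0.0]
# [0.5, 0.5, 0.0, 0.0]
# [0.3333333333333333, 0.3333333333333333, 0.3333333333333333, 0.0]
# [0.25, 0.25, 0.25, 0.25]
```

Without the mask every row of full_w is uniform at 0.25, which is the bidirectional pattern an encoder uses.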

This architecture is well-suited for:

  • Open-ended text generation
  • Code completion
  • In-context learning (few-shot prompting)
  • Long-form content, dialogue, and instruction following
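Generation with a decoder is a loop: score the next token given the prefix, pick one, append it, and repeat. The sketch below uses greedy selection and a hypothetical lookup table in place of a trained model. The function next_token_scores and its table are invented for illustration, and unlike a real decoder the toy looks only at the last token rather than the whole prefix.

```python
def next_token_scores(prefix):
    """Hypothetical stand-in for a decoder forward pass.
    A real model would condition on the entire prefix, not just the last token."""
    table = {
        "<s>": {"the": 2.0, "a": 1.0},
        "the": {"cat": 1.5, "mat": 1.0},
        "cat": {"sat": 2.0, "</s>": 0.5},
        "sat": {"</s>": 1.0},
    }
    return table.get(prefix[-1], {"</s>": 1.0})

def greedy_generate(prompt, max_new_tokens=10):
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        scores = next_token_scores(tokens)
        best = max(scores, key=scores.get)  # greedy: take the highest-scoring token
        tokens.append(best)
        if best == "</s>":  # stop at the end-of-sequence token
            break
    return tokens

print(greedy_generate(["<s>"]))
# ['<s>', 'the', 'cat', 'sat', '</s>']
```

Few-shot prompting fits the same loop: the examples are simply part of the prompt the model continues from.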

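Modern large language models such as GPT-4, Claude, Gemini and Llama are generally described as decoder-only, although full architectural details are public only for open-weight models like Llama.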

Encoder-decoder: T5, BART

Both stacks are present. The encoder processes the input sequence bidirectionally to build a memory representation; the decoder cross-attends to that memory while generating the output sequence autoregressively. This architecture excels at:

  • Summarisation
  • Translation
  • Structured prediction (text-to-SQL, text-to-code)
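The cross-attention step that links the two stacks can be sketched as follows. This is a simplified single-query illustration: the function name cross_attend and the toy vectors are assumptions, and the learned query, key and value projections of a real model are omitted, so the encoder outputs are used directly as both keys and values.

```python
import math

def cross_attend(query, memory_keys, memory_values):
    """One decoder query attends over all encoder positions.
    No causal mask is applied here: the whole input is visible to the decoder."""
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        for key in memory_keys
    ]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: attention-weighted sum of the value vectors.
    context = [
        sum(w * v[i] for w, v in zip(weights, memory_values))
        for i in range(len(memory_values[0]))
    ]
    return weights, context

# Hypothetical encoder outputs for a 3-token input, and one decoder query.
memory = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]

weights, context = cross_attend(query, memory, memory)
print(weights)
print(context)
```

The decoder's self-attention over its own output is still causal; only the attention into the encoder memory is unmasked.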

Quick comparison

Property          BERT (encoder)                  GPT (decoder)
Attention         Bidirectional                   Causal (left-to-right)
Pretraining       Masked LM                       Next-token prediction
Primary use       Understanding / classification  Generation
Context of token  Full sequence                   Left context only