NLP & LLMs | Medium | Asked at Google, OpenAI, Meta

What is the difference between encoder models like BERT and decoder models like GPT?

For: AI / LLM Engineer, ML Engineer, Data Scientist

The short answer

BERT is an encoder-only transformer that reads all tokens bidirectionally and is trained with masked language modelling — ideal for tasks requiring a rich contextual representation of an entire sequence, like classification or NER. GPT is a decoder-only transformer that attends only to previous tokens via a causal mask and is trained with next-token prediction — ideal for text generation. Encoder-decoder models like T5 combine both for tasks that map one sequence to another.

How to think about it

The original transformer architecture has two components: an encoder that builds contextual representations of the input, and a decoder that generates a sequence token by token. Different model families keep one stack or both.

Encoder-only: BERT family

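The encoder stack applies full (bidirectional) self-attention: every token can attend to every other token in the sequence. BERT is pretrained on two objectives. In masked language modelling (MLM), about 15% of tokens are selected for prediction; of these, 80% are replaced with [MASK], 10% are replaced with a random token, and 10% are left unchanged, and the model predicts the original token at each selected position. In next-sentence prediction (NSP), the model classifies whether the second of two segments actually follows the first in the source text.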
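The MLM corruption step can be sketched in a few lines of plain Python. This is a minimal illustration, not BERT's actual preprocessing code: the function name `mask_tokens`, the toy vocabulary, and the per-token coin flip are assumptions made for the example (the real pipeline works on subword IDs and skips special tokens such as [CLS] and [SEP]).

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, select_prob=0.15, seed=0):
    """Corrupt a token list for masked language modelling (illustrative sketch).

    Returns (corrupted, labels). labels[i] is the original token at positions
    chosen for prediction and None elsewhere, so a loss would only be
    computed where labels[i] is not None.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < select_prob:
            labels.append(tok)                       # model must recover this token
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(MASK)               # 80%: replace with [MASK]
            elif roll < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                corrupted.append(tok)                # 10%: leave unchanged
        else:
            corrupted.append(tok)                    # not selected: input passes through
            labels.append(None)
    return corrupted, labels

tokens = "the cat sat on the mat".split()
corrupted, labels = mask_tokens(tokens, vocab=["dog", "ran", "rug"], seed=3)
```

Because the model sees tokens on both sides of each selected position, it can use the full sentence to recover the original token, which is what makes the resulting representations bidirectional.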

The output is a rich, context-aware embedding for each token. This makes encoder models excellent at:

Text classification (attach a linear head to the [CLS] embedding)
Named entity recognition (token-level classification)
Semantic similarity and dense retrieval (bi-encoder setup)
Question answering span extraction
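As a rough sketch of the first item, a classification head is just a linear layer plus softmax applied to the first token's embedding. The function name `classify_from_cls`, the two-dimensional embeddings, and the weights below are all made up for illustration; in practice the embeddings come from the encoder and the head is learned during fine-tuning.

```python
import math

def classify_from_cls(token_embeddings, weights, bias):
    """Apply a linear head + softmax to the [CLS] embedding (illustrative sketch).

    token_embeddings: one vector per token, with [CLS] assumed at index 0.
    weights: one row of coefficients per class; bias: one value per class.
    """
    cls_vec = token_embeddings[0]
    logits = [
        sum(w * x for w, x in zip(row, cls_vec)) + b
        for row, b in zip(weights, bias)
    ]
    # Numerically stable softmax over the class logits.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical encoder output for "[CLS] great film" and a 2-class head.
embeddings = [[1.0, 0.0], [0.5, 0.5], [0.2, 0.9]]
probs = classify_from_cls(embeddings, weights=[[2.0, 0.0], [0.0, 2.0]], bias=[0.0, 0.0])
```

Token-level tasks such as NER use the same idea, but apply the head to every token's embedding instead of only the first.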

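Encoder models do not naturally generate text. They are trained to fill in masked positions using context from both sides, not to predict the next token from a prefix, so there is no built-in left-to-right decoding procedure.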

Decoder-only: GPT family

The decoder stack applies causal (left-to-right) self-attention: each token can attend only to itself and to tokens that appear before it in the sequence. This is enforced by setting the attention scores of future positions to negative infinity before the softmax, so those positions receive zero weight. The model is trained purely with next-token prediction.
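The difference between the two attention patterns comes down to one mask. The sketch below is a minimal pure-Python illustration under simplifying assumptions: the raw scores are given directly rather than computed from query and key projections, and `attention_weights` is a name chosen for this example.

```python
import math

def softmax(row):
    """Numerically stable softmax; exp(-inf) evaluates to 0.0."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(scores, causal):
    """Turn raw scores into attention weights, row by row.

    scores[i][j] is the score of query position i against key position j.
    With causal=True, every j > i is set to -inf before the softmax,
    so position i cannot look at later positions.
    """
    n = len(scores)
    weights = []
    for i in range(n):
        row = [
            scores[i][j] if (not causal or j <= i) else float("-inf")
            for j in range(n)
        ]
        weights.append(softmax(row))
    return weights

scores = [[0.0] * 3 for _ in range(3)]      # uniform scores for clarity
bidirectional = attention_weights(scores, causal=False)
causal = attention_weights(scores, causal=True)
# causal[0] == [1.0, 0.0, 0.0] and causal[1] == [0.5, 0.5, 0.0]
```

With uniform scores, the bidirectional case spreads weight evenly over all three positions, while the causal case gives each row weight only on itself and earlier positions.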

This architecture is well-suited for:

Open-ended text generation
Code completion
In-context learning (few-shot prompting)
Long-form content, dialogue, and instruction following
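All of these uses rest on the same autoregressive loop: score the next token given everything so far, append one, repeat. The sketch below uses greedy decoding and a hypothetical bigram lookup table (`BIGRAMS`, `toy_model`) as a stand-in for a real decoder; real systems condition on the whole prefix and usually sample rather than always taking the top token.

```python
def greedy_generate(next_token_scores, prompt, max_new_tokens, eos="<eos>"):
    """Generate tokens one at a time, always picking the highest-scoring one.

    next_token_scores: callable mapping the tokens so far to a
    {token: score} dict. It only ever sees the prefix, never future tokens.
    """
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        scores = next_token_scores(tokens)
        best = max(scores, key=scores.get)
        tokens.append(best)
        if best == eos:
            break
    return tokens

# Hypothetical stand-in for a trained decoder: looks only at the last token.
BIGRAMS = {
    "the": {"cat": 0.6, "mat": 0.4},
    "cat": {"sat": 0.9, "<eos>": 0.1},
    "sat": {"<eos>": 1.0},
}

def toy_model(tokens):
    return BIGRAMS.get(tokens[-1], {"<eos>": 1.0})

print(greedy_generate(toy_model, ["the"], max_new_tokens=10))
# prints ['the', 'cat', 'sat', '<eos>']
```

Few-shot prompting fits the same loop: the examples are simply part of the prompt that the model conditions on.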

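Most modern large language models are decoder-only. This is documented for open models such as Llama, and GPT-4, Claude, and Gemini are generally understood to follow the same design, although their full architectures have not been published.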

Encoder-decoder: T5, BART

Both stacks are present. The encoder processes the input sequence bidirectionally to build a memory representation; the decoder cross-attends to that memory while generating the output sequence autoregressively. This architecture excels at:

Summarisation
Translation
Structured prediction (text-to-SQL, text-to-code)
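The link between the two stacks is cross-attention: at each decoding step, the decoder's current state acts as the query, and the encoder's outputs act as keys and values. The sketch below is a simplified single-head version with no learned projections; `cross_attend` and the toy vectors are assumptions made for illustration.

```python
import math

def cross_attend(decoder_state, encoder_memory):
    """One decoder query attends over every encoder position (no mask needed).

    decoder_state: vector for the token currently being generated.
    encoder_memory: one vector per input token, produced by the encoder.
    Returns (weights, context), where context is the weighted sum of memory.
    """
    d = len(decoder_state)
    # Scaled dot-product score of the query against each encoder position.
    scores = [
        sum(q * k for q, k in zip(decoder_state, mem)) / math.sqrt(d)
        for mem in encoder_memory
    ]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    context = [
        sum(w * mem[i] for w, mem in zip(weights, encoder_memory))
        for i in range(d)
    ]
    return weights, context

memory = [[1.0, 0.0], [0.0, 1.0]]            # hypothetical encoder outputs
weights, context = cross_attend([10.0, 0.0], memory)
# The query aligns with the first memory vector, so weights[0] > weights[1].
```

The encoder side is unmasked because the whole input is available up front; only the decoder's self-attention over its own output needs the causal mask.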

Quick comparison

Property	BERT (encoder)	GPT (decoder)
Attention	Bidirectional	Causal (left-to-right)
Pretraining	Masked LM	Next-token prediction
Primary use	Understanding / classification	Generation
Context visible to each token	Full sequence	Left context only (itself and earlier tokens)
