What is the difference between encoder-only, decoder-only, and encoder-decoder transformer architectures?

For: ML Engineer, AI / LLM Engineer, Research Engineer

The short answer

Encoder-only models like BERT use bidirectional self-attention and are best for understanding tasks such as classification. Decoder-only models like GPT use causally masked self-attention, so each token sees only earlier tokens, which suits autoregressive generation. Encoder-decoder models like T5 encode the input with a bidirectional encoder, then a causally masked decoder attends to that encoding while it generates, which suits sequence-to-sequence tasks like translation.

How to think about it

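The three families differ mainly in which positions each token is allowed to attend to.

Encoder-only (BERT): every token attends to every other token, to its left and to its right. Each position ends up with a representation conditioned on the whole input, which suits classification, tagging, and other understanding tasks. The model is not designed to produce text one token at a time.

Decoder-only (GPT): a causal mask restricts each token to itself and earlier positions. The model is trained to predict the next token, so generation is autoregressive: each new token is appended to the sequence and fed back in.

Encoder-decoder (T5): a bidirectional encoder reads the whole input, and a causally masked decoder generates the output while cross-attending to the encoder's output in each decoder layer. That fits sequence-to-sequence tasks like translation, where the input and the output are separate sequences.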
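The attention patterns described here can be sketched as boolean masks. This is a minimal pure-Python illustration, not how BERT, GPT, or T5 are actually implemented; the function names (`bidirectional_mask`, `causal_mask`, `cross_attention_mask`, `masked_softmax`) are hypothetical, and the raw scores are set to zero so that only the effect of the mask shows up in the weights.

```python
import math

def bidirectional_mask(n):
    """Encoder-style self-attention: every position may see every position."""
    return [[True] * n for _ in range(n)]

def causal_mask(n):
    """Decoder-style self-attention: position i may see positions 0..i only."""
    return [[j <= i for j in range(n)] for i in range(n)]

def cross_attention_mask(tgt_len, src_len):
    """Encoder-decoder cross-attention: each target position sees all source positions."""
    return [[True] * src_len for _ in range(tgt_len)]

def masked_softmax(scores, mask):
    """Row-wise softmax over raw scores; masked-out entries get weight 0."""
    weights = []
    for score_row, mask_row in zip(scores, mask):
        visible = [s for s, m in zip(score_row, mask_row) if m]
        top = max(visible)  # subtract the row max for numerical stability
        exps = [math.exp(s - top) if m else 0.0
                for s, m in zip(score_row, mask_row)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Uniform (all-zero) raw scores, an assumption made purely for readability.
n = 4
scores = [[0.0] * n for _ in range(n)]
encoder_weights = masked_softmax(scores, bidirectional_mask(n))
decoder_weights = masked_softmax(scores, causal_mask(n))

for row in decoder_weights:
    print(row)
```

With uniform scores, every row of `encoder_weights` spreads attention evenly over all four positions, while row i of `decoder_weights` spreads it only over positions 0 through i. In an encoder-decoder model, the decoder would combine a causal mask over its own tokens with an all-visible cross-attention mask over the encoder output.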

