datarekha

What is the difference between encoder-only, decoder-only, and encoder-decoder transformer architectures?

The short answer

Encoder-only models like BERT use bidirectional attention and are best for understanding and classification; decoder-only models like GPT use causal masked attention for autoregressive generation; encoder-decoder models like T5 encode an input then attend to it while decoding, suiting sequence-to-sequence tasks like translation.

How to think about it

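The three families differ mainly in which positions each token is allowed to attend to.

Encoder-only (e.g. BERT): every token attends to every other token in the input, to its left and its right. The result is a contextual representation of each token, which suits understanding tasks such as classification.

Decoder-only (e.g. GPT): a causal mask restricts each token to itself and earlier positions, so the model can generate text autoregressively, one token at a time.

Encoder-decoder (e.g. T5): an encoder reads the whole input bidirectionally, and a decoder generates the output causally while attending to the encoder's output through cross-attention. This suits sequence-to-sequence tasks such as translation.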
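The distinction can be sketched in terms of attention masks. The following is a minimal illustration in plain Python, not any library's implementation: the function names (`bidirectional_mask`, `causal_mask`, `cross_attention_mask`, `masked_softmax`) are hypothetical, and the convention that 1 means "may attend" and 0 means "blocked" is an assumption made for this example.

```python
import math


def bidirectional_mask(n):
    """Encoder-style self-attention: every position may attend to every position."""
    return [[1] * n for _ in range(n)]


def causal_mask(n):
    """Decoder-style self-attention: position i may attend only to positions j <= i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def cross_attention_mask(target_len, source_len):
    """Encoder-decoder cross-attention: each decoder position may attend
    to every encoder output position."""
    return [[1] * source_len for _ in range(target_len)]


def masked_softmax(scores, mask_row):
    """Softmax over one row of attention scores, giving zero weight to
    blocked positions. Assumes at least one position in the row is allowed."""
    exps = [math.exp(s) if m else 0.0 for s, m in zip(scores, mask_row)]
    total = sum(exps)
    return [e / total for e in exps]


def show(name, mask):
    """Print a mask as rows of 0s and 1s (rows are queries, columns are keys)."""
    print(name)
    for row in mask:
        print(" ".join(str(v) for v in row))


if __name__ == "__main__":
    show("encoder-only (bidirectional)", bidirectional_mask(4))
    show("decoder-only (causal)", causal_mask(4))
    show("encoder-decoder (cross-attention, 3 target x 4 source)",
         cross_attention_mask(3, 4))
    # With equal scores, the second token under a causal mask splits its
    # attention between itself and the first token and ignores later tokens.
    print(masked_softmax([0.0, 0.0, 0.0, 0.0], causal_mask(4)[1]))
```

In this sketch an encoder-decoder model would use both kinds of mask in its decoder: the causal mask for self-attention over the output generated so far, and the all-ones cross-attention mask over the encoder's output.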
