Deep Learning Medium
What is the difference between encoder-only, decoder-only, and encoder-decoder transformer architectures?
The short answer
Encoder-only models like BERT use bidirectional attention and are best for understanding and classification; decoder-only models like GPT use causal masked attention for autoregressive generation; encoder-decoder models like T5 encode an input then attend to it while decoding, suiting sequence-to-sequence tasks like translation.
How to think about it
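The three families differ in what each token is allowed to attend to. In an encoder-only model such as BERT, self-attention is bidirectional: every token sees the tokens on both sides of it, so each output vector summarizes the full context. That suits understanding tasks such as classification, but the model has no built-in left-to-right generation procedure. In a decoder-only model such as GPT, a causal mask restricts each token to itself and earlier positions, which lets the model be trained on next-token prediction and then generate autoregressively, one token at a time. An encoder-decoder model such as T5 combines the two: a bidirectional encoder reads the whole input, and a causally masked decoder generates the output while cross-attending to the encoder's representations at each step. That makes it a natural fit for sequence-to-sequence tasks like translation, where the input is fully known in advance and the output is a separate sequence.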
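The attention patterns described above can be made concrete as boolean masks. The following is a minimal sketch in plain Python, not code from any particular library: the function names (bidirectional_mask, causal_mask, cross_attention_mask, render) are hypothetical and chosen for illustration. In each mask, row i is the position doing the attending and column j is the position being attended to.

```python
def bidirectional_mask(n):
    """Encoder-style self-attention: every position may attend to every position."""
    return [[True] * n for _ in range(n)]


def causal_mask(n):
    """Decoder-style self-attention: position i may attend only to positions j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def cross_attention_mask(target_len, source_len):
    """Encoder-decoder cross-attention: each target position sees the whole encoded source."""
    return [[True] * source_len for _ in range(target_len)]


def render(mask):
    """Draw a mask as text: '1' where attention is allowed, '.' where it is blocked."""
    return "\n".join(
        " ".join("1" if allowed else "." for allowed in row) for row in mask
    )


if __name__ == "__main__":
    print("Encoder-only (bidirectional):")
    print(render(bidirectional_mask(4)))
    print("Decoder-only (causal):")
    print(render(causal_mask(4)))
    print("Encoder-decoder cross-attention (3 target tokens, 5 source tokens):")
    print(render(cross_attention_mask(3, 5)))
```

The causal mask prints as a lower triangle, which is what keeps a decoder from looking at tokens it has not generated yet. An encoder-decoder model uses all three patterns at once: the bidirectional mask inside the encoder, the causal mask inside the decoder's self-attention, and the full rectangular mask for cross-attention. Real implementations also mask padding tokens, which this sketch leaves out.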