datarekha

What is multi-head attention and why use multiple heads instead of one?

The short answer

Multi-head attention runs several attention operations in parallel on different learned projections of Q, K, and V, then concatenates the results. Multiple heads let the model jointly attend to information from different representation subspaces and positions, capturing diverse relationships a single head would average away; the per-head dimension is the model dimension divided by the number of heads to keep total compute roughly constant.

How to think about it

Multi-head attention runs several attention operations in parallel on different learned projections of Q, K, and V, then concatenates the results. Multiple heads let the model jointly attend to information from different representation subspaces and positions, capturing diverse relationships a single head would average away; the per-head dimension is the model dimension divided by the number of heads to keep total compute roughly constant.

Learn it properly Multi-head attention

Keep practising

All Deep Learning questions

Explore further

Skip to content