What is multi-head attention and why use multiple heads instead of one?

For ML Engineer research-engineer AI / LLM Engineer

The short answer

Multi-head attention runs several attention operations in parallel on different learned projections of Q, K, and V, then concatenates the results. Multiple heads let the model jointly attend to information from different representation subspaces and positions, capturing diverse relationships a single head would average away; the per-head dimension is the model dimension divided by the number of heads to keep total compute roughly constant.

How to think about it

Multi-head attention runs several attention operations in parallel on different learned projections of Q, K, and V, then concatenates the results. Multiple heads let the model jointly attend to information from different representation subspaces and positions, capturing diverse relationships a single head would average away; the per-head dimension is the model dimension divided by the number of heads to keep total compute roughly constant.

Learn it properly Multi-head attention

Keep practising

Why use multiple attention heads instead of one large attention operation? What does self-attention actually compute, and why is it useful? Explain self-attention and the roles of the Query, Key, and Value vectors. Walk me through the transformer encoder architecture block by block. What do the query, key, and value vectors represent in attention?

All Deep Learning questions

Explore further

Self-attention Sparse & sub-quadratic attention Attention (the RNN era)

Multi-Head Attention Transformer Self-Attention Vision-Language Model (VLM)