What is multi-head attention and why use multiple heads instead of one?
Multi-head attention runs several attention operations in parallel on different learned projections of Q, K, and V, then concatenates the results. Multiple heads let the model jointly attend to information from different representation subspaces and positions, capturing diverse relationships a single head would average away; the per-head dimension is the model dimension divided by the number of heads to keep total compute roughly constant.
How to think about it
Multi-head attention runs several attention operations in parallel on different learned projections of Q, K, and V, then concatenates the results. Multiple heads let the model jointly attend to information from different representation subspaces and positions, capturing diverse relationships a single head would average away; the per-head dimension is the model dimension divided by the number of heads to keep total compute roughly constant.