Explain self-attention and the roles of the Query, Key, and Value vectors.

For: ML Engineer, Research Engineer, AI / LLM Engineer

The short answer

Self-attention lets each token build a context-aware representation by attending to every token in the sequence, itself included: it scores its Query against all Keys with a dot product, normalizes the scores with softmax, and takes the resulting weighted sum of the Values. Q, K, and V are learned linear projections of the input that represent, respectively, what a token is looking for, what it offers to be matched against, and the content it contributes.
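The three steps in the answer above (score, normalize, weighted sum) can be sketched in a few lines of plain Python. This is a minimal single-head illustration, not a production implementation: it has no batching, masking, or multiple heads. The helper names (matmul, softmax, self_attention), the toy input X, and the identity projection matrices are all assumptions chosen to keep the arithmetic easy to follow. In a real model the projections are learned. The division by sqrt(d_k) is the standard scaled dot-product form, which the short answer leaves implicit.

```python
import math

def matmul(A, B):
    # Plain-list matrix product: (n x m) @ (m x p) -> (n x p).
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    # Subtract the row max before exponentiating for numerical stability.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over one sequence."""
    Q = matmul(X, W_q)  # what each token is looking for
    K = matmul(X, W_k)  # what each token offers to be matched against
    V = matmul(X, W_v)  # the content each token contributes
    d_k = len(Q[0])
    K_T = [list(col) for col in zip(*K)]
    # Score every Query against every Key, scaled by sqrt(d_k).
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(Q, K_T)]
    # Normalize each token's scores into weights that sum to 1.
    weights = [softmax(row) for row in scores]
    # Each output row is a weighted sum of the Value rows.
    return matmul(weights, V), weights

# Hypothetical toy input: 3 tokens with 2 features each.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Identity projections stand in for learned weight matrices (assumption).
identity = [[1.0, 0.0], [0.0, 1.0]]

out, weights = self_attention(X, identity, identity, identity)
for token_weights in weights:
    print([round(w, 3) for w in token_weights])
```

Each printed row is one token's attention distribution over all three tokens, including itself. Because the first token's Query has the same dot product with the first and third Keys, it weights those two tokens equally and the second token less.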



