Explain self-attention and the roles of the Query, Key, and Value vectors.
Self-attention lets each token build a representation by attending to every other token: it scores its Query against all Keys, normalizes the scores with softmax, and takes a weighted sum of the Values. Q, K, and V are learned linear projections of the input that respectively represent what a token is looking for, what it offers as a match key, and the content it contributes.
How to think about it
Self-attention lets each token build a representation by attending to every other token: it scores its Query against all Keys, normalizes the scores with softmax, and takes a weighted sum of the Values. Q, K, and V are learned linear projections of the input that respectively represent what a token is looking for, what it offers as a match key, and the content it contributes.