What do the query, key, and value vectors represent in attention?
The query represents what a token is looking for, the key represents what a token is advertising about itself, and the value is the content it contributes if selected. Attention scores measure query-key compatibility, and the output is a soft retrieval: a weighted sum of values where the weights come from those compatibility scores.
How to think about it
The query/key/value abstraction comes from the idea of a differentiable associative lookup:
- Query (Q): “What kind of information do I need?” — projected from the current token.
- Key (K): “What kind of information do I provide?” — projected from every token in the sequence.
- Value (V): “What is the actual content I provide?” — also projected from every token.
The dot product Q_i · K_j measures how well token j can answer the information need of token i. After dividing by sqrt(d_k) and applying a softmax over all positions j, this score becomes a weight:
a_{ij} = softmax_j(Q_i K_j^T / sqrt(d_k))
output_i = sum_j a_{ij} V_j
Think of it as a soft database query: instead of retrieving exactly one matching row (hard lookup), attention blends all rows weighted by their relevance to the query.
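The contrast between a hard and a soft lookup can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper names (soft_lookup, hard_lookup) and the toy keys, values, and query are hypothetical, and single-query, list-based code is used so the numbers stay easy to inspect.

```python
import math

def dot(a, b):
    # Plain dot product of two equal-length vectors.
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    # Subtract the max before exponentiating for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def soft_lookup(query, keys, values):
    """Blend every value, weighted by the softmax of scaled query-key scores."""
    d_k = len(query)
    scores = [dot(query, k) / math.sqrt(d_k) for k in keys]
    weights = softmax(scores)
    d_v = len(values[0])
    output = [sum(w * v[c] for w, v in zip(weights, values)) for c in range(d_v)]
    return weights, output

def hard_lookup(query, keys, values):
    """Return only the value whose key scores highest (an argmax)."""
    scores = [dot(query, k) for k in keys]
    best = max(range(len(keys)), key=lambda j: scores[j])
    return values[best]

# Hypothetical toy "table": three rows, each with a 2-d key and a 2-d value.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [2.0, -1.0]  # matches the first key best

weights, blended = soft_lookup(query, keys, values)
picked = hard_lookup(query, keys, values)

print("weights:", weights)   # non-negative, sums to 1, largest for row 0
print("blended:", blended)   # a mix of all three value rows
print("picked: ", picked)    # exactly the first value row
```

The hard lookup returns one row and ignores the rest, while the soft lookup leans toward the best-matching row but still mixes in the others. Because the weights vary smoothly with the scores, the soft version can be trained with gradients.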
Why three separate projections instead of one?
If Q = K = V = X (the raw input), the model has no freedom to learn separate “what I need” versus “what I offer” representations. Having three independent weight matrices W_Q, W_K, W_V lets the model learn a different projection for each role, so that Q = X W_Q, K = X W_K, and V = X W_V can each live in a subspace suited to its job. In practice the learned Q and K spaces often differ substantially from V.
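One concrete cost of sharing representations: when Q = K = X, the raw score matrix X X^T is symmetric, so token i scores token j exactly as j scores i. The sketch below checks this on a tiny example and shows that separate projections remove the constraint. The input X and the matrices W_Q and W_K are hand-picked, hypothetical stand-ins for learned weights, and the matmul and transpose helpers exist only to keep the example dependency-free.

```python
def matmul(a, b):
    # Naive matrix product of two lists-of-rows.
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(m):
    return [list(row) for row in zip(*m)]

# Hypothetical input: 3 tokens with 2 features each.
X = [[1.0, 0.0],
     [0.0, 2.0],
     [1.0, 1.0]]

# Case 1: Q = K = X, so the scores are X X^T.
shared_scores = matmul(X, transpose(X))

# Case 2: separate projections (illustrative values, not learned).
W_Q = [[1.0, 0.0],
       [0.0, 1.0]]
W_K = [[1.0, 1.0],
       [0.0, 1.0]]
Q = matmul(X, W_Q)
K = matmul(X, W_K)
projected_scores = matmul(Q, transpose(K))

print(shared_scores == transpose(shared_scores))        # True: symmetric
print(projected_scores == transpose(projected_scores))  # False: asymmetric
print(projected_scores[0][1], projected_scores[1][0])   # 0.0 2.0
```

With the separate projections, token 1 scores token 0 at 2.0 while token 0 scores token 1 at 0.0, a directional relationship that the shared case cannot express.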
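import torch
import torch.nn.functional as F

n, d_k, d_v = 5, 64, 64  # sequence length, query/key dim, value dim (example sizes)
Q, K, V = torch.randn(n, d_k), torch.randn(n, d_k), torch.randn(n, d_v)  # random stand-ins for X @ W_Q, X @ W_K, X @ W_V

scores = (Q @ K.transpose(-2, -1)) / (d_k ** 0.5)  # (n, n) scaled query-key compatibility
weights = F.softmax(scores, dim=-1)  # each row sums to 1
output = weights @ V  # (n, d_v) weighted sum of values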