What does self-attention actually compute, and why is it useful?
Self-attention lets every position in a sequence directly query every other position, producing a weighted blend of value vectors where the weights reflect learned pairwise relevance. This gives the model a constant-depth path between any two tokens regardless of how far apart they are, which helps transformers capture long-range dependencies that RNNs often lose, since an RNN must relay information through its recurrent state one step at a time.
How to think about it
Given an input sequence of n tokens represented as a matrix X ∈ ℝ^{n×d}, self-attention projects each token into three vectors using learned weight matrices W_Q, W_K, and W_V, where queries and keys share a dimension d_k:
- Query: Q = X W_Q
- Key: K = X W_K
- Value: V = X W_V
The output is:
Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
Step by step:
- Q K^T produces an n × n score matrix: every query dotted against every key.
- Dividing by sqrt(d_k) keeps the dot products from growing large enough to push softmax into its flat saturation region.
- Softmax converts scores to weights that sum to 1 per row.
- Multiplying by V blends the value vectors using those weights.
Each output token is therefore a context-aware mixture of all input token representations, weighted by how relevant each token’s key is to the current token’s query.
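The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: it assumes a single head, no masking, and no batching, and it uses random matrices in place of learned weights. The function names `softmax` and `self_attention` and the sizes `n`, `d`, and `d_k` are chosen here for illustration only.

```python
import numpy as np

def softmax(scores, axis=-1):
    # Subtract the per-row max before exponentiating for numerical stability.
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    # Project each token into query, key, and value vectors.
    Q = X @ W_Q
    K = X @ W_K
    V = X @ W_V
    d_k = K.shape[-1]
    # n x n score matrix: every query dotted against every key, then scaled.
    scores = Q @ K.T / np.sqrt(d_k)
    # Each row of weights sums to 1.
    weights = softmax(scores, axis=-1)
    # Blend the value vectors using those weights.
    return weights @ V, weights

# Illustrative sizes; random weights stand in for learned ones (assumption).
rng = np.random.default_rng(0)
n, d, d_k = 4, 8, 8
X = rng.normal(size=(n, d))
W_Q = rng.normal(size=(d, d_k))
W_K = rng.normal(size=(d, d_k))
W_V = rng.normal(size=(d, d_k))

output, weights = self_attention(X, W_Q, W_K, W_V)
print(output.shape)   # (4, 8): one output vector per input token
print(weights.shape)  # (4, 4): one weight per (query, key) pair
```

Row i of `weights` shows how much token i draws from every other token, and row i of `output` is the corresponding mixture of value vectors.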