Deep Learning Medium Asked at GoogleAsked at OpenAIAsked at MetaAsked at Anthropic

What does self-attention actually compute, and why is it useful?

For ML Engineer AI / LLM Engineer Data Scientist

The short answer

Self-attention lets every position in a sequence directly query every other position, producing a weighted blend of value vectors where the weights reflect learned pairwise relevance. This gives the model a constant-depth path between any two tokens regardless of how far apart they are, which is what enables transformers to capture long-range dependencies that RNNs miss.

How to think about it

Given an input sequence of n tokens represented as a matrix X ∈ ℝ^{n×d}, self-attention projects each token into three vectors using learned weight matrices:

Query: Q = X W_Q
Key: K = X W_K
Value: V = X W_V

The output is:

Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V

Step by step:

Q K^T produces an n × n score matrix — every query dotted against every key.
Dividing by sqrt(d_k) keeps the dot products from growing large enough to push softmax into its flat saturation region.
Softmax converts scores to weights that sum to 1 per row.
Multiplying by V blends the value vectors using those weights.

Each output token is therefore a context-aware mixture of all input token representations, weighted by how relevant each token’s key is to the current token’s query.

Scaled dot-product attention: X is projected to Q, K, V; scores are normalised and used to blend V.

Learn it properly Self-attention

What does self-attention actually compute, and why is it useful?

Keep practising

Explore further