datarekha
Deep Learning Medium Asked at GoogleAsked at OpenAIAsked at MetaAsked at Anthropic

What does self-attention actually compute, and why is it useful?

The short answer

Self-attention lets every position in a sequence directly query every other position, producing a weighted blend of value vectors where the weights reflect learned pairwise relevance. This gives the model a constant-depth path between any two tokens regardless of how far apart they are, which is what enables transformers to capture long-range dependencies that RNNs miss.

How to think about it

Given an input sequence of n tokens represented as a matrix X ∈ ℝ^{n×d}, self-attention projects each token into three vectors using learned weight matrices:

  • Query: Q = X W_Q
  • Key: K = X W_K
  • Value: V = X W_V

The output is:

Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V

Step by step:

  1. Q K^T produces an n × n score matrix — every query dotted against every key.
  2. Dividing by sqrt(d_k) keeps the dot products from growing large enough to push softmax into its flat saturation region.
  3. Softmax converts scores to weights that sum to 1 per row.
  4. Multiplying by V blends the value vectors using those weights.

Each output token is therefore a context-aware mixture of all input token representations, weighted by how relevant each token’s key is to the current token’s query.

Input XQ = X W_QK = X W_KV = X W_VQK^T / sqrt(d_k)softmax → V
Scaled dot-product attention: X is projected to Q, K, V; scores are normalised and used to blend V.
Learn it properly Self-attention

Keep practising

All Deep Learning questions

Explore further

Skip to content