Attention, explained without the matrices

Here is a fact that should feel surprising: the sentence “The trophy did not fit in the bag because it was too big” contains exactly zero ambiguity for a human reader. We know “it” refers to the trophy, not the bag. We know this not by reading the sentence left to right and updating some running mental score, but by holding the whole sentence in mind and resolving the pronoun against the rest of the structure in a single act of comprehension.

For two decades, NLP models could not do this. They read sentences the way someone reads one word at a time through a tiny peephole: sequentially, maintaining a fixed-size memory called a hidden state, losing detail about the beginning of a sentence by the time they reached the end. The training side of the problem had a name, the vanishing gradient: the error signal shrinks as it is propagated back through many time steps, so a sequential model struggles to learn that distant context matters at all. Long short-term memory networks (LSTMs) patched it, somewhat. Attention mechanisms bolted onto RNNs helped more. But the fix was always incremental, always local.

Then, in 2017, a paper called “Attention Is All You Need” proposed removing the sequential processing entirely. No recurrence. No convolution. Just attention — applied globally, in parallel, across the whole sequence at once. The authors called the result a transformer.

Today every serious language model you use is built on that idea. The question worth asking is not “what does a transformer do?” but “why is attention, specifically, the right primitive?”

The problem attention actually solves

Start with the task: given a sequence of words, produce a representation of each word that knows about its context. “Bank” means something different in “river bank” than in “central bank.” A word’s meaning is not a fixed fact — it is negotiated with the surrounding words every time the sentence is read.

An RNN solves this by reading left to right and updating a running summary vector at each step. The hidden state after seeing “central” carries, in compressed form, some memory of every word before it. By the time the model sees “bank,” it has a hidden state that ostensibly encodes “central” — but also everything else that came before, entangled and decayed through many matrix multiplications. If the sentence is short, this works fine. If the sentence is 500 words long and the relevant context is at position 3, that context has been through 497 compression steps. It is likely gone.
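
To make the decay concrete, here is a minimal sketch of a toy recurrent update. It assumes a single scalar hidden state and hand-picked weights (w_h, w_x, and both function names are illustrative choices, not taken from any real model), and it measures how far the final state moves when only the first input is nudged.

```python
import math

def run_rnn(inputs, w_h=0.5, w_x=1.0):
    """Scalar toy RNN: one running summary h, updated once per token."""
    h = 0.0
    for x in inputs:
        h = math.tanh(w_h * h + w_x * x)
    return h

def influence_of_first_token(length, eps=1e-4):
    """Finite-difference sensitivity of the final state to the first input."""
    base = [1.0] + [0.5] * (length - 1)
    nudged = [1.0 + eps] + base[1:]
    return abs(run_rnn(nudged) - run_rnn(base)) / eps

# The longer the sequence, the less the first token can move the final state.
for n in (2, 5, 10):
    print(n, influence_of_first_token(n))
```

Each extra step multiplies the first token's influence by a factor below one, so the printed values shrink quickly as the sequence grows. A real RNN has a vector state and learned weights, but the same compounding is at work.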

Attention says: why do the sequential steps at all? Each word can just look directly at every other word and decide how much each one matters.

A soft dictionary, not a hard one

The cleanest way to understand attention is as a soft dictionary lookup.

A hard dictionary is something like Python’s dict: you hand it a key, it returns the exact matching value, and a key that does not match returns nothing at all (in Python, a KeyError). Zero or one. Binary.

A soft dictionary is different. You hand it a query, a description of what you are looking for. Every entry in the dictionary has a key (a label describing what it contains) and a value (its actual content). Instead of finding the single perfect match, you compute a similarity score between your query and every key. You then take a weighted average of all the values, where a higher similarity score earns a larger weight.

This is precisely what self-attention computes.
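
A soft lookup can be sketched in a few lines. This is an illustration under stated assumptions, not a library API: soft_lookup is a hypothetical helper, the keys and values are made-up 2-d vectors, similarity is a plain dot product, and the scores are turned into weights with a softmax, as self-attention does.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to one."""
    peak = max(scores)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def soft_lookup(query, keys, values):
    """Weighted average of all values, weighted by query-key similarity."""
    weights = softmax([dot(query, k) for k in keys])
    dim = len(values[0])
    blended = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return blended, weights

# Hypothetical entries, chosen only for illustration.
keys = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]

blended, weights = soft_lookup([4.0, 0.0], keys, values)
print(weights)   # largest weight on the first entry, smallest on the second
print(blended)   # a blend of all three values, pulled toward the first
```

Unlike a dict, every entry contributes something to the result. The query only decides how much.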

For each word in the sequence, the model produces three things from that word’s current representation:

A query: what this word is looking for. If you are the word “it” in our trophy sentence, your query is something like “I need to know which nearby noun I refer to.”
A key: what this word is advertising. The word “trophy” advertises “I am a large physical object earlier in this sentence.”
A value: what this word will actually contribute if selected. The value of “trophy” is its full contextual representation — everything the model knows about it.

The mechanism then computes the similarity between the query of “it” and every other word’s key. “Trophy” gets a high score. “Did” gets a low score. “Bag” might get a medium score since it is also a physical object. Those scores are normalized into weights that sum to one (a softmax, which just means: exponentiate each score, then divide by the sum of the exponentials, so every weight is positive and together they form a proper probability distribution). The final output for “it” is the weighted sum of all values, pulled heavily toward “trophy.”

The word “it” exits the attention layer with a new representation that has borrowed heavily from “trophy.” Coreference resolved. No rules. No lookup table. Just geometry.
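
As a worked example of the normalization step, the sketch below applies a softmax to a set of scores for the query of “it.” The scores are invented for illustration (a trained model would compute them from learned projections). Some are negative on purpose, since exponentiating before normalizing is what keeps every weight positive.

```python
import math

tokens = ["The", "trophy", "did", "not", "fit", "in", "the", "bag",
          "because", "it", "was", "too", "big"]

# Hypothetical query-key scores for the query of "it", one per token.
scores = [0.0, 4.0, -1.0, -1.0, 0.5, 0.0, 0.0, 2.5, 0.0, 1.0, 0.0, 0.0, 1.5]

def softmax(values):
    """Exponentiate each score, then divide by the sum of the exponentials."""
    peak = max(values)  # subtracting the max does not change the result
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

weights = softmax(scores)

# Show the three tokens that "it" borrows from most.
ranked = sorted(zip(tokens, weights), key=lambda pair: -pair[1])
for token, weight in ranked[:3]:
    print(f"{token:>8} {weight:.3f}")
```

With these made-up scores, “trophy” takes most of the weight, “bag” comes second, and the function words share what little is left.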

[Figure: Self-attention from the perspective of the word “it.” Edge thickness and weight values reflect how much is borrowed from each word.]

What makes it “self” attention

The prefix “self” distinguishes this from earlier attention mechanisms, which operated between sequences: typically, a decoder attending over the encoder’s representation of the input sequence while producing output tokens. Self-attention operates within a single sequence. Every token attends to every other token in the same sequence.

This is not a small distinction. It means the model can learn, without any explicit linguistic rules, that pronouns tend to refer to nouns, that adjectives modify their neighboring nouns more than nouns three slots away, that the subject of a verb tends to be close to it — and also that it sometimes is not, and that should be fine too. All of this is learnable structure, encoded in the query, key, and value projections (the three linear layers that transform each token’s embedding into its Q, K, and V vectors).
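
The whole layer, projections included, fits in a short sketch. Assumptions to note: the weights here are random rather than learned, matrices are plain lists of rows, and the names (self_attention, random_matrix, d_model, d_head) are illustrative. The division by the square root of the key dimension is the “scaled” part of scaled dot-product attention.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, w_q, w_k, w_v):
    """x holds one row per token. Returns new token rows and the weights."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(k[0]))
    k_transposed = [list(col) for col in zip(*k)]
    scores = matmul(q, k_transposed)          # scores[i][j] = q_i . k_j
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v), weights

rng = random.Random(0)

def random_matrix(rows, cols):
    """Stand-in for learned parameters; a real model trains these."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

n_tokens, d_model, d_head = 5, 8, 4
x = random_matrix(n_tokens, d_model)          # stand-in token embeddings
w_q, w_k, w_v = (random_matrix(d_model, d_head) for _ in range(3))

out, weights = self_attention(x, w_q, w_k, w_v)
print(len(out), len(out[0]))                  # one d_head-sized row per token
```

Everything the layer learns lives in w_q, w_k, and w_v. The rest is fixed arithmetic.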

In practice, a transformer stacks many layers of self-attention, each one producing a new set of token representations from the previous layer’s output. Early layers tend to learn local syntactic patterns (subject-verb agreement, noun phrase boundaries). Later layers tend to learn longer-range semantic relationships (coreference, topic consistency). Nobody programmed these layers to specialize this way — it emerges from training on prediction tasks.

Why this killed RNNs

The RNN versus transformer comparison reduces to two concrete engineering properties.

Parallelism. An RNN processes tokens one at a time. Token 2 cannot be computed until token 1 is done, because it needs the hidden state from step 1. This makes training on long sequences painful: you cannot use the GPU efficiently because most of the computation is waiting for its predecessor. Self-attention, by contrast, computes all pairwise similarities simultaneously. Every token’s query is dotted against every other token’s key in one tensor operation. A sequence of 1,024 tokens involves far more arithmetic than a sequence of 64, but no more sequential steps per layer: the whole sequence is processed in one parallel pass, which is exactly the kind of work a GPU is built for.

Path length. In an RNN, information travels from token 1 to token 1,000 through 999 sequential state updates. Each update compresses and potentially corrupts the signal. In self-attention, any two tokens are exactly one hop apart — the similarity between their query and key is computed directly, with no intermediate steps. The gradient flows directly from output to input without passing through the compression bottleneck. This is why attention handles long-range dependencies so much better: it does not have long-range dependencies in the computational graph. Every dependency is short-range, measured in the attention graph.

The cost is quadratic: for a sequence of length N, attention computes N squared pairs. For N equal to 512, that is 262,144 comparisons. For N equal to 32,768 (a long-context window), that is over a billion. This is why long-context models are expensive and why there is active research into sparse attention, linear attention, and other approximations. But for the sequence lengths that matter in most applications, the quadratic cost is acceptable and the benefits are decisive.
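
The trade can be put in numbers. The sketch below is plain arithmetic with two hypothetical helper names. It counts the query-key pairs one attention head scores in one layer, next to the number of state updates that separate the first and last token in an RNN.

```python
def attention_comparisons(n_tokens):
    """Query-key pairs scored by one attention head in one layer."""
    return n_tokens * n_tokens

def rnn_sequential_steps(n_tokens):
    """State updates between the first token and the last in an RNN."""
    return n_tokens - 1

for n in (64, 512, 1000, 32768):
    print(f"N={n:>6}  attention pairs={attention_comparisons(n):>13,}  "
          f"RNN steps first->last={rnn_sequential_steps(n):>6,}")
```

Attention pays in total work, which grows with the square of the length. The RNN pays in sequential depth, which cannot be parallelized away.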

The industrial consequence

It is worth pausing on what this shift meant in practice. Before 2018, the most capable NLP models were bidirectional LSTMs trained on curated datasets with carefully engineered features. State of the art on the Stanford NLI benchmark was around 88% accuracy. BERT, one of the first transformers pretrained at scale, hit 91.1% on its release and has been superseded many times since, by models whose core mechanism is unchanged from the 2017 paper.

More importantly, the transformer architecture scales. More data and more parameters reliably produce better models, with no architectural changes. RNNs did not scale this cleanly — the sequential bottleneck becomes a harder limit as you add parameters, because the hidden state has to carry more and more information through a fixed-width pipe. Attention has no such bottleneck. You can make the query, key, and value projections arbitrarily large. You can add more attention heads (running the Q-K-V computation in parallel with different projections and concatenating the results). You can stack more layers. Each of these increases capacity, and the empirical scaling laws suggest the returns are predictable.

GPT-3, which appeared in 2020, was roughly 500 times larger than the largest BERT model by parameter count (175 billion parameters against 340 million). GPT-4 and its peers are likely another order of magnitude beyond that. None of them required a new architectural idea. They required more compute applied to the same transformer structure.

[Figure: RNN vs. transformer: sequential state versus parallel pairwise attention. The transformer’s one-hop path between any two tokens is the structural reason long-range dependencies are tractable.]

Multi-head attention: why one dictionary is not enough

A single set of Q-K-V projections gives you one way of deciding what to attend to. But a sentence has many simultaneous relational structures: syntactic dependency, semantic similarity, coreference, temporal order. One attention head will specialize toward whichever of these is most useful for the training objective, and will necessarily deprioritize the others.

Multi-head attention runs the Q-K-V computation several times in parallel, each with different learned projections. The outputs are concatenated and linearly projected back to the original dimension. Head 1 might learn to track syntactic subjects. Head 2 might track coreference chains. Head 3 might attend locally to adjacent tokens. Head 4 might focus on semantic field (food words attending to food words). None of this is programmed — it emerges. Probing studies, which train small classifiers on top of frozen attention heads to predict linguistic properties, have repeatedly shown that individual heads specialize in interpretable ways.
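
A minimal sketch of the multi-head wiring follows. It assumes random stand-in weights and the common choice of head size equal to model dimension divided by number of heads. All names (multi_head_attention, heads, w_out) are illustrative.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention for one head."""
    scale = math.sqrt(len(k[0]))
    scores = matmul(q, [list(col) for col in zip(*k)])
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v)

def multi_head_attention(x, heads, w_out):
    """heads is a list of (w_q, w_k, w_v) triples, one triple per head."""
    per_head = [attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))
                for w_q, w_k, w_v in heads]
    # Concatenate the heads' outputs token by token, then project back.
    concat = [[val for head in per_head for val in head[t]]
              for t in range(len(x))]
    return matmul(concat, w_out)

rng = random.Random(0)

def random_matrix(rows, cols):
    """Stand-in for learned parameters."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

n_tokens, d_model, n_heads = 5, 8, 4
d_head = d_model // n_heads
x = random_matrix(n_tokens, d_model)
heads = [tuple(random_matrix(d_model, d_head) for _ in range(3))
         for _ in range(n_heads)]
w_out = random_matrix(n_heads * d_head, d_model)

out = multi_head_attention(x, heads, w_out)
print(len(out), len(out[0]))   # same shape as the input: n_tokens by d_model
```

Because each head works in a smaller subspace, four heads of size 2 cost about as much as one head of size 8. The model gets several lookups for roughly the price of one.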

The smallest GPT-2 model used 12 attention heads per layer. The largest GPT-3 model uses 96. Some modern large models use on the order of 128. Each head is running a different soft dictionary lookup over the same sequence, and the model learns which lookups are worth running by gradient descent on the prediction task.

Position, and the one thing attention cannot do alone

Self-attention is, by design, permutation equivariant: if you shuffle the words in the input, the output is the same set of representations, just shuffled the same way. The attention weights between “dog” and “bit” are identical whether the sentence is “dog bit man” or “man bit dog.” Word order carries meaning. Attention, without modification, does not capture it.
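
The claim is easy to check numerically. This sketch assumes made-up 3-d embeddings for the three words and leaves out the learned projections (equivalent to identity matrices), which does not affect the property because projections are applied to each token separately.

```python
import math

def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(x):
    """Self-attention with no positional signal and identity projections."""
    scale = math.sqrt(len(x[0]))
    out = []
    for q in x:
        weights = softmax([dot(q, k) / scale for k in x])
        out.append([sum(w * v[i] for w, v in zip(weights, x))
                    for i in range(len(q))])
    return out

# Hypothetical embeddings, chosen only for illustration.
embedding = {"dog": [1.0, 0.0, 2.0], "bit": [0.0, 1.0, -1.0], "man": [0.5, 0.5, 0.0]}

def encode(sentence):
    """Map each word to its attention output for this word order."""
    return dict(zip(sentence, attend([embedding[w] for w in sentence])))

a = encode(["dog", "bit", "man"])
b = encode(["man", "bit", "dog"])

same = all(math.isclose(p, q, abs_tol=1e-9)
           for word in embedding for p, q in zip(a[word], b[word]))
print(same)   # expected to print True: word order changed nothing
```

Both sentences give every word the same representation, which is exactly the problem.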

The solution is positional encoding — adding a position-dependent signal to each token’s embedding before attention. The original transformer used fixed sinusoidal functions of position. Modern models use learned positional embeddings or relative position encodings (where the similarity score between two tokens is adjusted by their distance, rather than encoding absolute position). The point is that position is injected as a signal, not structurally enforced.
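
The fixed sinusoidal scheme is short enough to write out. The sketch follows the formula from the original transformer paper (sine on even dimensions, cosine on odd ones, with a base of 10000). The function names and the even d_model are assumptions of this example.

```python
import math

def sinusoidal_encoding(position, d_model):
    """Position signal for one token; d_model is assumed to be even."""
    encoding = []
    for i in range(0, d_model, 2):
        angle = position / (10000 ** (i / d_model))
        encoding.append(math.sin(angle))   # even dimension
        encoding.append(math.cos(angle))   # odd dimension
    return encoding

def add_position(token_embedding, position):
    """Inject position by adding the signal to the token's embedding."""
    signal = sinusoidal_encoding(position, len(token_embedding))
    return [e + p for e, p in zip(token_embedding, signal)]

same_word = [0.2, -0.1, 0.4, 0.0]          # hypothetical embedding
print(add_position(same_word, 0))
print(add_position(same_word, 3))          # same word, different position
```

The same word at two positions now enters attention as two different vectors. Position has become one more signal in the input, and nothing in the layer itself has changed.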

This is actually a feature. The model learns which positional relationships matter, rather than having them hardcoded. In some domains (code, mathematical proofs) position carries a lot of signal. In others (bags of keywords, retrieval settings) it carries less. The same architecture handles both, adjusting its positional sensitivity through training.

The invariant

Strip away the implementation details — the scaled dot product, the softmax temperature, the multi-head projection — and the invariant is simple. Self-attention is a mechanism that lets every part of a sequence directly interrogate every other part, producing a new representation for each element that is informed by the whole. The queries ask questions. The keys advertise answers. The values deliver content. The weights decide how much to listen.

Every significant language model released since 2018 (BERT, RoBERTa, T5, GPT-2, GPT-3, GPT-4, LLaMA, Claude, Gemini) is, at its core, a stack of this operation. The differences between them are in depth (number of layers), width (embedding dimension), training data, and objective. The fundamental operation is the same one described in a few pages by Vaswani et al. in 2017.

What is genuinely remarkable is that a mechanism this conceptually simple — look at everything, decide how much each thing matters, mix proportionally — turned out to be powerful enough to handle language, code, images, protein sequences, and multimodal input, with performance that improves predictably as you scale the compute. It did not turn out to be a stepping stone to something more sophisticated. It turned out to be the thing itself.

The trophy, it turns out, was too big. The bag, in this analogy, is every architecture that came before.