Why does a transformer need positional encoding?
Self-attention computes a weighted sum over value vectors where the weights depend only on dot products between queries and keys — there is no notion of position in this operation. Without an explicit positional signal injected into the token embeddings, the model cannot distinguish 'the dog bit the man' from 'the man bit the dog'.
How to think about it
Self-attention is permutation-equivariant: shuffle the input tokens and the output shuffles in exactly the same way. No order information survives the softmax weighting. RNNs get order for free because step t literally receives h_{t-1}; transformers process all positions in parallel and have no such inductive bias.
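The equivariance claim can be checked numerically. The sketch below is a minimal single-head self-attention in plain Python. It assumes identity projections (Q = K = V = X) purely to keep the example short; the function name `self_attention` and the toy sizes are illustrative choices, not taken from any particular library.

```python
import math
import random

def softmax(row):
    # Subtract the max for numerical stability before exponentiating.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Single-head self-attention, assuming identity projections (Q = K = V = X)."""
    d = len(X[0])
    out = []
    for q in X:
        # Scaled dot product between this query and every key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        weights = softmax(scores)
        # Weighted sum over value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, X)) for j in range(d)])
    return out

random.seed(0)
X = [[random.gauss(0, 1) for _ in range(4)] for _ in range(5)]  # 5 tokens, d = 4
perm = [3, 0, 4, 1, 2]  # an arbitrary shuffle of the 5 token positions

out = self_attention(X)
out_perm = self_attention([X[p] for p in perm])

# Row i of the shuffled output should equal row perm[i] of the original output.
max_diff = max(
    abs(a - b)
    for i, p in enumerate(perm)
    for a, b in zip(out_perm[i], out[p])
)
print(max_diff < 1e-9)
```

The same holds with learned projection matrices, since they are applied to each token independently of its position.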
The original Transformer (Vaswani et al., 2017) used fixed sinusoidal encodings, added to the input embeddings before the first layer:
PE(pos, 2i) = sin(pos / 10000^{2i/d_model})
PE(pos, 2i+1) = cos(pos / 10000^{2i/d_model})
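As a rough illustration, the two formulas translate directly into code. This is a minimal sketch that assumes an even `d_model`; the function name `sinusoidal_encoding` is a hypothetical choice for this example.

```python
import math

def sinusoidal_encoding(pos, d_model):
    """Return the d_model-dimensional encoding of one position (d_model assumed even)."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.append(math.sin(angle))  # dimension 2i
        pe.append(math.cos(angle))  # dimension 2i+1
    return pe

pe0 = sinusoidal_encoding(0, 8)  # position 0: every sine is 0, every cosine is 1
pe1 = sinusoidal_encoding(1, 8)

# Adding the encoding to a (toy, all-zero) token embedding, element by element.
token_embedding = [0.0] * 8
x = [e + p for e, p in zip(token_embedding, pe1)]
```

In a real model the encodings for all positions would typically be precomputed once as a matrix and added to the whole embedding sequence.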
Properties of sinusoidal encodings:
- Each sine/cosine pair of dimensions oscillates at a different frequency, so every position has a unique fingerprint across the full d_model-dimensional vector.
- The dot product PE(pos) · PE(pos + k) is a function of k only, so relative offsets are geometrically consistent.
- They are defined for any position, so they can be applied to sequence lengths not seen during training (no learned table to look up), though the model is not guaranteed to extrapolate well to those lengths.
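The dot-product property follows from the identity sin(a)sin(b) + cos(a)cos(b) = cos(a - b): each sine/cosine pair contributes cos(k / 10000^{2i/d_model}), which does not involve pos. The sketch below checks this numerically; the helper names and the chosen values of `d_model` and `k` are assumptions made for illustration.

```python
import math

def sinusoidal_encoding(pos, d_model):
    """Sinusoidal encoding of one position (d_model assumed even)."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

d_model, k = 16, 3

# Same offset k, three very different absolute positions.
dots = [
    dot(sinusoidal_encoding(p, d_model), sinusoidal_encoding(p + k, d_model))
    for p in (0, 10, 200)
]

# Closed form: the sum over frequency pairs of cos(k * w_i), independent of pos.
expected = sum(math.cos(k / 10000 ** (2 * i / d_model)) for i in range(d_model // 2))

print(all(abs(d - expected) < 1e-9 for d in dots))
```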
Learned positional embeddings (used in BERT, GPT) treat position like a token vocabulary and embed each index with a learnable vector. They are expressive, but the table has one row per index, so they cannot represent positions beyond the maximum length fixed at training time.
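A minimal sketch of the lookup, to make the length cap concrete. The table here is only randomly initialised (in a real model its rows would be updated by gradient descent), and `max_len`, `d_model`, `position_table` and `add_position` are hypothetical names and sizes chosen for this example.

```python
import random

random.seed(0)
max_len, d_model = 8, 4  # hypothetical sizes

# One row per position index; these rows would be trained along with the model.
position_table = [
    [random.gauss(0.0, 0.02) for _ in range(d_model)] for _ in range(max_len)
]

def add_position(token_embeddings):
    """Add the vector for index i to the i-th token embedding."""
    if len(token_embeddings) > max_len:
        # There is simply no row for positions >= max_len.
        raise ValueError("sequence is longer than the position table")
    return [
        [t + p for t, p in zip(tok, position_table[i])]
        for i, tok in enumerate(token_embeddings)
    ]

tokens = [[0.0] * d_model for _ in range(max_len)]
with_positions = add_position(tokens)
```

Passing a sequence of `max_len + 1` tokens raises the error, which is the cap described above.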
Relative positional encodings (RoPE, ALiBi) encode the distance between positions directly in the attention score. They give better length generalisation and are now dominant in large language models.
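To illustrate the relative behaviour, here is a simplified RoPE-style sketch: each consecutive pair of dimensions in a query or key is rotated by an angle proportional to its position, and the resulting dot product depends only on the offset between the two positions. The interleaved pairing, the base of 10000, and the function names are assumptions for this example; production implementations differ in layout and details.

```python
import math

def rope(x, pos, base=10000.0):
    """Rotate each pair (x[2i], x[2i+1]) by the angle pos * base^(-2i/d)."""
    d = len(x)
    out = []
    for i in range(d // 2):
        theta = pos * base ** (-2 * i / d)
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[2 * i], x[2 * i + 1]
        out.extend([a * c - b * s, a * s + b * c])  # 2-D rotation
    return out

# Toy query and key vectors (arbitrary values).
q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

def score(m, n):
    """Unnormalised attention score for a query at position m and a key at position n."""
    return sum(a * b for a, b in zip(rope(q, m), rope(k, n)))

same_offset_a = score(5, 2)      # offset 3
same_offset_b = score(105, 102)  # offset 3, shifted by 100 positions
other_offset = score(5, 3)       # offset 2

print(abs(same_offset_a - same_offset_b) < 1e-9)
```

Shifting both positions by the same amount leaves the score unchanged, while changing the offset changes it.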