datarekha
Deep Learning · Easy · Asked at Google · Asked at OpenAI · Asked at Meta

Why does a transformer need positional encoding?

The short answer

Self-attention computes a weighted sum over value vectors where the weights depend only on dot products between queries and keys — there is no notion of position in this operation. Without an explicit positional signal injected into the token embeddings, the model cannot distinguish 'the dog bit the man' from 'the man bit the dog'.

How to think about it

Self-attention is permutation-equivariant: shuffle the input tokens and the output shuffles in exactly the same way. No order information survives the softmax weighting. RNNs get order for free because step t literally receives h_{t-1}; transformers process all positions in parallel and have no such inductive bias.
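The equivariance claim can be checked numerically. The following is a minimal sketch, not a production attention layer: it assumes a single head with identity projections (Q = K = V = the input), and the toy sizes and the helper names `softmax` and `self_attention` are illustrative choices. Learned projection matrices would not change the result, since they act on each token independently.

```python
import math
import random

def softmax(row):
    # Subtract the max for numerical stability before exponentiating.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Single-head attention with identity projections (assumption: Q = K = V = x)."""
    d = len(x[0])
    out = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(d)])
    return out

random.seed(0)
tokens = [[random.gauss(0, 1) for _ in range(4)] for _ in range(5)]  # 5 tokens, d = 4
perm = [3, 0, 4, 1, 2]                                               # an arbitrary shuffle

out = self_attention(tokens)
out_shuffled = self_attention([tokens[p] for p in perm])

# Row i of the shuffled output should match row perm[i] of the original output.
max_diff = max(
    abs(a - b)
    for i, p in enumerate(perm)
    for a, b in zip(out_shuffled[i], out[p])
)
print(max_diff)  # a value at floating-point noise level
```

Because the outputs are merely reordered, nothing downstream can recover which token came first unless position is injected separately.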

The original Transformer (Vaswani et al., 2017) used fixed sinusoidal encodings added to the input embeddings before the first layer:

PE(pos, 2i)   = sin(pos / 10000^{2i/d_model})
PE(pos, 2i+1) = cos(pos / 10000^{2i/d_model})

Properties of sinusoidal encodings:

  • Each dimension oscillates at a different frequency, so every position has a unique fingerprint across the full d_model-dimensional vector.
  • The dot product PE(pos) · PE(pos + k) is a function of k only — relative offsets are geometrically consistent.
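  • They can be evaluated at any position, including sequence lengths not seen during training (there is no learned table to look up), although how well a trained model actually extrapolates is a separate question.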
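The properties above can be verified with a short sketch of the sinusoidal formula. It assumes an even d_model and uses a small illustrative size; `sinusoidal_pe` and `dot` are hypothetical helper names. The check relies on the identity sin(a)sin(b) + cos(a)cos(b) = cos(a - b), which makes each frequency pair contribute a term that depends only on the offset.

```python
import math

def sinusoidal_pe(pos, d_model):
    """Encoding for one position; assumes d_model is even."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.append(math.sin(angle))  # dimension 2i
        pe.append(math.cos(angle))  # dimension 2i + 1
    return pe

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

d_model = 16
k = 5  # relative offset

# The same offset k measured at two different absolute positions.
near = dot(sinusoidal_pe(3, d_model), sinusoidal_pe(3 + k, d_model))
far = dot(sinusoidal_pe(40, d_model), sinusoidal_pe(40 + k, d_model))
print(abs(near - far))  # floating-point noise: the dot product depends on k only

# Each sin/cos pair contributes sin^2 + cos^2 = 1, so the squared norm is d_model / 2.
self_dot = dot(sinusoidal_pe(7, d_model), sinusoidal_pe(7, d_model))

# No table is involved, so any position can be encoded, e.g. one far beyond a training length.
long_pe = sinusoidal_pe(100_000, d_model)
```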

Learned positional embeddings (used in BERT, GPT) treat position like a token vocabulary and embed each index with a learnable vector. They are expressive but cap at the maximum training length.
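The length cap follows directly from the table lookup. Here is a minimal sketch, assuming a toy table of 8 positions with randomly initialised vectors standing in for trained ones; `position_table`, `add_positions`, and the sizes are hypothetical and not taken from BERT or GPT.

```python
import random

random.seed(0)
max_len, d_model = 8, 4

# One vector per position index. In a real model these are trained parameters;
# here they are only randomly initialised (training is omitted).
position_table = [
    [random.gauss(0, 0.02) for _ in range(d_model)] for _ in range(max_len)
]

def add_positions(token_embeddings):
    """Add the position vector for index i to the i-th token embedding."""
    if len(token_embeddings) > max_len:
        raise ValueError(
            f"sequence length {len(token_embeddings)} exceeds table size {max_len}"
        )
    return [
        [t + p for t, p in zip(tok, position_table[i])]
        for i, tok in enumerate(token_embeddings)
    ]

ok = add_positions([[0.0] * d_model for _ in range(max_len)])  # fits the table

try:
    add_positions([[0.0] * d_model for _ in range(max_len + 1)])  # one token too many
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```

Position 8 has no row in the table, so there is simply nothing to look up for it.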

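Relative positional encodings (RoPE, ALiBi) make the attention score depend directly on the distance between positions: RoPE rotates queries and keys by position-dependent angles, and ALiBi adds a distance-proportional penalty to the score. They tend to generalise better to longer sequences and are now dominant in large language models.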
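A small sketch can show what "distance in the attention score" means in practice. It makes several assumptions: the RoPE variant below rotates adjacent (even, odd) dimension pairs, which is one common convention; the ALiBi slope of 0.5 is an arbitrary illustrative value (real models use a fixed per-head slope schedule); and the absolute value in `alibi_bias` is a simplification for the non-causal case. The function names and example vectors are hypothetical.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate each (even, odd) pair of vec by an angle proportional to pos."""
    d = len(vec)
    out = []
    for i in range(d // 2):
        theta = pos / (base ** (2 * i / d))
        x, y = vec[2 * i], vec[2 * i + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def alibi_bias(query_pos, key_pos, slope=0.5):
    """ALiBi-style penalty, linear in distance, added to the raw attention score."""
    return -slope * abs(query_pos - key_pos)

q = [0.5, -1.0, 2.0, 0.3]
k = [1.5, 0.2, -0.7, 1.1]

# Same relative offset (3) at very different absolute positions.
score_a = dot(rope(q, 7), rope(k, 4))
score_b = dot(rope(q, 107), rope(k, 104))
print(abs(score_a - score_b))  # floating-point noise: only the offset matters

bias_a = alibi_bias(7, 4)
bias_b = alibi_bias(107, 104)
```

In both schemes the absolute positions cancel out of the score, which is why the same pattern learned at short range can be reused further along the sequence.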
