Why do we scale the dot-product attention scores by the square root of d_k?

For ML Engineer research-engineer AI / LLM Engineer

The short answer

For large key dimension d_k, the dot products grow large in magnitude, pushing softmax into saturated regions where gradients are tiny. Dividing by the square root of d_k keeps the score variance around one, stabilizing gradients and training.

How to think about it

For large key dimension d_k, the dot products grow large in magnitude, pushing softmax into saturated regions where gradients are tiny. Dividing by the square root of d_k keeps the score variance around one, stabilizing gradients and training.

Learn it properly Self-attention

Keep practising

Why do we scale by sqrt(d_k) in scaled dot-product attention? Why is standard self-attention O(n^2) in sequence length, and how is it addressed? What does softmax do, and why is it used in the output layer? What problem does FlashAttention solve, and is it an approximation? Why does regularization require feature scaling, and what happens if you skip it?

All Deep Learning questions

Explore further

Softmax Differential attention Weight initialization FlashAttention

Weight Initialization Feature Scaling Softmax Gradient Clipping