datarekha

Why do we scale the dot-product attention scores by the square root of d_k?

The short answer

For large key dimension d_k, the dot products grow large in magnitude, pushing softmax into saturated regions where gradients are tiny. Dividing by the square root of d_k keeps the score variance around one, stabilizing gradients and training.

How to think about it

For large key dimension d_k, the dot products grow large in magnitude, pushing softmax into saturated regions where gradients are tiny. Dividing by the square root of d_k keeps the score variance around one, stabilizing gradients and training.

Learn it properly Self-attention

Keep practising

All Deep Learning questions

Explore further

Skip to content