Deep Learning Medium
Why do we scale the dot-product attention scores by the square root of d_k?
The short answer
For large key dimension d_k, the dot products grow large in magnitude, pushing softmax into saturated regions where gradients are tiny. Dividing by the square root of d_k keeps the score variance around one, stabilizing gradients and training.
How to think about it
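Treat the components of a query q and a key k as independent random variables with mean 0 and variance 1. The dot product of q and k is then a sum of d_k independent terms q_i * k_i, each with mean 0 and variance 1, so the sum has mean 0 and variance d_k. Typical scores therefore have magnitude on the order of the square root of d_k: about 8 for d_k = 64 and about 23 for d_k = 512. When the scores fed to softmax are spread that widely, the largest one takes almost all of the probability mass, the output is close to one-hot, and the gradient with respect to the scores is close to zero. Dividing by the square root of d_k rescales the variance from d_k back to 1, so the softmax input stays in a range where gradients remain useful regardless of the key dimension.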
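The variance argument can be checked numerically. The following is a minimal sketch, not a reference implementation: it assumes query and key entries are drawn independently from a standard normal distribution, and the helper names score_variances and max_softmax_weight are hypothetical, introduced only for this illustration.

```python
import numpy as np


def score_variances(d_k, n_samples=20000, seed=0):
    """Empirical variance of raw and scaled dot products q . k.

    Assumption: entries of q and k are i.i.d. standard normal.
    """
    rng = np.random.default_rng(seed)
    q = rng.standard_normal((n_samples, d_k))
    k = rng.standard_normal((n_samples, d_k))
    raw = np.sum(q * k, axis=1)      # one dot product per sampled (q, k) pair
    scaled = raw / np.sqrt(d_k)      # the scaling used in dot-product attention
    return raw.var(), scaled.var()


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()


def max_softmax_weight(d_k, n_keys=16, seed=0, scale=True):
    """Largest attention weight for one random query against n_keys random keys."""
    rng = np.random.default_rng(seed)
    q = rng.standard_normal(d_k)
    keys = rng.standard_normal((n_keys, d_k))
    scores = keys @ q
    if scale:
        scores = scores / np.sqrt(d_k)
    return softmax(scores).max()


if __name__ == "__main__":
    for d_k in (4, 64, 512):
        raw_var, scaled_var = score_variances(d_k)
        print(f"d_k={d_k}: raw variance={raw_var:.2f}, scaled variance={scaled_var:.2f}")
        print(f"  max weight unscaled={max_softmax_weight(d_k, scale=False):.3f}, "
              f"scaled={max_softmax_weight(d_k, scale=True):.3f}")
```

Under these assumptions the raw variance should track d_k while the scaled variance stays near 1. The largest softmax weight is never smaller without scaling than with it, because dividing the scores by a constant greater than 1 can only flatten the softmax output. Real trained queries and keys are not i.i.d. normal, so this illustrates the mechanism rather than measuring any particular model.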