datarekha

What is the difference between Adam and AdamW?

The short answer

Adam combines momentum and per-parameter adaptive learning rates, but its L2 regularization gets entangled with the adaptive scaling. AdamW decouples weight decay from the gradient-based update, applying decay directly to the weights, which yields better generalization and is the standard optimizer for training transformers.

How to think about it

Adam combines momentum and per-parameter adaptive learning rates, but its L2 regularization gets entangled with the adaptive scaling. AdamW decouples weight decay from the gradient-based update, applying decay directly to the weights, which yields better generalization and is the standard optimizer for training transformers.

Learn it properly SGD → Adam → AdamW

Keep practising

All Deep Learning questions

Explore further

Skip to content