What is the difference between Adam and AdamW?

The short answer

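Adam combines momentum with per-parameter adaptive learning rates. When L2 regularization is added to the loss, its penalty gradient passes through the same adaptive scaling as the loss gradient, so the effective decay differs from weight to weight and parameters with a history of large gradients are decayed less. AdamW decouples weight decay from the gradient-based update and applies the decay directly to the weights. This tends to generalize better and makes the decay strength easier to tune independently of the learning rate, which is why AdamW is a common default for training transformers.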
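The difference is easiest to see in a single update step. The following is a minimal NumPy sketch, not any library's implementation: the function names `adam_l2_step` and `adamw_step` and the default hyperparameters are illustrative assumptions. The only change between the two is where the `weight_decay` term enters.

```python
import numpy as np

def adam_l2_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
                 eps=1e-8, weight_decay=1e-2):
    """One Adam step with L2 regularization folded into the gradient."""
    g = grad + weight_decay * w              # penalty joins the gradient...
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)             # bias correction, t starts at 1
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # ...and is rescaled here
    return w, m, v

def adamw_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-2):
    """One AdamW step: moments see only the loss gradient."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    adaptive = lr * m_hat / (np.sqrt(v_hat) + eps)
    decay = lr * weight_decay * w            # applied directly to the weights
    return w - adaptive - decay, m, v

# Illustrative check: with a zero loss gradient, only the decay term acts.
w0 = np.array([1.0, 1.0])
zeros = np.zeros_like(w0)
w_l2, _, _ = adam_l2_step(w0, zeros, zeros, zeros, t=1)
w_aw, _, _ = adamw_step(w0, zeros, zeros, zeros, t=1)
print("Adam + L2:", w_l2)
print("AdamW:    ", w_aw)
```

With a zero loss gradient, the AdamW step shrinks each weight by exactly the factor `1 - lr * weight_decay`, while the Adam + L2 step moves each weight by roughly `lr` regardless of how small `weight_decay` is, because the penalty gradient is normalized by its own magnitude.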

