Deep Learning Medium
What is the difference between Adam and AdamW?
The short answer
Adam combines momentum and per-parameter adaptive learning rates, but its L2 regularization gets entangled with the adaptive scaling. AdamW decouples weight decay from the gradient-based update, applying decay directly to the weights, which yields better generalization and is the standard optimizer for training transformers.
How to think about it
Adam combines momentum and per-parameter adaptive learning rates, but its L2 regularization gets entangled with the adaptive scaling. AdamW decouples weight decay from the gradient-based update, applying decay directly to the weights, which yields better generalization and is the standard optimizer for training transformers.