What does the Adam optimizer do, and what problem does it solve over SGD?
Adam combines momentum (an exponential moving average of gradients) with RMSProp-style adaptive per-parameter learning rates (an exponential moving average of squared gradients). Parameters with consistently large gradients get smaller effective steps, and parameters with sparse or small gradients get larger ones. Compared with vanilla SGD, this makes Adam less sensitive to the choice of learning rate and often faster to converge early in training, although the learning rate usually still needs some tuning.
How to think about it
Adam (Adaptive Moment Estimation) is a common default optimizer for deep-learning work. It is best understood as SGD + momentum + per-parameter learning-rate scaling.
The update rule
m_t = β₁·m_{t-1} + (1−β₁)·g_t # 1st moment: gradient momentum
v_t = β₂·v_{t-1} + (1−β₂)·g_t² # 2nd moment: squared-gradient avg
m̂_t = m_t / (1−β₁ᵗ) # bias correction (early steps)
v̂_t = v_t / (1−β₂ᵗ)
θ_t = θ_{t-1} − α · m̂_t / (√v̂_t + ε)
Here g_t is the gradient at step t, α is the learning rate, and m_0 = v_0 = 0.
Default hyperparameters: α=1e-3, β₁=0.9, β₂=0.999, ε=1e-8.
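The update rule above can be sketched in plain Python as follows. This is a minimal illustration, not how any library implements it: the function name `adam_step`, the list-based parameter layout, and the toy objective f(x) = x² are assumptions made for the example.

```python
import math

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update over flat lists of floats; t is the 1-based step count."""
    new_theta, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(theta, grad, m, v):
        m_i = beta1 * m_i + (1 - beta1) * g        # 1st moment: gradient average
        v_i = beta2 * v_i + (1 - beta2) * g * g    # 2nd moment: squared-gradient average
        m_hat = m_i / (1 - beta1 ** t)             # bias correction
        v_hat = v_i / (1 - beta2 ** t)
        new_theta.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_theta, new_m, new_v

# Toy usage (hypothetical objective): minimise f(x) = x^2, whose gradient is 2x.
theta, m, v = [1.0], [0.0], [0.0]
for t in range(1, 2001):
    grad = [2.0 * theta[0]]
    theta, m, v = adam_step(theta, grad, m, v, t, lr=0.01)
print(theta[0])
```

The moment buffers `m` and `v` are returned and passed back in on every call, which mirrors the per-parameter state that an optimizer object would normally hold.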
What each part fixes
| Component | Fixes |
|---|---|
| Momentum (m) | Noisy gradient direction; accelerates through ravines |
| Adaptive lr (v) | Gradient scales that differ across parameters; reduces the need for per-layer learning-rate tuning |
| Bias correction | Moment estimates start at zero, so they are biased toward zero during early steps |
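Two of these rows can be checked numerically on the very first step. The sketch below is an illustration only; the helper `first_step_size` and its `correct_bias` flag are hypothetical names. With bias correction, the first update has magnitude close to the learning rate regardless of the gradient's scale. Without it, the first step comes out roughly (1−β₁)/√(1−β₂) ≈ 3.2 times larger under the default betas.

```python
import math

def first_step_size(g, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, correct_bias=True):
    """Magnitude of the first Adam update (t = 1) for a scalar gradient g."""
    m = (1 - beta1) * g        # m_0 = 0, so only the new gradient contributes
    v = (1 - beta2) * g * g    # v_0 = 0
    if correct_bias:
        m /= (1 - beta1)       # at t = 1 the correction factor is 1 - beta
        v /= (1 - beta2)
    return lr * abs(m) / (math.sqrt(v) + eps)

# Gradients four orders of magnitude apart give nearly the same corrected step.
for g in (100.0, 1.0, 0.01):
    print(g, first_step_size(g), first_step_size(g, correct_bias=False))
```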
import torch
model = torch.nn.Linear(128, 10)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# AdamW is preferred when weight decay is needed
optimizer_wd = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
AdamW vs Adam
When weight decay is added to standard Adam as an L2 penalty, the decay term is folded into the gradient and is therefore rescaled by the adaptive denominator, which couples the regularisation strength to the gradient scale. AdamW decouples the two: the weight decay is applied directly to the parameter, not to the gradient. AdamW is now the usual choice for transformer training.
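The difference can be seen in a scalar sketch. Both functions below are illustrative assumptions (the names `adam_l2_step` and `adamw_step` are made up for this example), written to follow the update rule given earlier and not copied from any library. With the loss gradient set to zero, the coupled variant still moves the weight by roughly `lr`, because the decay term is normalised by its own magnitude. The decoupled variant shrinks the weight by only `lr * wd * theta`.

```python
import math

def adam_l2_step(theta, g, m, v, t, lr=1e-3, wd=1e-2, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam with L2-style decay: wd * theta is added to the gradient (coupled)."""
    g = g + wd * theta
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return theta - lr * m_hat / (math.sqrt(v_hat) + eps), m, v

def adamw_step(theta, g, m, v, t, lr=1e-3, wd=1e-2, beta1=0.9, beta2=0.999, eps=1e-8):
    """AdamW-style step: decay shrinks the parameter directly (decoupled)."""
    theta = theta * (1 - lr * wd)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return theta - lr * m_hat / (math.sqrt(v_hat) + eps), m, v

# Zero loss gradient isolates the effect of the decay term on the first step.
theta_l2, _, _ = adam_l2_step(1.0, 0.0, 0.0, 0.0, t=1)
theta_w, _, _ = adamw_step(1.0, 0.0, 0.0, 0.0, t=1)
print(theta_l2, theta_w)
```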