datarekha
Deep Learning · Medium · Asked at Google, Meta, OpenAI, Microsoft

What does the Adam optimizer do, and what problem does it solve over SGD?

The short answer

Adam combines momentum (exponential moving average of gradients) with RMSProp-style adaptive per-parameter learning rates (exponential moving average of squared gradients). This means parameters with consistently large gradients get smaller effective steps, and sparse or small-gradient parameters get larger steps. As a result, Adam is less sensitive to the learning-rate setting and often converges faster than vanilla SGD, although the learning rate itself still needs tuning.

How to think about it

Adam (Adaptive Moment Estimation) is the default optimizer for most deep-learning work. It is best understood as SGD + momentum + per-parameter learning rate scaling.

The update rule

m_t = β₁·m_{t-1} + (1−β₁)·g_t          # 1st moment: gradient momentum
v_t = β₂·v_{t-1} + (1−β₂)·g_t²         # 2nd moment: squared-gradient avg

m̂_t = m_t / (1−β₁ᵗ)                    # bias correction (early steps)
v̂_t = v_t / (1−β₂ᵗ)

θ_t = θ_{t-1} − α · m̂_t / (√v̂_t + ε)

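Here g_t is the gradient at step t, g_t² is an element-wise square, and α is the learning rate.

Default hyperparameters: β₁=0.9, β₂=0.999, ε=1e-8, with α=1e-3 as the usual starting learning rate.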
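The update rule above can be sketched in plain Python. This is a minimal illustration, not a library implementation: the function name `adam_step`, the list-of-floats representation, and the toy objective f(x, y) = x² + 100·y² are all assumptions chosen for the example.

```python
import math

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update over lists of floats; t is the 1-based step count."""
    new_theta, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(theta, grad, m, v):
        m_i = beta1 * m_i + (1 - beta1) * g        # 1st moment: gradient average
        v_i = beta2 * v_i + (1 - beta2) * g * g    # 2nd moment: squared-gradient average
        m_hat = m_i / (1 - beta1 ** t)             # bias correction
        v_hat = v_i / (1 - beta2 ** t)
        new_theta.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_theta, new_m, new_v

# Toy objective (hypothetical): f(x, y) = x**2 + 100 * y**2, gradient (2x, 200y).
theta, m, v = [1.0, 1.0], [0.0, 0.0], [0.0, 0.0]
first_step = None
for t in range(1, 501):
    grad = [2 * theta[0], 200 * theta[1]]
    new_theta, m, v = adam_step(theta, grad, m, v, t, lr=1e-2)
    if t == 1:
        # How far each coordinate moved on the very first update.
        first_step = [old - new for old, new in zip(theta, new_theta)]
    theta = new_theta

print("first step per coordinate:", first_step)
print("theta after 500 steps:", theta)
```

On the first update the two coordinates move by almost the same amount (about `lr`), even though their gradients differ by a factor of 100. Plain SGD would instead move the second coordinate 100 times further.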

What each part fixes

Component | Fixes
Momentum (m) | Noisy gradient direction; accelerates through ravines
Adaptive lr (v) | Gradient scales that differ across parameters; less need for per-layer learning-rate tuning
Bias correction | Moment estimates that are biased toward zero in early steps, because m and v start at 0

import torch

model = torch.nn.Linear(128, 10)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# AdamW is preferred when weight decay is needed
optimizer_wd = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
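
The bias-correction row of the table can be checked with a few lines of arithmetic. The sketch below assumes a constant gradient of 5.0 (a made-up value), so the true gradient average is known, and compares the raw first moment with the corrected one.

```python
beta1 = 0.9
g = 5.0          # constant gradient (hypothetical), so the true mean is 5.0
m = 0.0          # first moment starts at zero, as in Adam

raw, corrected = [], []
for t in range(1, 6):
    m = beta1 * m + (1 - beta1) * g          # raw moving average
    raw.append(m)
    corrected.append(m / (1 - beta1 ** t))   # bias-corrected estimate

for t, (r, c) in enumerate(zip(raw, corrected), start=1):
    print(f"t={t}  raw m={r:.4f}  corrected m_hat={c:.4f}")
```

The raw average starts near 0.5 and only slowly climbs toward 5.0, while the corrected estimate sits at roughly 5.0 from the first step.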

AdamW vs Adam

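Standard Adam implements weight decay as L2 regularisation: the decay term (weight decay × θ) is added to the gradient before the moment estimates are computed. The decay is therefore rescaled by the adaptive denominator, which couples regularisation strength to each parameter's gradient scale. AdamW decouples them: the weight decay term is applied directly to the parameter, not to the gradient. AdamW is now the standard choice in transformer training.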
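
A scalar sketch can make the difference concrete. The function `adam_like_step` and its `decoupled` flag are hypothetical names used only for this illustration. The loss gradient is set to zero so that the only thing moving the parameter is the weight decay.

```python
import math

def adam_like_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
                   eps=1e-8, weight_decay=0.0, decoupled=False):
    """One scalar update: Adam with L2 (decoupled=False) or AdamW-style decay."""
    if not decoupled:
        grad = grad + weight_decay * theta       # L2 term folded into the gradient
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    new_theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        new_theta -= lr * weight_decay * theta   # decay applied directly to the parameter
    return new_theta, m, v

# Zero loss gradient isolates the effect of the decay term (illustrative values).
coupled, _, _ = adam_like_step(1.0, 0.0, 0.0, 0.0, t=1, lr=0.1,
                               weight_decay=0.01, decoupled=False)
decoupled, _, _ = adam_like_step(1.0, 0.0, 0.0, 0.0, t=1, lr=0.1,
                                 weight_decay=0.01, decoupled=True)
print("Adam + L2 :", coupled)
print("AdamW-style:", decoupled)
```

In the coupled case the tiny decay gradient is normalised by its own magnitude, so the parameter moves by roughly the full learning rate (to about 0.9). In the decoupled case it shrinks by exactly `lr * weight_decay * theta` (to about 0.999), independent of the adaptive scaling.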
