How do SGD, SGD with momentum, and RMSProp differ, and what does each one fix?
Vanilla SGD updates weights by a fixed fraction of the current gradient and oscillates badly in narrow loss valleys. Momentum accumulates a velocity vector that dampens oscillation and accelerates in consistent directions. RMSProp divides each parameter's step by the square root of a running average of its squared gradients, preventing large-gradient dimensions from dominating and stabilising training on non-stationary objectives.
How to think about it
These three optimizers form a natural progression: each solves a problem the previous one leaves unaddressed.
Vanilla SGD
θ_t = θ_{t-1} − α · g_t
Simple and memory-efficient, but the gradient estimated from a single mini-batch is noisy, and the same learning rate α is applied to every parameter. In loss landscapes with very different curvatures along different axes, a learning rate small enough to avoid diverging in the steep direction makes negligible progress in the shallow direction.
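The curvature problem can be seen on a toy quadratic. The following is a minimal sketch, not a benchmark: the function name `sgd_quadratic`, the loss f(x, y) = 0.5·(100·x² + y²), and the two learning rates are illustrative assumptions, and the gradient is exact, so mini-batch noise is left out.

```python
def sgd_quadratic(lr, steps=100, start=(1.0, 1.0), curvatures=(100.0, 1.0)):
    """Plain gradient descent on f(x, y) = 0.5 * (a*x^2 + b*y^2) (hypothetical toy loss)."""
    x, y = start
    a, b = curvatures
    for _ in range(steps):
        gx, gy = a * x, b * y              # exact gradient of the toy loss
        x, y = x - lr * gx, y - lr * gy    # theta_t = theta_{t-1} - lr * g_t
    return x, y


# Just below the stability limit 2/100 for the steep axis: x converges,
# but y (the shallow axis) has barely moved after 100 steps.
x_ok, y_ok = sgd_quadratic(lr=0.019)

# Just above the limit: x blows up, even though y would have been fine.
x_bad, y_bad = sgd_quadratic(lr=0.021)

print(f"lr=0.019: x={x_ok:.2e}, y={y_ok:.3f}")
print(f"lr=0.021: x={x_bad:.2e}, y={y_bad:.3f}")
```

The steep axis alone dictates the largest usable learning rate, which is why the shallow axis is left crawling.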
SGD + Momentum
v_t = γ·v_{t-1} + α·g_t
θ_t = θ_{t-1} − v_t
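Accumulates a velocity vector v. Consistent gradient directions build up speed; opposing gradients cancel, reducing oscillation. A typical value is γ=0.9, which lets the effective step grow to as much as α/(1−γ) = 10α along a direction where the gradient stays constant. SGD with momentum remains the standard optimizer behind many strong image-classification results, including the original ResNet training recipe.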
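To illustrate the acceleration in a consistent direction, here is a small sketch of the two update equations applied to a toy quadratic. The name `momentum_quadratic`, the loss f(x, y) = 0.5·(100·x² + y²), and the hyperparameters are assumptions chosen for illustration; setting `gamma=0.0` recovers plain gradient descent for comparison.

```python
def momentum_quadratic(lr, gamma, steps=100, start=(1.0, 1.0), curvatures=(100.0, 1.0)):
    """Heavy-ball momentum on f(x, y) = 0.5 * (a*x^2 + b*y^2) (hypothetical toy loss)."""
    x, y = start
    a, b = curvatures
    vx = vy = 0.0                          # velocity starts at zero
    for _ in range(steps):
        gx, gy = a * x, b * y              # exact gradient of the toy loss
        vx = gamma * vx + lr * gx          # v_t = gamma * v_{t-1} + lr * g_t
        vy = gamma * vy + lr * gy
        x, y = x - vx, y - vy              # theta_t = theta_{t-1} - v_t
    return x, y


x_plain, y_plain = momentum_quadratic(lr=0.019, gamma=0.0)  # no momentum
x_mom, y_mom = momentum_quadratic(lr=0.019, gamma=0.9)      # typical gamma

print(f"plain:    x={x_plain:.2e}, y={y_plain:.2e}")
print(f"momentum: x={x_mom:.2e}, y={y_mom:.2e}")
```

With the same learning rate, the shallow coordinate y ends up much closer to the minimum when momentum is on, because its gradient keeps pointing the same way and the velocity builds up. The steep coordinate x still converges in this setup.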
RMSProp
E[g²]_t = ρ·E[g²]_{t-1} + (1−ρ)·g_t²
θ_t = θ_{t-1} − α · g_t / (√E[g²]_t + ε)
Maintains a per-parameter running average of squared gradients. Dividing by the square root of this average normalises updates: parameters with consistently large gradients take smaller steps, and parameters with small gradients take relatively larger ones. Proposed by Geoffrey Hinton in his Coursera lecture notes as a mini-batch-friendly adaptation of Rprop, it became popular for RNNs, where gradient scales vary wildly across time steps.
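The normalising effect can be checked by hand. The sketch below implements the two RMSProp equations in plain Python; the helper name `rmsprop_step` and the constant gradients of 100 and 0.01 are assumptions made up for the demonstration, not values from any real model.

```python
import math


def rmsprop_step(theta, grad, sq_avg, lr=1e-3, rho=0.9, eps=1e-8):
    """One RMSProp update on lists of floats (hypothetical helper)."""
    # E[g^2]_t = rho * E[g^2]_{t-1} + (1 - rho) * g_t^2
    new_sq_avg = [rho * s + (1 - rho) * g * g for s, g in zip(sq_avg, grad)]
    # theta_t = theta_{t-1} - lr * g_t / (sqrt(E[g^2]_t) + eps)
    new_theta = [
        t - lr * g / (math.sqrt(s) + eps)
        for t, g, s in zip(theta, grad, new_sq_avg)
    ]
    return new_theta, new_sq_avg


theta = [0.0, 0.0]
sq_avg = [0.0, 0.0]
grad = [100.0, 0.01]    # two parameters whose gradients differ by 10,000x

for _ in range(200):
    prev = theta
    theta, sq_avg = rmsprop_step(theta, grad, sq_avg)

# Size of the most recent step taken by each parameter.
last_steps = [abs(t - p) for t, p in zip(theta, prev)]
print(last_steps)
```

Once the running average has warmed up, both parameters move by roughly `lr` per step despite the large difference in gradient magnitude, which is the scale mismatch that plain SGD cannot correct.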
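```python
import torch

model = torch.nn.Linear(256, 10)

# Vanilla SGD
opt_sgd = torch.optim.SGD(model.parameters(), lr=0.01)

# SGD + momentum (the Nesterov variant often works better in practice).
# Note: PyTorch computes v = momentum*v + g and then θ -= lr*v, so lr sits
# outside the velocity rather than inside it as in the equations above.
opt_momentum = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, nesterov=True)

# RMSProp: `alpha` is the smoothing constant ρ from the equations above
opt_rms = torch.optim.RMSprop(model.parameters(), lr=1e-3, alpha=0.99, eps=1e-8)
```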
At a glance
| Optimizer | Fixes | Does not fix |
|---|---|---|
| SGD | — | Noise, scale mismatch, ravines |
| + Momentum | Noisy directions, ravines | Scale mismatch |
| RMSProp | Scale mismatch | Noisy directions (no velocity accumulation) |
| Adam (momentum + RMSProp-style scaling) | Noise, ravines, scale mismatch | Can generalise worse than well-tuned SGD on some tasks |