Deep Learning | Medium | Asked at Google, DeepMind, Amazon

How do SGD, SGD with momentum, and RMSProp differ, and what does each one fix?

For: Data Scientist, ML Engineer, AI / LLM Engineer

The short answer

Vanilla SGD steps each weight by a fixed learning rate times the current mini-batch gradient, so it oscillates across narrow loss valleys while crawling along them. Momentum accumulates a velocity vector that damps those oscillations and accelerates along directions where the gradient keeps the same sign. RMSProp divides each parameter's step by the square root of a running average of its squared gradients, which keeps large-gradient dimensions from dominating and stabilises training on non-stationary objectives.

How to think about it

These three optimizers form a natural progression: each solves a problem the previous one leaves unaddressed.

Vanilla SGD

θ_t = θ_{t-1} − α · g_t

Simple and memory-efficient, but the gradient from a single mini-batch is noisy and the same learning rate α is applied to every parameter. In loss landscapes with very different curvatures along different axes, a learning rate small enough to avoid diverging in the steep direction makes negligible progress in the shallow direction.
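The curvature mismatch can be seen on a toy problem. The sketch below is illustrative only: the two-parameter quadratic, its curvatures, and the learning rates are assumptions chosen for the demonstration, and the gradient is exact, so mini-batch noise is not modelled.

```python
import numpy as np

# Hypothetical toy loss: f(theta) = 0.5 * sum(curvature * theta**2).
# The curvatures differ by 100x: one steep axis, one shallow axis.
curvature = np.array([100.0, 1.0])

def grad(theta):
    """Exact gradient of the toy quadratic (no mini-batch noise)."""
    return curvature * theta

def run_sgd(lr, steps=100):
    theta = np.array([1.0, 1.0])
    for _ in range(steps):
        theta = theta - lr * grad(theta)  # theta_t = theta_{t-1} - alpha * g_t
    return theta

# On this quadratic the steep axis is stable only while lr < 2 / 100.
stable = run_sgd(lr=0.015)    # steep axis converges, shallow axis barely moves
diverged = run_sgd(lr=0.021)  # steep axis overshoots further on every step

print("lr=0.015:", stable)
print("lr=0.021:", diverged)
```

With the stable learning rate the steep coordinate reaches zero almost immediately, while the shallow coordinate is still a noticeable distance from the optimum after 100 steps. Raising the learning rate slightly to speed up the shallow axis makes the steep axis diverge.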

SGD + Momentum

v_t = γ·v_{t-1} + α·g_t
θ_t = θ_{t-1} − v_t

Accumulates a velocity vector v. Gradients that keep pointing the same way build up speed; gradients that flip sign from step to step cancel, which reduces oscillation. A typical value is γ = 0.9. SGD with momentum is the optimizer used to train many landmark image classification networks, such as ResNet.
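A minimal sketch of the momentum update, run on a hypothetical two-parameter quadratic whose curvatures differ by 100x. The toy loss, learning rate, and step count are assumptions picked for illustration; setting gamma to 0 recovers vanilla SGD, which gives a like-for-like comparison.

```python
import numpy as np

# Hypothetical toy loss: f(theta) = 0.5 * sum(curvature * theta**2).
curvature = np.array([100.0, 1.0])

def loss(theta):
    return 0.5 * float(np.sum(curvature * theta**2))

def grad(theta):
    return curvature * theta

def run_momentum(lr, gamma, steps=100):
    theta = np.array([1.0, 1.0])
    v = np.zeros_like(theta)
    for _ in range(steps):
        v = gamma * v + lr * grad(theta)  # v_t = gamma * v_{t-1} + alpha * g_t
        theta = theta - v                 # theta_t = theta_{t-1} - v_t
    return theta

plain = run_momentum(lr=0.015, gamma=0.0)          # gamma = 0 is vanilla SGD
with_momentum = run_momentum(lr=0.015, gamma=0.9)

print("plain SGD:", plain, "loss:", loss(plain))
print("momentum :", with_momentum, "loss:", loss(with_momentum))
```

At the same learning rate, the velocity term carries the shallow coordinate much closer to the optimum, and the final loss is lower. This run illustrates the acceleration effect; the damping of noisy, sign-flipping gradients would need a stochastic gradient to show.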

RMSProp

E[g²]_t = ρ·E[g²]_{t-1} + (1−ρ)·g_t²
θ_t = θ_{t-1} − α · g_t / (√E[g²]_t + ε)

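Maintains a per-parameter running average of squared gradients. Dividing by the square root of this average normalises updates: parameters that consistently receive large gradients get smaller steps, and parameters with small or infrequent gradients get relatively larger steps. Proposed by Geoffrey Hinton in his Coursera lecture notes as a mini-batch-friendly variant of Rprop, it became popular for RNNs, where gradient scales vary wildly across time steps.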
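The normalisation is easiest to see on a single step. The sketch below implements the two equations above directly; the helper name rmsprop_step, the hyperparameter values, and the example gradient are assumptions made for illustration.

```python
import numpy as np

def rmsprop_step(theta, sq_avg, g, lr=1e-2, rho=0.9, eps=1e-8):
    """One RMSProp update; returns the new parameters and running average."""
    sq_avg = rho * sq_avg + (1.0 - rho) * g**2        # E[g^2]_t
    theta = theta - lr * g / (np.sqrt(sq_avg) + eps)  # per-parameter scaled step
    return theta, sq_avg

theta = np.array([1.0, 1.0])
sq_avg = np.zeros_like(theta)
g = np.array([100.0, 1.0])  # hypothetical gradients differing by 100x

new_theta, sq_avg = rmsprop_step(theta, sq_avg, g)
first_step = theta - new_theta

print("gradient  :", g)
print("step taken:", first_step)
```

Although the two gradients differ by a factor of 100, both parameters move by essentially the same amount, lr / √(1 − ρ), because each gradient is divided by its own scale. Plain SGD would have moved the first parameter 100 times further than the second.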

import torch

model = torch.nn.Linear(256, 10)

# Vanilla SGD
opt_sgd = torch.optim.SGD(model.parameters(), lr=0.01)

# SGD + momentum; nesterov=True switches to the Nesterov variant, which often works better
opt_momentum = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, nesterov=True)

# RMSProp (alpha is the decay rate ρ in the update rule; eps is ε)
opt_rms = torch.optim.RMSprop(model.parameters(), lr=1e-3, alpha=0.99, eps=1e-8)
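As a sanity check on how the constructor arguments map onto the RMSProp equations, the sketch below takes one optimizer step and compares it with the hand-computed update, treating alpha as ρ. The two-element parameter and the linear loss are assumptions chosen so that the gradient is known exactly.

```python
import torch

# Two parameters whose gradients differ by a factor of 10,000.
w = torch.nn.Parameter(torch.tensor([1.0, -2.0]))
scale = torch.tensor([100.0, 0.01])

opt = torch.optim.RMSprop([w], lr=1e-3, alpha=0.99, eps=1e-8)

before = w.detach().clone()
loss = (scale * w).sum()  # the gradient w.r.t. w is exactly `scale`
opt.zero_grad()
loss.backward()
opt.step()

step = before - w.detach()

# First step by hand: E[g^2] starts at zero, so E[g^2]_1 = (1 - rho) * g^2.
g = scale
expected = 1e-3 * g / (torch.sqrt((1 - 0.99) * g**2) + 1e-8)

print("optimizer step:", step)
print("by hand       :", expected)
```

Both parameters move by roughly lr / √(1 − alpha) on the first step, regardless of gradient magnitude, which matches the per-parameter scaling described for RMSProp.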

At a glance

Optimizer	Fixes	Does not fix
SGD	—	Noise, scale mismatch, ravines
+ Momentum	Noisy directions, ravines	Scale mismatch
RMSProp	Scale mismatch	Noise, no acceleration along consistent directions
Adam	Noise, ravines, scale mismatch	Generalisation gap vs tuned SGD
