datarekha
Deep Learning · Medium · Asked at Google, DeepMind, Amazon

How do SGD, SGD with momentum, and RMSProp differ, and what does each one fix?

The short answer

Vanilla SGD updates weights by a fixed fraction of the current gradient and oscillates badly in narrow loss valleys. Momentum accumulates a velocity vector that dampens oscillation and accelerates in consistent directions. RMSProp divides each parameter's step by the square root of a running average of its squared gradients, preventing large-gradient dimensions from dominating and stabilising training on non-stationary objectives.

How to think about it

These three optimizers form a natural progression: each solves a problem the previous one leaves unaddressed.

Vanilla SGD

θ_t = θ_{t-1} − α · g_t

Simple and memory-efficient, but the gradient estimated from a single mini-batch is noisy and the same learning rate α is applied to every parameter. In loss landscapes with very different curvatures along different axes, a learning rate small enough to avoid diverging in the steep direction makes negligible progress in the shallow direction.
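
The curvature problem can be seen on a toy example. The sketch below is illustrative only: the quadratic loss, the curvatures (100 and 1), and the helper names `sgd_step`, `quad_grad`, and `run_sgd` are assumptions made for this demo, and the gradient is exact rather than a noisy mini-batch estimate.

```python
def sgd_step(theta, grad, lr):
    """One vanilla SGD update: theta <- theta - lr * grad."""
    return [t - lr * g for t, g in zip(theta, grad)]


def quad_grad(theta, curvatures=(100.0, 1.0)):
    """Gradient of the toy loss f(theta) = 0.5 * sum(c_i * theta_i**2)."""
    return [c * t for c, t in zip(curvatures, theta)]


def run_sgd(lr, steps=100, start=(1.0, 1.0)):
    """Run plain SGD on the toy loss and return the final parameters."""
    theta = list(start)
    for _ in range(steps):
        theta = sgd_step(theta, quad_grad(theta), lr)
    return theta


# Just below the stability limit of the steep axis (lr < 2 / 100).
steep, shallow = run_sgd(lr=0.019)
print(f"lr=0.019: steep={steep:.2e}, shallow={shallow:.3f}")

# Just above that limit the steep axis blows up.
steep_bad, _ = run_sgd(lr=0.021)
print(f"lr=0.021: steep={steep_bad:.2e}")
```

With the stable learning rate the steep coordinate is essentially at zero after 100 steps, while the shallow coordinate has only shrunk to roughly 0.15 of its starting value. Nudging the learning rate slightly higher makes the steep coordinate diverge.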

SGD + Momentum

v_t = γ·v_{t-1} + α·g_t
θ_t = θ_{t-1} − v_t

Accumulates a velocity vector v, an exponentially weighted sum of past gradients. Consistent gradient directions build up speed; opposing gradients cancel, reducing oscillation. A typical value is γ = 0.9. SGD with momentum is the optimizer behind many landmark image classification results, ResNet among them.
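
A minimal sketch of the two momentum equations, run on an assumed toy quadratic with curvatures 100 and 1. The loss and the helper names `quad_grad` and `run` are hypothetical and exist only for this comparison; setting γ = 0 recovers vanilla SGD, so the same function serves as the baseline.

```python
def quad_grad(theta, curvatures=(100.0, 1.0)):
    """Gradient of the toy loss f(theta) = 0.5 * sum(c_i * theta_i**2)."""
    return [c * t for c, t in zip(curvatures, theta)]


def run(lr, gamma, steps=100, start=(1.0, 1.0)):
    """SGD with momentum; gamma=0.0 reduces to vanilla SGD."""
    theta = list(start)
    velocity = [0.0] * len(theta)
    for _ in range(steps):
        grad = quad_grad(theta)
        # v_t = gamma * v_{t-1} + lr * g_t
        velocity = [gamma * v + lr * g for v, g in zip(velocity, grad)]
        # theta_t = theta_{t-1} - v_t
        theta = [t - v for t, v in zip(theta, velocity)]
    return theta


plain = run(lr=0.019, gamma=0.0)
heavy = run(lr=0.019, gamma=0.9)
print(f"shallow axis, no momentum: {plain[1]:.4f}")
print(f"shallow axis, gamma=0.9:   {heavy[1]:.4f}")
```

At the same learning rate, the run with γ = 0.9 ends far closer to the minimum along the shallow axis, because the small but consistent gradients there keep adding to the velocity.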

RMSProp

E[g²]_t = ρ·E[g²]_{t-1} + (1−ρ)·g_t²
θ_t = θ_{t-1} − α · g_t / (√E[g²]_t + ε)

Maintains a per-parameter running average of squared gradients. Dividing by the square root of this average normalises updates: parameters that frequently receive large gradients get smaller steps; rarely-updated parameters get larger steps. Proposed by Geoffrey Hinton in his Coursera lectures as a mini-batch adaptation of rprop, it became a common choice for RNNs, where gradient scales vary wildly across time steps.
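
The normalising effect can be checked with a hand-written version of the two equations. This is a sketch under stated assumptions: `rmsprop_step` is a hypothetical helper, the running average starts at zero, and the two gradients are held constant at values four orders of magnitude apart purely for illustration.

```python
import math


def rmsprop_step(theta, grad, sq_avg, lr=1e-3, rho=0.99, eps=1e-8):
    """One RMSProp update following the equations above."""
    # E[g^2]_t = rho * E[g^2]_{t-1} + (1 - rho) * g_t^2
    new_sq_avg = [rho * s + (1 - rho) * g * g for s, g in zip(sq_avg, grad)]
    # theta_t = theta_{t-1} - lr * g_t / (sqrt(E[g^2]_t) + eps)
    new_theta = [
        t - lr * g / (math.sqrt(s) + eps)
        for t, g, s in zip(theta, grad, new_sq_avg)
    ]
    return new_theta, new_sq_avg


theta = [0.0, 0.0]
sq_avg = [0.0, 0.0]
grad = [100.0, 0.01]  # assumed constant gradients of very different scale

for _ in range(1000):
    theta, sq_avg = rmsprop_step(theta, grad, sq_avg)

print(theta)
```

Both parameters end up displaced by almost the same amount, even though their gradients differ by a factor of 10,000. Vanilla SGD with the same α would move them by amounts that differ by that same factor.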

import torch

model = torch.nn.Linear(256, 10)

# Vanilla SGD
opt_sgd = torch.optim.SGD(model.parameters(), lr=0.01)

# SGD + momentum (nesterov=True selects the Nesterov variant, which often works better)
opt_momentum = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, nesterov=True)

# RMSProp (alpha is the decay rate ρ in the equations above)
opt_rms = torch.optim.RMSprop(model.parameters(), lr=1e-3, alpha=0.99, eps=1e-8)
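
One detail worth knowing when mapping the momentum equations onto `torch.optim.SGD`: as its documentation describes it, PyTorch keeps the learning rate outside the velocity buffer (buf = μ·buf + g, then θ = θ − lr·buf) rather than inside it as in the equations above. The pure-Python sketch below compares the two forms without Nesterov and with dampening 0; it assumes a constant learning rate, and the function names and gradient sequence are made up for the illustration.

```python
def momentum_lr_in_velocity(grads, lr, gamma):
    """Form used in the text: v = gamma*v + lr*g, then theta = theta - v."""
    theta, v = 0.0, 0.0
    for g in grads:
        v = gamma * v + lr * g
        theta -= v
    return theta


def momentum_lr_outside(grads, lr, mu):
    """Form with the learning rate applied after the buffer update:
    buf = mu*buf + g, then theta = theta - lr*buf."""
    theta, buf = 0.0, 0.0
    for g in grads:
        buf = mu * buf + g
        theta -= lr * buf
    return theta


grads = [0.5, -1.0, 2.0, 0.25, -0.75]  # arbitrary illustrative gradients
a = momentum_lr_in_velocity(grads, lr=0.01, gamma=0.9)
b = momentum_lr_outside(grads, lr=0.01, mu=0.9)
print(a, b)
```

With a fixed learning rate the two forms trace the same trajectory up to floating-point rounding. They part ways once the learning rate changes mid-training, for example under a scheduler, because the first form bakes old learning rates into the velocity.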

At a glance

| Optimizer | Fixes | Does not fix |
| --- | --- | --- |
| SGD | None | Noise, scale mismatch, ravines |
| + Momentum | Noisy directions, ravines | Scale mismatch |
| RMSProp | Scale mismatch | Momentum / consistent direction |
| Adam | All three | Generalisation gap vs tuned SGD |