In a PyTorch training loop, why do you need to call optimizer.zero_grad() before backpropagation?

PyTorch accumulates gradients by default, adding new gradients to whatever is already stored in each parameter's .grad. If you do not zero them out each iteration, gradients from previous batches mix with the current batch and corrupt the weight updates. zero_grad() resets gradients to zero so each step uses only the current batch's signal.
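
A minimal sketch of that accumulation, assuming PyTorch is installed. The single weight `w` and the toy `loss_fn` are made up for illustration; `w.grad.zero_()` stands in for what `optimizer.zero_grad()` does per parameter (recent PyTorch versions set `.grad` to `None` by default instead of zero-filling, which has the same effect on the next `backward()`).

```python
import torch

# Hypothetical one-parameter "model": a single weight and a trivial loss.
w = torch.tensor(1.0, requires_grad=True)

def loss_fn(w):
    return 3.0 * w            # d(loss)/dw is always 3

loss_fn(w).backward()
first = w.grad.item()         # gradient from the first backward call

loss_fn(w).backward()         # no zeroing in between
accumulated = w.grad.item()   # the new gradient was ADDED to the old one

w.grad.zero_()                # manual reset of the buffer
loss_fn(w).backward()
after_reset = w.grad.item()   # only the current gradient again

print(first, accumulated, after_reset)
```

The second number is double the first even though the loss and the weight never changed, which is exactly the stale signal the reset removes.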

Walk me through how backpropagation works.

Backpropagation computes the gradient of the loss with respect to every parameter by applying the chain rule backward through the network, reusing intermediate results from the forward pass. These gradients are then used by an optimizer to update the weights via gradient descent.

Walk me through the forward pass of a neural network end-to-end.

The forward pass feeds an input through every layer in sequence: each layer computes a linear transform followed by an activation, caching the intermediate values needed later for backpropagation. The final layer produces a prediction, which is compared to the label via a loss function.
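
As a rough NumPy sketch of that description: the `forward` helper, the layer sizes, and the choice of ReLU on hidden layers with a plain linear output are all assumptions made for illustration, not something the answer above prescribes.

```python
import numpy as np

def forward(x, params):
    """Run x through each (W, b) layer; return the output and a cache."""
    cache = []
    a = x
    for i, (W, b) in enumerate(params):
        z = a @ W + b                       # linear transform
        cache.append((a, z))                # layer input and pre-activation, kept for backprop
        is_last = i == len(params) - 1
        a = z if is_last else np.maximum(z, 0.0)   # ReLU on hidden layers only
    return a, cache

rng = np.random.default_rng(0)
params = [
    (rng.standard_normal((3, 4)) * 0.5, np.zeros(4)),   # 3 inputs -> 4 hidden units
    (rng.standard_normal((4, 1)) * 0.5, np.zeros(1)),   # 4 hidden units -> 1 output
]
x = rng.standard_normal((5, 3))             # batch of 5 examples
y = rng.standard_normal((5, 1))             # made-up labels

pred, cache = forward(x, params)
loss = np.mean((pred - y) ** 2)             # compare prediction to label
print(pred.shape, len(cache), loss)
```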

What is backpropagation and how does the chain rule make it work?

Backpropagation is the algorithm that computes the gradient of the loss with respect to every parameter by applying the chain rule layer by layer in reverse. It turns a single backward pass through the computation graph into exact gradients for all weights simultaneously.
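
To make the "chain rule in reverse" concrete, here is a small sketch on an assumed two-layer network (tanh hidden layer, no biases, random data; none of this comes from the answer above). Each backward line undoes one forward line, and a finite-difference check confirms the hand-derived gradient for `W1`.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 3))
y = rng.standard_normal((6, 1))
W1 = rng.standard_normal((3, 4)) * 0.5
W2 = rng.standard_normal((4, 1)) * 0.5

def loss_of(W1, W2):
    h = np.tanh(x @ W1)
    pred = h @ W2
    return np.mean((pred - y) ** 2)

# Forward pass, keeping the intermediates the backward pass reuses.
h = np.tanh(x @ W1)
pred = h @ W2

# Backward pass: one chain-rule step per forward step, in reverse order.
d_pred = 2.0 * (pred - y) / len(x)      # dL/dpred for the mean squared error
dW2 = h.T @ d_pred                      # dL/dW2
d_h = d_pred @ W2.T                     # dL/dh
d_z1 = d_h * (1.0 - h ** 2)             # tanh'(z) = 1 - tanh(z)^2
dW1 = x.T @ d_z1                        # dL/dW1

def numeric_grad_W1(eps=1e-6):
    """Central finite differences, one entry of W1 at a time."""
    g = np.zeros_like(W1)
    for idx in np.ndindex(*W1.shape):
        Wp = W1.copy(); Wp[idx] += eps
        Wm = W1.copy(); Wm[idx] -= eps
        g[idx] = (loss_of(Wp, W2) - loss_of(Wm, W2)) / (2 * eps)
    return g

max_err = np.max(np.abs(dW1 - numeric_grad_W1()))
print(max_err)
```

The numeric check needs two forward passes per weight; the backward pass gets every gradient from one sweep, which is the point of the algorithm.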

The training loop — Deep Learning

You’ve met the parts — tensors, autograd, activations, a loss function. But a pile of parts is not a model that learns. The thing that turns parts into learning is a short ritual you will type ten thousand times in your career, and it is only five steps long:

1. forward    pred = model(x)            # run the network
2. loss       L = loss_fn(pred, y)       # how wrong was it?
3. backward   L.backward()               # autograd fills every .grad
4. step       optimizer.step()           # nudge each weight downhill
5. zero       optimizer.zero_grad()      # wipe grads for the next round

Steps 1–2 are the forward pass: predict, then score. Step 3 is the backward pass: find the slope of the loss with respect to every weight. Steps 4–5 take one small step down that slope, then reset the gradients. Run that loop enough times on enough data and the weights drift to values that make the loss small. That drift is learning.

One iteration: forward to a loss, backward to gradients, step the weights, zero the grads, repeat.


Watch the loss fall — then break it on purpose

[Interactive demo: a model learning y = 2x + 1 from w = 0, b = 0 with lr = 0.30. It advances one micro-step at a time through forward (pred = w·x + b), loss (mean (pred − y)²), backward (fill .grad), step (w −= lr·grad), and zero_grad (reset .grad), and plots the loss over iterations. The run starts at loss 2.945 and heads toward w = 2.000, b = 1.000. Switching zero_grad off lets the gradient buffer accumulate until training explodes.]

Each iteration: forward predicts, loss scores, backward fills .grad, step nudges the weights, zero_grad resets. Over many iterations w → 2 and b → 1.
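
If you want to try the "break it on purpose" experiment outside the browser, here is a rough NumPy sketch. The data (21 evenly spaced, noise-free points), the `train` helper, and the `gw`, `gb` stand-ins for the `.grad` buffers are assumptions for illustration; only the target y = 2x + 1, the zero start, and lr = 0.30 come from the demo description.

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 21)
y = 2.0 * x + 1.0                       # noise-free target: w = 2, b = 1

def train(zero_each_iter, iters=300, lr=0.30):
    w, b = 0.0, 0.0
    gw, gb = 0.0, 0.0                   # stand-ins for the .grad buffers
    losses = []
    for _ in range(iters):
        err = w * x + b - y             # 1. forward
        losses.append(np.mean(err ** 2))    # 2. loss
        gw += np.mean(2.0 * err * x)    # 3. backward ADDS into the buffers
        gb += np.mean(2.0 * err)
        w -= lr * gw                    # 4. step
        b -= lr * gb
        if zero_each_iter:
            gw, gb = 0.0, 0.0           # 5. zero_grad
    return np.array(losses)

with_zero = train(zero_each_iter=True)
without_zero = train(zero_each_iter=False)
print(f"with zero_grad:    final loss {with_zero[-1]:.2e}")
print(f"without zero_grad: mean loss over last 100 iters {without_zero[-100:].mean():.3f}")
```

In this particular noise-free, full-batch setup the run without zeroing never settles: each step is driven by the sum of all past gradients, so the weights keep overshooting and the loss keeps swinging instead of shrinking. Other data or settings may behave worse.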

Build it for real (in NumPy)

PyTorch’s loss.backward() is convenient, but the loop has no magic in it. Here is the entire ritual on a tiny linear-regression problem, with the gradient computed by hand so you can see exactly what step() consumes. Run it and watch the loss drop each epoch.

import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = 2x + 1 with a little noise. The model must discover w=2, b=1.
X = rng.uniform(-1, 1, size=(64, 1))
y = 2.0 * X + 1.0 + 0.05 * rng.standard_normal((64, 1))

# Parameters we will learn (the "model"): start at zero.
w = np.zeros((1, 1))
b = np.zeros((1, 1))
lr = 0.1

for epoch in range(20):
    # 1. forward: prediction
    pred = X @ w + b
    # 2. loss: mean squared error
    loss = np.mean((pred - y) ** 2)
    # 3. backward: gradient of MSE wrt w and b (this is what autograd computes)
    grad = 2 * (pred - y) / len(X)        # dL/dpred
    dw = X.T @ grad
    db = grad.sum(axis=0, keepdims=True)
    # 4. step: move each parameter a little downhill
    w -= lr * dw
    b -= lr * db
    # 5. zero_grad happens automatically here — we recompute dw, db next loop
    if epoch % 4 == 0:
        print(f"epoch {epoch:2d}  loss {loss:.4f}  w {w[0,0]:.3f}  b {b[0,0]:.3f}")

print(f"\nlearned  w={w[0,0]:.3f}  b={b[0,0]:.3f}   (target w=2, b=1)")

epoch  0  loss 2.3581  w 0.137  b 0.195
epoch  4  loss 0.9606  w 0.600  b 0.660
epoch  8  loss 0.4799  w 0.950  b 0.853
epoch 12  loss 0.2606  w 1.214  b 0.935
epoch 16  loss 0.1459  w 1.412  b 0.970

learned  w=1.528  b=0.983   (target w=2, b=1)

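The loss falls, and w and b crawl toward 2 and 1. After 20 epochs b is nearly there but w has only reached 1.528, so closing the gap would take more epochs or a larger learning rate. That is the core of supervised learning: a loop that keeps nudging parameters in the direction that shrinks the loss.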
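
For reference, the three gradient lines in the NumPy loop above follow from the chain rule applied to the mean squared error. A short derivation, where n is the number of samples (64 here), \hat{y}_i is the entry of `pred` for sample i, and g is the array the code calls `grad`:

```latex
\begin{align}
L &= \frac{1}{n}\sum_{i=1}^{n} (\hat{y}_i - y_i)^2, \qquad \hat{y}_i = w\,x_i + b \\
g_i &= \frac{\partial L}{\partial \hat{y}_i} = \frac{2}{n}\,(\hat{y}_i - y_i) \\
\frac{\partial L}{\partial w} &= \sum_{i=1}^{n} g_i\,\frac{\partial \hat{y}_i}{\partial w}
  = \sum_{i=1}^{n} x_i\, g_i = X^{\top} g \\
\frac{\partial L}{\partial b} &= \sum_{i=1}^{n} g_i\,\frac{\partial \hat{y}_i}{\partial b}
  = \sum_{i=1}^{n} g_i
\end{align}
```

The last two lines are `dw = X.T @ grad` and `db = grad.sum(axis=0, keepdims=True)` in the code.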

The same loop in PyTorch

In real code, autograd computes the gradients for you and an optimizer holds the update rule. The five steps map one-to-one:

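import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Assumed toy data (the same y = 2x + 1 problem) so the loop runs on its own.
X = torch.rand(64, 1) * 2 - 1
y = 2.0 * X + 1.0 + 0.05 * torch.randn(64, 1)
loader = DataLoader(TensorDataset(X, y), batch_size=16, shuffle=True)

model = nn.Linear(1, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

for epoch in range(20):
    for xb, yb in loader:          # one batch at a time
        pred = model(xb)           # 1. forward
        loss = loss_fn(pred, yb)   # 2. loss
        loss.backward()            # 3. backward: fills p.grad for every param
        opt.step()                 # 4. step: uses p.grad to update p
        opt.zero_grad()            # 5. zero: clear .grad for the next batch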

Epochs, batches, and the train/val split

Two structural details turn the bare loop into real training:

Batches and epochs. You rarely feed all data at once. You split it into batches (say 32 examples), run the five steps per batch, and one full pass over the dataset is one epoch. You train for many epochs. Batch size and learning rate interact in ways worth their own lesson.
Train vs validation. You train on one split and watch a held-out validation split to catch overfitting. Two switches matter here:

model.train()                  # dropout/BatchNorm in TRAINING mode
# ... training loop ...

model.eval()                   # dropout off, BatchNorm uses running stats
with torch.no_grad():          # don't build the autograd graph — faster, less memory
    val_loss = loss_fn(model(x_val), y_val)
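
To show how batches, epochs, and a validation split fit around the five steps, here is a sketch in plain NumPy that reuses the y = 2x + 1 problem. The 64/16 split, batch size of 16, and 30 epochs are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(80, 1))
y = 2.0 * X + 1.0 + 0.05 * rng.standard_normal((80, 1))

# Hold out the last 16 examples for validation (an assumed 80/20 split).
X_train, y_train = X[:64], y[:64]
X_val, y_val = X[64:], y[64:]

w, b = np.zeros((1, 1)), np.zeros((1, 1))
lr, batch_size = 0.1, 16
val_history = []

for epoch in range(30):                             # one epoch = one pass over X_train
    order = rng.permutation(len(X_train))           # reshuffle every epoch
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        xb, yb = X_train[idx], y_train[idx]
        grad = 2 * (xb @ w + b - yb) / len(xb)      # forward + backward on one batch
        w -= lr * (xb.T @ grad)                     # step
        b -= lr * grad.sum(axis=0, keepdims=True)
    # Validation: forward pass only, never a parameter update.
    val_history.append(float(np.mean((X_val @ w + b - y_val) ** 2)))

print(f"val loss: first epoch {val_history[0]:.4f}, last epoch {val_history[-1]:.4f}")
```

With hand-written gradients there is nothing to switch off for validation. In PyTorch, model.eval() and torch.no_grad() play that role.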

In one breath

Every training loop is five steps: forward (predict) → loss (score) → backward (gradients) → step (nudge weights downhill) → zero_grad (reset).
Repeated over batches and epochs, that drift of the weights toward a smaller loss is learning.
zero_grad is mandatory in this loop: backward adds to .grad, so skipping it makes every step use the running sum of all past gradients, and training oscillates or diverges (sometimes all the way to NaN) with no error message.
Train on one split and watch a held-out validation split; switch to model.eval() + torch.no_grad() to evaluate without dropout or graph overhead.
When a model won’t learn, overfit a single batch to ~0 loss first — if it can’t, the bug is in the loop, data, or shapes, not the hyperparameters.
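
The single-batch check from the last point can be sketched like this. The `overfit_one_batch` helper, the 8-point noise-free batch, and the 1e-6 threshold are illustrative assumptions; the batch is chosen so the linear model can fit it exactly.

```python
import numpy as np

def overfit_one_batch(xb, yb, lr=0.1, steps=500):
    """Train on the same batch repeatedly and return the loss history."""
    w, b = np.zeros((1, 1)), np.zeros((1, 1))
    losses = []
    for _ in range(steps):
        pred = xb @ w + b                               # forward
        losses.append(float(np.mean((pred - yb) ** 2))) # loss
        grad = 2 * (pred - yb) / len(xb)                # backward
        w -= lr * (xb.T @ grad)                         # step
        b -= lr * grad.sum(axis=0, keepdims=True)
    return losses

# One fixed, noise-free batch of 8 points the model can represent exactly.
xb = np.linspace(-1, 1, 8).reshape(-1, 1)
yb = 2.0 * xb + 1.0

losses = overfit_one_batch(xb, yb)
print(f"start {losses[0]:.4f}  end {losses[-1]:.2e}")
assert losses[-1] < 1e-6, "cannot memorise one batch: check the loop, data, and shapes"
```

If the assertion fails on a batch the model should be able to memorise, tuning hyperparameters is unlikely to help until the loop itself is fixed.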

Quick check


Q1. What does optimizer.zero_grad() do, and why is it needed?

Q2. What is the correct order of the five training-loop steps?

Q3. Why wrap validation in `with torch.no_grad()` and call model.eval()?

You now have the spine every other lesson hangs on. Next we open up step 3 — backprop by hand — to see exactly how backward() computes those gradients, then study the choices that make the loop converge: weight initialization, optimizers, and learning-rate schedules.
