The training loop
Forward, loss, backward, step, zero-grad — the five-line ritual that turns a pile of layers into a model that learns. Built and run from scratch.
What you'll learn
- The five steps every training loop runs, and what each one does
- Why forgetting zero_grad() silently breaks training
- The epoch / batch structure and the train vs eval split
Before you start
You’ve met the parts — tensors, autograd, activations, a loss function. But a pile of parts is not a model that learns. The thing that turns parts into learning is a short ritual you will type ten thousand times in your career, and it is only five steps long:
1. forward    pred = model(x)          # run the network
2. loss       L = loss_fn(pred, y)     # how wrong was it?
3. backward   L.backward()             # autograd fills every .grad
4. step       optimizer.step()         # nudge each weight downhill
5. zero       optimizer.zero_grad()    # wipe grads for the next round
Steps 1–2 are the forward pass: predict, then score. Step 3 is the backward pass: autograd finds the slope of the loss with respect to every weight. Step 4 takes one small step down that slope, and step 5 resets the gradients for the next round. Run that loop enough times on enough data and the weights drift to values that make the loss small. That drift is learning.
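For readers who prefer a formula, step 4 can be written as a single update rule. This assumes plain stochastic gradient descent, the optimizer used in this lesson's examples; other optimizers apply different rules. The symbols are notation introduced here for illustration: \theta stands for all the weights, \eta for the learning rate, and L for the loss.

```latex
\theta \leftarrow \theta - \eta \, \nabla_{\theta} L(\theta)
```

The gradient points uphill, so subtracting a small multiple of it moves the weights toward lower loss.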
Step through one iteration at a time. Watch the loss fall as the weights move —
then flip off zero_grad() and watch the gradient pile up and blow training
apart.
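The effect of skipping step 5 can be reproduced without PyTorch. The sketch below is an illustration under stated assumptions, not PyTorch's internals: `grad_w` and `grad_b` are hypothetical stand-ins for the `.grad` buffers, the data follows y = 2x + 1, and the helper name `train` is made up for this example. On this small convex problem the stale gradients behave like undamped momentum, so the loss keeps swinging back up instead of settling. On a real network the damage can be worse.

```python
import numpy as np

def train(zero_each_step, steps=40, lr=0.1):
    """Fit y = 2x + 1 with a 1-D linear model; return the loss seen at each step."""
    x = np.linspace(-1.0, 1.0, 50)
    y = 2.0 * x + 1.0
    w, b = 0.0, 0.0
    grad_w, grad_b = 0.0, 0.0                 # stand-ins for the .grad buffers
    losses = []
    for _ in range(steps):
        err = (w * x + b) - y                 # 1. forward
        losses.append(float(np.mean(err ** 2)))   # 2. loss
        grad_w += np.mean(2.0 * err * x)      # 3. backward ADDS into the buffer
        grad_b += np.mean(2.0 * err)
        w -= lr * grad_w                      # 4. step
        b -= lr * grad_b
        if zero_each_step:                    # 5. zero (skipped in the broken run)
            grad_w, grad_b = 0.0, 0.0
    return losses

good = train(zero_each_step=True)
bad = train(zero_each_step=False)
print("with zeroing,    last loss:", round(good[-1], 4))
print("without zeroing, worst loss in second half:", round(max(bad[20:]), 4))
```

Nothing raises an error in the broken run, which is why the bug is easy to miss: the only symptom is a loss curve that refuses to settle.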
Build it for real (in NumPy)
PyTorch’s loss.backward() is convenient, but the loop has no magic in it.
Here is the entire ritual on a tiny linear-regression problem, with the
gradient computed by hand so you can see exactly what step() consumes. Run it
and watch the loss drop each epoch.
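The runnable cell from the original lesson is not reproduced here, so what follows is a minimal sketch of what such a loop might look like, not the author's exact code. The synthetic data (y = 2x + 1 plus a little noise), the seed, the learning rate, and the epoch count are all assumptions chosen for illustration. Each epoch here uses the full dataset as a single batch.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=100)
y = 2.0 * x + 1.0 + 0.05 * rng.normal(size=100)   # assumed target: w = 2, b = 1

w, b = 0.0, 0.0      # parameters start at zero
lr = 0.1             # learning rate (assumed value)
losses = []

for epoch in range(200):
    pred = w * x + b                      # 1. forward
    err = pred - y
    loss = np.mean(err ** 2)              # 2. loss (mean squared error)
    grad_w = np.mean(2.0 * err * x)       # 3. backward: dL/dw by hand
    grad_b = np.mean(2.0 * err)           #    and dL/db
    w -= lr * grad_w                      # 4. step
    b -= lr * grad_b
    # 5. zero: grad_w and grad_b are recomputed from scratch each epoch,
    #    so there is no buffer to wipe in this hand-written version.
    losses.append(float(loss))
    if epoch % 40 == 0:
        print(f"epoch {epoch:3d}  loss {loss:.4f}  w {w:.3f}  b {b:.3f}")

print(f"final: w {w:.3f}  b {b:.3f}")
```

The two gradient lines are the derivative of the mean squared error with respect to w and b. They are exactly the numbers an optimizer's step() would consume.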
The loss falls, and w and b crawl toward 2 and 1. That is the whole of
supervised learning — a loop that keeps nudging parameters in the direction that
shrinks the loss.
The same loop in PyTorch
In real code, autograd computes the gradients for you and an optimizer holds the update rule. The five steps map one-to-one:
import torch
from torch import nn

model = nn.Linear(1, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

# `loader` is assumed to be an iterable of (xb, yb) batches, e.g. a DataLoader
for epoch in range(20):
    for xb, yb in loader:            # one batch at a time
        pred = model(xb)             # 1. forward
        loss = loss_fn(pred, yb)     # 2. loss
        loss.backward()              # 3. backward: fills p.grad for every param
        opt.step()                   # 4. step: uses p.grad to update p
        opt.zero_grad()              # 5. zero: clear .grad before the next batch
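The snippet above leaves `loader` undefined. One way to make it runnable end to end is sketched below. The synthetic dataset (y = 2x + 1 without noise), the batch size of 32, and the seed are assumptions for illustration. `TensorDataset` and `DataLoader` come from `torch.utils.data`.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic data (an assumption for illustration): y = 2x + 1, no noise.
x = torch.linspace(-1.0, 1.0, 256).unsqueeze(1)   # shape (256, 1)
y = 2.0 * x + 1.0
loader = DataLoader(TensorDataset(x, y), batch_size=32, shuffle=True)

model = nn.Linear(1, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

for epoch in range(20):
    for xb, yb in loader:            # 8 batches of 32 per epoch
        pred = model(xb)             # 1. forward
        loss = loss_fn(pred, yb)     # 2. loss
        loss.backward()              # 3. backward
        opt.step()                   # 4. step
        opt.zero_grad()              # 5. zero

w = model.weight.item()
b = model.bias.item()
print(f"w={w:.3f}  b={b:.3f}")
```

With noise-free data the learned weight and bias should land close to 2 and 1, mirroring the hand-written NumPy version.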
Epochs, batches, and the train/val split
Two structural details turn the bare loop into real training:
- Batches and epochs. You rarely feed all data at once. You split it into batches (say 32 examples), run the five steps per batch, and one full pass over the dataset is one epoch. You train for many epochs. Batch size and learning rate interact in ways worth their own lesson.
- Train vs validation. You train on one split and watch a held-out validation split to catch overfitting. Two switches matter here:
model.train()                  # dropout/BatchNorm in TRAINING mode
# ... training loop ...
model.eval()                   # dropout off, BatchNorm uses running stats
with torch.no_grad():          # don't build the autograd graph: faster, less memory
    val_loss = loss_fn(model(x_val), y_val)
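The epoch / batch structure and the held-out split can be seen together in one small NumPy sketch. Everything specific here is an assumption for illustration: the synthetic data, the 80/20 split, the batch size of 32, and the 30 epochs. There is no dropout or BatchNorm in this toy model, so there is nothing for train() / eval() to switch. The validation step simply runs the forward pass and the loss without computing gradients or updating weights.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200)
y = 2.0 * x + 1.0 + 0.1 * rng.normal(size=200)

# Hold out the last 20% as a validation split; train on the rest.
x_train, y_train = x[:160], y[:160]
x_val, y_val = x[160:], y[160:]

w, b = 0.0, 0.0
lr, batch_size = 0.1, 32
val_history = []

for epoch in range(30):                            # one epoch = one full pass
    order = rng.permutation(len(x_train))          # reshuffle every epoch
    for start in range(0, len(order), batch_size): # 5 batches of 32
        idx = order[start:start + batch_size]
        xb, yb = x_train[idx], y_train[idx]
        err = (w * xb + b) - yb                    # forward + error on this batch
        w -= lr * np.mean(2.0 * err * xb)          # backward + step for w
        b -= lr * np.mean(2.0 * err)               # backward + step for b
    # Validation: forward pass and loss only, no gradients, no update.
    val_loss = float(np.mean(((w * x_val + b) - y_val) ** 2))
    val_history.append(val_loss)

print(f"first val loss {val_history[0]:.4f}, last val loss {val_history[-1]:.4f}")
```

If the validation loss started rising while the training loss kept falling, that gap would be the overfitting signal the held-out split exists to catch.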
Quick check
Next
You now have the spine every other lesson hangs on. Next we open up step 3,
backprop by hand, to see exactly how backward() computes those gradients, then
study the choices that make the loop converge: weight initialization,
optimizers, and learning-rate schedules.
Practice this in an interview
PyTorch accumulates gradients by default, adding new gradients to whatever is already stored in each parameter's .grad. If you do not zero them out each iteration, gradients from previous batches mix with the current batch and corrupt the weight updates. zero_grad() resets gradients to zero so each step uses only the current batch's signal.
Backpropagation computes the gradient of the loss with respect to every parameter by applying the chain rule backward through the network, reusing intermediate results from the forward pass. These gradients are then used by an optimizer to update the weights via gradient descent.
The forward pass feeds an input through every layer in sequence: each layer computes a linear transform followed by an activation, caching the intermediate values needed later for backpropagation. The final layer produces a prediction, which is compared to the label via a loss function.
Backpropagation is the algorithm that computes the gradient of the loss with respect to every parameter by applying the chain rule layer by layer in reverse. It turns a single backward pass through the computation graph into exact gradients for all weights simultaneously.
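The chain-rule description above can be checked on a network small enough to differentiate by hand. The sketch below is a toy built for illustration, not taken from the lesson: a two-weight model with h = tanh(w1 * x), y = w2 * h, and a squared-error loss, where the function names and the input values are all hypothetical. It compares the hand-derived gradients against a finite-difference estimate.

```python
import math

def loss_and_grads(w1, w2, x, target):
    # Forward pass: cache h because the backward pass reuses it.
    h = math.tanh(w1 * x)
    y = w2 * h
    loss = (y - target) ** 2
    # Backward pass: chain rule applied from the loss back to each weight.
    dL_dy = 2.0 * (y - target)
    dL_dw2 = dL_dy * h                       # y = w2 * h
    dL_dh = dL_dy * w2
    dL_dw1 = dL_dh * (1.0 - h ** 2) * x      # d tanh(u)/du = 1 - tanh(u)^2
    return loss, dL_dw1, dL_dw2

def numeric_grad(f, v, eps=1e-6):
    # Central finite difference, used only as an independent check.
    return (f(v + eps) - f(v - eps)) / (2.0 * eps)

w1, w2, x, target = 0.5, -1.5, 0.8, 1.0
loss, g1, g2 = loss_and_grads(w1, w2, x, target)
n1 = numeric_grad(lambda v: loss_and_grads(v, w2, x, target)[0], w1)
n2 = numeric_grad(lambda v: loss_and_grads(w1, v, x, target)[0], w2)
print(f"dL/dw1  chain rule {g1:.6f}  numeric {n1:.6f}")
print(f"dL/dw2  chain rule {g2:.6f}  numeric {n2:.6f}")
```

The cached value h appears in both gradients, which is the reuse of forward-pass intermediates the answers above refer to.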