Backprop by hand
backward() is not magic — it's the chain rule applied mechanically over a graph. Build a 30-line autograd engine from scratch and watch gradients flow.
What you'll learn
- Why backprop is just the chain rule applied node by node
- How to build a scalar autograd engine (a tiny micrograd)
- Why gradients are accumulated with += and computed in reverse topological order
Before you start
In the training loop, loss.backward() quietly
filled .grad on every parameter, and we trusted it. That trust is the most
common source of silent training bugs — because when you don’t know what
backward does, you can’t tell why it broke. So let’s remove the magic. By the
end of this lesson you’ll have written the whole engine yourself, in about
thirty lines, and backward() will never feel mysterious again.
The one idea you need: backpropagation is the chain rule, applied mechanically, one operation at a time, from the loss backward to the inputs.
The chain rule, as a graph
Every computation is a graph: inputs flow through operations to a final number (the loss). The forward pass computes each node’s value. The backward pass computes each node’s gradient — how much the loss would change if that node nudged a little.
The trick that makes it tractable: to get a node’s gradient, you only need two things — the gradient of the node downstream of it, and the local derivative of the operation on the edge between them. Multiply them. That’s the chain rule:
node.grad = downstream.grad × (local derivative on the edge)
Walk that rule from the loss backward through every edge and you’ve computed every gradient in the graph. Step through it on a single neuron below — run the forward pass, then the backward pass, and watch each gradient appear as the downstream gradient times the local derivative on each edge.
Build the engine
Here is a complete scalar autograd engine — Karpathy’s micrograd, distilled.
Each Value remembers the children that produced it and a tiny _backward
closure that knows its local derivative. Calling .backward() on the final
node walks the graph in reverse and runs every closure. Run it — it computes
the same gradients PyTorch would.
Notice w.grad is 0. That’s not a bug — z was negative, ReLU output 0,
and ReLU’s derivative is 0 there, so no gradient flows back through w. That
is a dead ReLU, and seeing it fall out of the mechanics is the whole point:
backprop isn’t a formula you memorize, it’s a graph walk you can reason about.
Two details that trip everyone up
This is exactly what PyTorch does
Swap scalars for tensors and _backward closures for optimized C++/CUDA
kernels, and you have torch.autograd. The graph is built as you run the
forward pass (define-by-run), and loss.backward() walks it in reverse — the
same algorithm you just wrote. PyTorch exists to do this fast and on GPUs, not
to do something different.
# the engine you just built, in PyTorch:
x = torch.tensor(2.0, requires_grad=True)
y = (3 * x ** 2).relu()
y.backward() # walks the graph in reverse, fills x.grad
print(x.grad) # dy/dx = 12.0
Quick check
Quick check
Next
You can now read any gradient bug in the face. Next: the choices that decide whether those gradients actually help — weight initialization (so the first gradients aren’t dead or exploding) and vanishing & exploding gradients (when the chain rule multiplies too many small or large numbers together).
Practice this in an interview
All questionsBackpropagation is the algorithm that computes the gradient of the loss with respect to every parameter by applying the chain rule layer by layer in reverse. It turns a single backward pass through the computation graph into exact gradients for all weights simultaneously.
Backpropagation computes the gradient of the loss with respect to every parameter by applying the chain rule backward through the network, reusing intermediate results from the forward pass. These gradients are then used by an optimizer to update the weights via gradient descent.
PyTorch accumulates gradients by default, adding new gradients to whatever is already stored in each parameter's .grad. If you do not zero them out each iteration, gradients from previous batches mix with the current batch and corrupt the weight updates. zero_grad() resets gradients to zero so each step uses only the current batch's signal.
Vanishing gradients occur when repeated multiplication of small derivatives during backpropagation drives gradients toward zero, starving early layers of learning signal. The main fixes are better activations (ReLU/GELU), residual connections, batch normalization, and careful weight initialization.