datarekha

Backprop by hand

backward() is not magic — it's the chain rule applied mechanically over a graph. Build a 30-line autograd engine from scratch and watch gradients flow.

9 min read Intermediate Deep Learning Lesson 3 of 27

What you'll learn

  • Why backprop is just the chain rule applied node by node
  • How to build a scalar autograd engine (a tiny micrograd)
  • Why gradients are accumulated with += and computed in reverse topological order

Before you start

In the training loop, loss.backward() quietly filled .grad on every parameter, and we trusted it. That trust is the most common source of silent training bugs — because when you don’t know what backward does, you can’t tell why it broke. So let’s remove the magic. By the end of this lesson you’ll have written the whole engine yourself, in about thirty lines, and backward() will never feel mysterious again.

The one idea you need: backpropagation is the chain rule, applied mechanically, one operation at a time, from the loss backward to the inputs.

The chain rule, as a graph

Every computation is a graph: inputs flow through operations to a final number (the loss). The forward pass computes each node’s value. The backward pass computes each node’s gradient: how much the loss would change if that node’s value were nudged a little.

The trick that makes it tractable: to get a node’s gradient, you only need two things — the gradient of the node downstream of it, and the local derivative of the operation on the edge between them. Multiply them. That’s the chain rule:

node.grad  =  downstream.grad  ×  (local derivative on the edge)

Walk that rule from the loss backward through every edge and you’ve computed every gradient in the graph. On a single neuron, that means running the forward pass, then the backward pass, and watching each gradient appear as the downstream gradient times the local derivative on each edge.
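
To make the rule concrete, here is a minimal sketch of a single neuron worked by hand in plain Python. The input values, the target y, and the squared-error loss are illustrative assumptions, not taken from the lesson. Each backward line is the downstream gradient times one local derivative, and a finite-difference estimate cross-checks the result for w.

```python
# Hand-computed backprop for one neuron: loss = (relu(w*x + b) - y) ** 2
# All numeric values below are illustrative assumptions.
x, w, b, y = 2.0, 3.0, 1.0, 5.0

# Forward pass: compute every node's value, inputs first.
z = w * x + b              # pre-activation: 7.0
a = max(0.0, z)            # ReLU output: 7.0
loss = (a - y) ** 2        # squared error: 4.0

# Backward pass: node.grad = downstream.grad * local derivative.
loss_grad = 1.0                               # d(loss)/d(loss)
a_grad = loss_grad * 2.0 * (a - y)            # local derivative of (a - y)**2
z_grad = a_grad * (1.0 if z > 0 else 0.0)     # local derivative of ReLU
w_grad = z_grad * x                           # local derivative of w*x w.r.t. w
x_grad = z_grad * w                           # local derivative of w*x w.r.t. x
b_grad = z_grad * 1.0                         # local derivative of (+ b)

# Cross-check w_grad with a central finite difference.
def loss_fn(w_value):
    return (max(0.0, w_value * x + b) - y) ** 2

eps = 1e-6
w_numeric = (loss_fn(w + eps) - loss_fn(w - eps)) / (2 * eps)

print(w_grad, x_grad, b_grad)   # 8.0 12.0 4.0
print(w_numeric)                # approximately 8.0
```

Nothing here is specific to neurons: every gradient comes from one multiplication per edge, which is what makes the procedure easy to automate.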

Build the engine

Here is a complete scalar autograd engine — Karpathy’s micrograd, distilled. Each Value remembers the children that produced it and a tiny _backward closure that knows its local derivative. Calling .backward() on the final node walks the graph in reverse and runs every closure. Run it — it computes the same gradients PyTorch would.
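
The runnable cell did not survive in this copy of the lesson, so what follows is a reconstruction in the spirit of micrograd rather than the lesson's exact code. It is a minimal sketch supporting only +, × and ReLU, and it runs a little over thirty lines with comments. The input values (x = 2.0, w = -3.0, b = 1.0) are assumptions chosen so that z comes out negative.

```python
class Value:
    """A scalar that remembers how it was computed (micrograd-style sketch)."""

    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._prev = set(_children)        # the nodes that produced this one
        self._backward = lambda: None      # leaf nodes have nothing to propagate

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad          # local derivative of a + b is 1
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad   # d(a*b)/da = b
            other.grad += self.data * out.grad   # d(a*b)/db = a
        out._backward = _backward
        return out

    def relu(self):
        out = Value(max(0.0, self.data), (self,))
        def _backward():
            self.grad += (1.0 if out.data > 0 else 0.0) * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Order the graph so every node appears after the nodes it depends on.
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0                    # d(out)/d(out)
        for v in reversed(topo):           # walk from the output back to the inputs
            v._backward()


# One neuron: out = relu(w * x + b), with assumed values that make z negative.
x, w, b = Value(2.0), Value(-3.0), Value(1.0)
z = w * x + b
out = z.relu()
out.backward()

print(z.data, out.data)   # -5.0 0.0
print(w.grad)             # 0.0
```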

Notice w.grad is 0. That’s not a bug — z was negative, ReLU output 0, and ReLU’s derivative is 0 there, so no gradient flows back through w. That is a dead ReLU, and seeing it fall out of the mechanics is the whole point: backprop isn’t a formula you memorize, it’s a graph walk you can reason about.

[Figure: node → op (e.g. ×) → loss. The gradient flows back along each edge, multiplied by the local derivative: node.grad += downstream.grad × local_derivative]
Each edge multiplies the downstream gradient by its local derivative; a node sums contributions from every path it feeds.

Two details that trip everyone up
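
The learning goals at the top name the two details: gradients are accumulated with +=, and nodes are processed in reverse topological order. The sketch below illustrates both with a hypothetical, stripped-down Node class (multiplication only). The helper names topo_order and cube are made up for this example.

```python
class Node:
    """Hypothetical minimal autograd node supporting multiplication only."""

    def __init__(self, data, parents=()):
        self.data, self.grad, self.parents = data, 0.0, parents
        self._backward = lambda: None

    def __mul__(self, other):
        out = Node(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out


def topo_order(root):
    """Return nodes so that each one comes after everything it depends on."""
    order, seen = [], set()
    def visit(n):
        if id(n) not in seen:
            seen.add(id(n))
            for p in n.parents:
                visit(p)
            order.append(n)
    visit(root)
    return order


# Detail 1: accumulate with +=.
# x feeds the product twice, so two paths contribute to x.grad.
x = Node(3.0)
y = x * x
y.grad = 1.0
for n in reversed(topo_order(y)):
    n._backward()
print(x.grad)   # 6.0, i.e. 2 * x; overwriting with "=" would leave 3.0


# Detail 2: reverse topological order.
def cube(v):
    """Build out = (x * x) * x and return all three nodes."""
    x = Node(v)
    a = x * x
    return x, a, a * x

# Correct: a node runs only after every node downstream of it has run.
x_ok, a_ok, out_ok = cube(3.0)
out_ok.grad = 1.0
for n in reversed(topo_order(out_ok)):
    n._backward()
print(x_ok.grad)    # 27.0, i.e. 3 * x**2

# Wrong: a runs before out has filled in a.grad, so a passes back zero.
x_bad, a_bad, out_bad = cube(3.0)
out_bad.grad = 1.0
for n in (a_bad, out_bad, x_bad):
    n._backward()
print(x_bad.grad)   # 9.0, silently missing the path through a
```

In the wrong-order run nothing crashes. The gradient is simply too small, which is why ordering bugs are hard to spot without knowing what backward() does.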

This is exactly what PyTorch does

Swap scalars for tensors and _backward closures for optimized C++/CUDA kernels, and you have torch.autograd. The graph is built as you run the forward pass (define-by-run), and loss.backward() walks it in reverse — the same algorithm you just wrote. PyTorch exists to do this fast and on GPUs, not to do something different.

# the engine you just built, in PyTorch:
import torch

x = torch.tensor(2.0, requires_grad=True)
y = (3 * x ** 2).relu()
y.backward()       # walks the graph in reverse, fills x.grad
print(x.grad)      # dy/dx = 6x = 12.0 at x = 2
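
As a sanity check that needs no PyTorch, the same derivative can be estimated numerically. This is a minimal sketch; the helper name numerical_grad and the step size eps are assumptions made for illustration.

```python
def f(x):
    """The same function as the PyTorch snippet: relu(3 * x**2)."""
    return max(0.0, 3.0 * x ** 2)

def numerical_grad(fn, x, eps=1e-5):
    """Central finite difference: (fn(x + eps) - fn(x - eps)) / (2 * eps)."""
    return (fn(x + eps) - fn(x - eps)) / (2 * eps)

grad = numerical_grad(f, 2.0)
print(round(grad, 4))   # 12.0, matching dy/dx = 6x at x = 2
```

Comparing against a finite difference is a common way to check a hand-written backward pass: it is slow, but it depends only on the forward computation.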

Quick check

Q1. In backprop, how is a single node's gradient computed?
Q2. Why does each _backward use += to update a child's .grad instead of =?
Q3. Why does backward() process nodes in reverse topological order?

Next

You can now look any gradient bug in the face. Next come the choices that decide whether those gradients actually help: weight initialization (so the first gradients aren’t dead or exploding) and vanishing & exploding gradients (when the chain rule multiplies too many small or large numbers together).
