Walk me through how backpropagation works.

Backpropagation computes the gradient of the loss with respect to every parameter by applying the chain rule backward through the network, reusing intermediate results from the forward pass. These gradients are then used by an optimizer to update the weights via gradient descent.

What is backpropagation and how does the chain rule make it work?

Backpropagation is the algorithm that computes the gradient of the loss with respect to every parameter by applying the chain rule layer by layer in reverse. It turns a single backward pass through the computation graph into exact gradients for all weights simultaneously.

In a PyTorch training loop, why do you need to call optimizer.zero_grad() before backpropagation?

PyTorch accumulates gradients by default, adding new gradients to whatever is already stored in each parameter's .grad. If you do not zero them out each iteration, gradients from previous batches mix with the current batch and corrupt the weight updates. zero_grad() resets gradients to zero so each step uses only the current batch's signal.
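The accumulation rule can be sketched in plain Python. This is only a toy: `ToyParam`, its methods, and the two sample batches are hypothetical stand-ins made up for illustration, not PyTorch classes. The one thing they mimic is that each backward call adds into the stored gradient.

```python
class ToyParam:
    """Hypothetical stand-in for a parameter whose .grad accumulates."""

    def __init__(self, value):
        self.value = value
        self.grad = 0.0

    def backward(self, x, y):
        # Loss for one sample: (value * x - y)^2, so dL/dvalue = 2 * (value*x - y) * x.
        # Mimic the accumulate-by-default rule: ADD to the stored gradient.
        self.grad += 2.0 * (self.value * x - y) * x

    def zero_grad(self):
        self.grad = 0.0


batches = [(1.0, 2.0), (2.0, 4.0)]  # (x, y) pairs, made up for the demo

stale = ToyParam(0.5)
for x, y in batches:
    stale.backward(x, y)      # never zeroed: batch 2 lands on top of batch 1
stale_grad = stale.grad

fresh = ToyParam(0.5)
for x, y in batches:
    fresh.zero_grad()         # reset first, so .grad reflects this batch only
    fresh.backward(x, y)
fresh_grad = fresh.grad

print(stale_grad, fresh_grad)
```

Without the reset, the final gradient is the sum of both batches' gradients rather than the second batch's alone, so an update based on it would be too large and partly out of date.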

What is the vanishing gradient problem and how do you fix it?

Vanishing gradients occur when repeated multiplication of small derivatives during backpropagation drives gradients toward zero, starving early layers of learning signal. The main fixes are better activations (ReLU/GELU), residual connections, batch normalization, and careful weight initialization.
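A rough illustration of the multiplication effect, under simplifying assumptions that are not from the text: a chain of scalar activations with every weight fixed at 1 and a starting input of 0.5. The sigmoid's derivative never exceeds 0.25, so twenty of them multiplied together leave almost nothing, while ReLU's derivative of 1 on positive inputs passes the gradient through intact.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def input_gradient(depth, activation):
    """d(output)/d(input) for `depth` stacked activations, all weights assumed = 1."""
    h, grad = 0.5, 1.0
    for _ in range(depth):
        if activation == "sigmoid":
            out = sigmoid(h)
            local = out * (1.0 - out)       # sigmoid'(h), never above 0.25
        else:  # "relu"
            out = max(h, 0.0)
            local = 1.0 if h > 0 else 0.0   # relu'(h)
        grad *= local                       # chain rule: multiply local derivatives
        h = out
    return grad


sig_grad = input_gradient(20, "sigmoid")
relu_grad = input_gradient(20, "relu")
print(f"sigmoid: {sig_grad:.3e}   relu: {relu_grad:.3e}")
```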

Backprop by hand — Deep Learning

In the training loop, loss.backward() quietly filled .grad on every parameter, and we trusted it. That trust is the most common source of silent training bugs — because when you don’t know what backward does, you can’t tell why it broke. So let’s remove the magic. By the end of this lesson you’ll have written the whole engine yourself, in about thirty lines, and backward() will never feel mysterious again.

The one idea you need: backpropagation is the chain rule, applied mechanically, one operation at a time, from the loss backward to the inputs.


Backprop is the chain rule, walked backward through the graph

One neuron: z = w·x + b, a = σ(z), L = (a − y)². The forward pass fills in each node's value from left to right. The backward pass then forms each gradient as the downstream gradient × the local derivative on the edge. That product, node by node, is backprop.

Demo defaults: weight w = 0.6, input x = 1.5, bias b = −0.3, target y = 1. With these values the forward pass gives loss L = 0.1256, and the backward pass then yields ∂L/∂w and ∂L/∂b.

This is what the optimizer uses: it nudges w ← w − η·∂L/∂w and b ← b − η·∂L/∂b to push the loss downhill.
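The whole cycle for this neuron fits in a few lines of plain Python. The sketch below uses the demo's default inputs (w = 0.6, x = 1.5, b = −0.3, y = 1) and takes σ to be the logistic sigmoid. The learning rate `eta = 0.1` and the helper name `forward_backward` are assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward_backward(w, x, b, y):
    # Forward pass: compute every node's value, left to right.
    z = w * x + b
    a = sigmoid(z)
    loss = (a - y) ** 2
    # Backward pass: downstream gradient times local derivative, right to left.
    dL_da = 2.0 * (a - y)            # d(a - y)^2 / da
    dL_dz = dL_da * a * (1.0 - a)    # sigmoid'(z) = a * (1 - a)
    dL_dw = dL_dz * x                # dz/dw = x
    dL_db = dL_dz * 1.0              # dz/db = 1
    return loss, dL_dw, dL_db


w, x, b, y = 0.6, 1.5, -0.3, 1.0
eta = 0.1                            # learning rate: an assumed value for this sketch

loss_before, dL_dw, dL_db = forward_backward(w, x, b, y)
w_new = w - eta * dL_dw              # the update rule from the text
b_new = b - eta * dL_db
loss_after, _, _ = forward_backward(w_new, x, b_new, y)

print(f"loss before: {loss_before:.4f}")
print(f"dL/dw={dL_dw:.4f}  dL/db={dL_db:.4f}")
print(f"loss after one step: {loss_after:.4f}")
```

Both gradients come out negative here, so the update raises w and b, pushes a toward the target, and lowers the loss.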


The chain rule, as a graph

Every computation is a graph: inputs flow through operations to a final number (the loss). The forward pass computes each node’s value. The backward pass computes each node’s gradient — how much the loss would change if that node nudged a little.

The trick that makes it tractable: to get a node’s gradient, you only need two things — the gradient of the node downstream of it, and the local derivative of the operation on the edge between them. Multiply them. That’s the chain rule:

node.grad  =  downstream.grad  ×  (local derivative on the edge)
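As a worked instance, here is that rule applied edge by edge to a single neuron with z = w·x + b, a = σ(z), L = (a − y)², assuming σ is the logistic sigmoid (so σ′(z) = a(1 − a)). Each line takes the gradient from the line above and multiplies it by one local derivative:

```latex
\begin{aligned}
\frac{\partial L}{\partial a} &= 2\,(a - y) \\
\frac{\partial L}{\partial z} &= \frac{\partial L}{\partial a}\cdot\frac{\partial a}{\partial z}
                               = 2\,(a - y)\,a\,(1 - a) \\
\frac{\partial L}{\partial w} &= \frac{\partial L}{\partial z}\cdot\frac{\partial z}{\partial w}
                               = 2\,(a - y)\,a\,(1 - a)\,x \\
\frac{\partial L}{\partial b} &= \frac{\partial L}{\partial z}\cdot\frac{\partial z}{\partial b}
                               = 2\,(a - y)\,a\,(1 - a)
\end{aligned}
```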

Walk that rule from the loss backward through every edge and you’ve computed every gradient in the graph. Let’s make it concrete: build the engine, run it on a single neuron, and watch the gradients fall out of the mechanics.

Build the engine

Here is a complete scalar autograd engine — Karpathy’s micrograd, distilled. Each Value remembers the children that produced it and a tiny _backward closure that knows its local derivative. Calling .backward() on the final node walks the graph in reverse and runs every closure — and it computes the same gradients PyTorch would:

class Value:
    """A single scalar that tracks how it was computed, so it can backprop."""
    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0                 # dL/dthis — filled by backward()
        self._backward = lambda: None   # how to push grad to my inputs
        self._prev = set(_children)

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            # d(a+b)/da = 1, d(a+b)/db = 1  → just pass the grad through
            self.grad  += 1.0 * out.grad
            other.grad += 1.0 * out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            # d(a*b)/da = b, d(a*b)/db = a
            self.grad  += other.data * out.grad
            other.grad += self.data  * out.grad
        out._backward = _backward
        return out

    def relu(self):
        out = Value(self.data if self.data > 0 else 0.0, (self,))
        def _backward():
            # derivative of relu is 1 where input > 0, else 0
            self.grad += (1.0 if self.data > 0 else 0.0) * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # 1) topological order: every node after its inputs
        topo, seen = [], set()
        def build(v):
            if v not in seen:
                seen.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)
        build(self)
        # 2) seed: dL/dL = 1, then run closures in reverse
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

# A tiny network: a = relu(w*x + b), then squared error against target 1.0
x, w, b = Value(2.0), Value(-3.0), Value(1.0)
z = w * x + b            # = -5
a = z.relu()             # = 0  (relu killed it)
loss = (a + Value(-1.0)) * (a + Value(-1.0))   # (a - 1)^2

loss.backward()
print(f"forward:  z={z.data}  a={a.data}  loss={loss.data}")
print(f"grads:    dL/dw={w.grad}  dL/dx={x.grad}  dL/db={b.grad}")

forward:  z=-5.0  a=0.0  loss=1.0
grads:    dL/dw=0.0  dL/dx=0.0  dL/db=0.0

Notice w.grad is 0. That’s not a bug — z was negative, ReLU output 0, and ReLU’s derivative is 0 there, so no gradient flows back through w. That is a dead ReLU, and seeing it fall out of the mechanics is the whole point: backprop isn’t a formula you memorize, it’s a graph walk you can reason about.

Each edge multiplies the downstream gradient by its local derivative; a node sums contributions from every path it feeds.
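To see the dead ReLU against a live one, the same chain rule can be applied by hand in a small function. This is a sketch: `relu_neuron_grads` is a hypothetical helper, and the "alive" case simply flips the sign of w, which is an assumption added for contrast and not part of the lesson's example.

```python
def relu_neuron_grads(w, x, b, target=1.0):
    """Hand-applied chain rule for loss = (relu(w*x + b) - target)^2."""
    z = w * x + b
    a = z if z > 0 else 0.0
    loss = (a - target) ** 2
    dL_da = 2.0 * (a - target)
    dL_dz = dL_da if z > 0 else 0.0   # ReLU gate: zero here blocks everything upstream
    dL_dw = dL_dz * x                 # dz/dw = x
    dL_db = dL_dz                     # dz/db = 1
    return loss, dL_dw, dL_db


dead = relu_neuron_grads(w=-3.0, x=2.0, b=1.0)   # z = -5, the case from the lesson
alive = relu_neuron_grads(w=3.0, x=2.0, b=1.0)   # z = 7, sign of w flipped
print("dead: ", dead)
print("alive:", alive)
```

In the dead case the loss is nonzero but both gradients are zero, so no gradient step can move w or b. In the live case the same upstream gradient reaches both parameters.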

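Two details that trip everyone up

Gradients accumulate. A node that feeds several paths receives one contribution per path, and its gradient is their sum, which is why every _backward uses += rather than =.

Order matters. backward() runs the closures in reverse topological order, so a node's gradient is final before it is pushed on to that node's inputs.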
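The accumulation detail is easiest to see when one value feeds the same operation twice, as in y = x · x. The sketch below uses a stripped-down, hypothetical `Node` (multiplication only, not the lesson's Value class) with a switch between `+=` and `=`:

```python
class Node:
    """Hypothetical stripped-down scalar node: multiplication only."""

    def __init__(self, data, parents=()):
        self.data, self.grad, self.parents = data, 0.0, parents


def mul(a, b):
    return Node(a.data * b.data, (a, b))


def backprop_mul(out, accumulate):
    """Push out.grad to both inputs of a product node."""
    a, b = out.parents
    for node, local in ((a, b.data), (b, a.data)):
        if accumulate:
            node.grad += local * out.grad   # sum over every path
        else:
            node.grad = local * out.grad    # overwrite: later path erases the earlier one


def grad_of_square(x_value, accumulate):
    x = Node(x_value)
    y = mul(x, x)          # y = x * x, so x feeds the product twice
    y.grad = 1.0           # seed dy/dy = 1
    backprop_mul(y, accumulate)
    return x.grad


correct = grad_of_square(3.0, accumulate=True)
wrong = grad_of_square(3.0, accumulate=False)
print(correct, wrong)
```

With `+=` the result matches d(x²)/dx = 2x. With `=` the second contribution overwrites the first and half the gradient is lost.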

This is exactly what PyTorch does

Swap scalars for tensors and _backward closures for optimized C++/CUDA kernels, and you have torch.autograd. The graph is built as you run the forward pass (define-by-run), and loss.backward() walks it in reverse — the same algorithm you just wrote. PyTorch exists to do this fast and on GPUs, not to do something different.

# the engine you just built, in PyTorch:
import torch

x = torch.tensor(2.0, requires_grad=True)
y = (3 * x ** 2).relu()
y.backward()       # walks the graph in reverse, fills x.grad
print(x.grad)

tensor(12.)

The same number the chain rule gives by hand: d(3x²)/dx = 6x = 12 at x = 2, with the ReLU passing it through unchanged because its input is positive.
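A quick way to sanity-check any gradient, whatever engine produced it, is a finite-difference estimate. The sketch below checks the same function in plain Python. The helper names and the step size `h = 1e-5` are assumptions chosen for illustration.

```python
def f(x):
    """Same function as the PyTorch snippet: relu(3 * x^2)."""
    v = 3.0 * x ** 2
    return v if v > 0 else 0.0


def numerical_grad(fn, x, h=1e-5):
    # Central difference: (f(x + h) - f(x - h)) / (2h)
    return (fn(x + h) - fn(x - h)) / (2.0 * h)


approx = numerical_grad(f, 2.0)
print(f"numerical df/dx at x=2: {approx:.6f}")
```

If an analytic gradient and this estimate disagree by more than rounding error, the bug is usually in a local derivative or in a missing accumulation.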

In one breath

Backpropagation is the chain rule applied mechanically, one operation at a time, from the loss backward to the inputs.
Each node’s gradient = the downstream gradient × the local derivative on the edge; a node feeding several paths sums them — which is why every _backward uses +=.
backward() runs the per-op closures in reverse topological order, so a node’s gradient is read only once everything downstream of it is final.
A dead ReLU passes zero gradient: with a negative input, nothing flows back through it — you can read that straight off the graph.
Swap scalars for tensors and the closures for CUDA kernels and you have torch.autograd — the same algorithm, built for speed.

Quick check

Q1. In backprop, how is a single node's gradient computed?

Q2. Why does each _backward use += to update a child's .grad instead of =?

Q3. Why does backward() process nodes in reverse topological order?

You can now look any gradient bug in the face and read where it came from. Next come the choices that decide whether those gradients actually help: weight initialization (so the first gradients aren't dead or exploding) and vanishing and exploding gradients (when the chain rule multiplies too many small or large numbers together).
