What is backpropagation and how does the chain rule make it work?
Backpropagation is the algorithm that computes the gradient of the loss with respect to every parameter by applying the chain rule layer by layer in reverse. It turns a single backward pass through the computation graph into exact gradients for all weights simultaneously.
How to think about it
Chain rule in one line: if L depends on z which depends on w, then:
dL/dw = (dL/dz) · (dz/dw)
Backprop applies this recursively from the output layer back to the input, re-using gradients already computed in higher layers — this is the key efficiency win. Computing all gradients from scratch for each parameter independently would cost O(parameters) forward passes; backprop costs exactly one forward pass + one backward pass.
Layer-by-layer intuition for a dense layer:
# forward
z = W · a_prev + b
a = relu(z)
# backward (gradients flow in)
# dL/da arrives from the layer above
da_dz = (z > 0).float() # ReLU derivative
dL_dz = dL_da * da_dz # chain rule through activation
dL_dW = dL_dz @ a_prev.T # gradient w.r.t. weights
dL_db = dL_dz.sum(dim=0) # gradient w.r.t. bias
dL_da_prev = W.T @ dL_dz # gradient passed to layer below
Each step multiplies the “upstream gradient” dL/da by the local Jacobian of that operation. PyTorch autograd does exactly this, using the computation graph built during the forward pass.
Why one backward pass suffices: because gradients are accumulated as they flow backward, each intermediate gradient is computed once. The total cost is roughly twice a forward pass — one pass to cache activations, one pass to accumulate gradients.
loss = criterion(model(x), y)
loss.backward() # triggers full backprop on the graph
optimizer.step() # applies accumulated gradients to weights
optimizer.zero_grad()