Backprop is the chain rule with good bookkeeping

There is a moment in every intro neural-network course where the instructor draws a network, writes down a loss function, and then says something like “and then we run backpropagation to compute the gradients.” The word backpropagation is delivered as though it names a proprietary process, something that emerged from deep research and lives inside the framework. Students take it on faith. That faith is unnecessary.

Backpropagation is the chain rule. That is the entire algorithm. If you took calculus and remember the chain rule — that the derivative of a composed function is the product of derivatives — you already know backprop conceptually. What the algorithm adds is not new mathematics but careful bookkeeping: a specific order of operations that ensures you compute every local derivative once and reuse it everywhere it is needed.

Understanding this is not just satisfying. It changes how you debug training. When a network fails to learn, the question becomes “where does the gradient die, and why?” That question only makes sense if you see the network as a chain of functions, not a black box.

What training actually needs

A neural network is a function with millions of knobs — its weights. Training is optimization: adjust the knobs until the network’s output matches what you want. To do that with gradient descent (the dominant approach), you need to know, for each weight, how much the loss changes when that weight changes by a tiny amount. That quantity is the weight’s gradient.
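As a rough illustration of "how much the loss changes when that weight changes by a tiny amount", here is a minimal sketch that measures a gradient by nudging a weight. The one-weight model, the input of 2.0, and the target of 10.0 are made up for the example; real training does not compute gradients this way, because one nudge per weight is far too slow.

```python
def loss(w, x=2.0, target=10.0):
    """Squared error of a one-weight 'network' that predicts w * x (toy model)."""
    prediction = w * x
    return (prediction - target) ** 2


def numerical_gradient(f, w, eps=1e-6):
    """Central finite difference: nudge w both ways and measure the change in f."""
    return (f(w + eps) - f(w - eps)) / (2 * eps)


w = 3.0
grad = numerical_gradient(loss, w)

# Analytic answer for comparison: d/dw (w*x - t)^2 = 2 * (w*x - t) * x
analytic = 2 * (w * 2.0 - 10.0) * 2.0
```

The two numbers agree to within rounding error. The negative sign says that increasing this weight would reduce the loss.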

The problem is that a weight in layer two of a ten-layer network affects the loss only indirectly — through layers three through ten. Computing its gradient requires you to trace how a perturbation in that weight ripples forward through every subsequent layer until it finally disturbs the output and therefore the loss.

The chain rule is the mathematical tool for tracing that ripple.

The chain rule, stated plainly

If y depends on u, and u depends on x, then the rate at which y changes with x equals the rate at which y changes with u multiplied by the rate at which u changes with x. In symbols: dy/dx = (dy/du) * (du/dx).

Stack three functions instead of two, and the chain extends: dy/dx = (dy/dz) * (dz/du) * (du/dx). Stack fifty functions, as in a deep network, and the chain is fifty terms long.
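The two-function case can be checked numerically. In this sketch the functions u(x) = 3x + 1 and y(u) = sin(u) are arbitrary choices for illustration; the point is only that the product of the two local derivatives matches a direct measurement on the composed function.

```python
import math


def u_of_x(x):
    return 3.0 * x + 1.0      # inner function u(x)


def y_of_u(u):
    return math.sin(u)        # outer function y(u)


x = 0.5
u = u_of_x(x)

du_dx = 3.0                   # local derivative of the inner function
dy_du = math.cos(u)           # local derivative of the outer function, evaluated at u
chain_rule = dy_du * du_dx    # dy/dx = (dy/du) * (du/dx)

# Independent check: finite difference on the composed function.
eps = 1e-6
finite_diff = (y_of_u(u_of_x(x + eps)) - y_of_u(u_of_x(x - eps))) / (2 * eps)
```

Note that dy/du had to be evaluated at the value of u produced by the inner function. That dependence on an intermediate value is what later makes caching necessary.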

A neural network is exactly this: a composition of functions. Each layer takes the previous layer’s output, multiplies by a weight matrix, adds a bias, applies a nonlinearity (like ReLU or sigmoid — functions that introduce the curvature without which the network could only learn linear relationships), and passes the result forward. The loss function at the end takes the final output and returns a single number representing how wrong the network is.

To train the network, you need the derivative of that final number with respect to every weight. The chain rule gives it to you. Backpropagation is the algorithm for applying the chain rule efficiently to this particular structure.

The forward pass is not just inference

When a network makes a prediction, it runs a forward pass: input enters at layer one, gets transformed, the result flows into layer two, gets transformed again, and so on until an output emerges. Most people understand this.

What is easy to miss is that the forward pass during training has a second job: it must save, at each layer, the intermediate values it computes. These cached values — called activations, meaning the output of each layer before it becomes the input to the next — are not kept for their own sake. They are kept because the backward pass will need them.

Here is why. When you compute the local derivative at a layer during the backward pass, the formula for that derivative almost always involves the layer’s input or output. For a ReLU nonlinearity, the derivative is one where the input was positive and zero where it was not, so to compute the derivative you need to know what the input was. For a weight matrix, the derivative with respect to the weight involves the activation that entered that layer from the left. Without caching these values during the forward pass, you would have to recompute them during the backward pass, re-running the forward pass from the input up to that layer for every layer you visit. The cost would grow quadratically with depth instead of linearly.

Caching is the bookkeeping that makes backprop tractable.

The forward pass flows left to right, caching each layer’s activation. The backward pass flows right to left, multiplying local derivatives — consuming the cached values rather than recomputing them.
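A minimal sketch of a caching forward pass, assuming scalar layers of the form relu(w * a + b) so that everything fits in plain Python. The weights and biases are made up, and the cache layout (a dict per layer) is just one convenient choice.

```python
def relu(z):
    return z if z > 0 else 0.0


def forward(x, layers):
    """Run a chain of scalar layers a -> relu(w * a + b), caching what backward needs.

    `layers` is a list of (w, b) pairs. The cache holds, per layer, the input
    that entered from the left and the pre-activation z.
    """
    cache = []
    a = x
    for w, b in layers:
        z = w * a + b
        cache.append({"input": a, "z": z})
        a = relu(z)
    return a, cache


layers = [(0.5, 0.1), (-1.5, 2.0), (2.0, -0.3)]   # made-up weights and biases
output, cache = forward(1.0, layers)
```

For pure inference the `cache` list could be dropped entirely; it exists only so that a later backward pass can look up each layer's input and the sign of z.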

The backward pass is gradient accounting

Once the forward pass is done and the loss is computed, training flips direction. The derivative of the loss with respect to itself is one, trivially, and that is the starting point. Now you work backward, layer by layer, computing the gradient of the loss with respect to each layer’s output, and from that, the gradient with respect to each layer’s weights.

The key insight is that the gradient entering a layer from the right (the gradient of the loss with respect to this layer’s output) tells you how much the loss cares about perturbations in this layer’s output. To get the gradient with respect to this layer’s input — which becomes the incoming gradient for the layer to the left — you multiply by the local derivative of this layer’s transformation. That is the chain rule: the gradient flows backward by being multiplied, one layer at a time, by local derivatives.

At each layer, you also compute the gradient with respect to the weights. That computation uses the cached activation from the forward pass — the value that entered this layer from the left — multiplied by the incoming gradient from the right. Now you see why the cache matters: the gradient with respect to a weight is a product of two things, one from the backward signal and one from the forward pass. Without the cached activation, you cannot compute it without re-running the forward pass for that layer.

This is the whole algorithm. Backward pass: for each layer from last to first, (1) compute the gradient with respect to the weights using the cached input and the incoming gradient; (2) compute the gradient with respect to the input using the local derivative and the incoming gradient; (3) pass the gradient with respect to the input leftward, where it becomes the incoming gradient for the layer before.
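The three steps can be sketched for a toy network. This assumes scalar layers relu(w * a + b) and a squared-error loss, both chosen only to keep the example short; the weights, input, and target are made up. A finite-difference check at the end confirms the first layer's weight gradient.

```python
def forward(x, layers):
    """Scalar layers a -> relu(w * a + b); cache each layer's input and z."""
    cache, a = [], x
    for w, b in layers:
        z = w * a + b
        cache.append((a, z))
        a = max(z, 0.0)
    return a, cache


def backward(layers, cache, grad_output):
    """Walk the layers last to first, returning (dL/dw, dL/db) per layer."""
    grads = [None] * len(layers)
    grad = grad_output                            # dL/d(output of the last layer)
    for i in reversed(range(len(layers))):
        w, _ = layers[i]
        a_in, z = cache[i]
        grad_z = grad * (1.0 if z > 0 else 0.0)   # through the ReLU
        grads[i] = (grad_z * a_in, grad_z)        # (1) weight and bias gradients
        grad = grad_z * w                         # (2) gradient w.r.t. this layer's input
        # (3) `grad` is now the incoming gradient for the layer to the left
    return grads


def loss_fn(layers, x, target):
    out, _ = forward(x, layers)
    return (out - target) ** 2


layers = [(0.5, 0.1), (-1.5, 2.0), (2.0, -0.3)]   # made-up weights and biases
x, target = 1.0, 1.0
out, cache = forward(x, layers)
grads = backward(layers, cache, grad_output=2 * (out - target))

# Finite-difference check on the first layer's weight.
eps = 1e-6
bumped_up = [(0.5 + eps, 0.1)] + layers[1:]
bumped_down = [(0.5 - eps, 0.1)] + layers[1:]
numeric = (loss_fn(bumped_up, x, target) - loss_fn(bumped_down, x, target)) / (2 * eps)
```

Each layer is visited exactly once on the way back, and each weight gradient is a product of a cached forward value (`a_in`) and the backward signal (`grad_z`).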

Why the gradients can vanish

Once you see backprop as a chain of multiplications, one of the most notorious failure modes — the vanishing gradient — becomes obvious.

If the local derivative at each layer is a number smaller than one, then multiplying many of them together produces a product that shrinks exponentially. A hundred layers, each with a local derivative of 0.5, produce an incoming gradient at layer one that is 0.5 raised to the 100th power, roughly 8 × 10^-31, which is effectively zero. The weights in the early layers receive a gradient so small it might as well be noise. They stop learning. This is the vanishing gradient problem that plagued deep networks for years.
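The arithmetic is easy to reproduce. This sketch treats every layer's local derivative as the same scalar, which is a simplification of the example above rather than a model of a real network.

```python
local_derivative = 0.5
depth = 100
gradient_at_layer_one = local_derivative ** depth   # product of 100 identical factors

# Same product written as an explicit chain of multiplications,
# the way the backward pass would accumulate it layer by layer.
running = 1.0
for _ in range(depth):
    running *= local_derivative
```

The result is below 1e-30, far smaller than the rounding error on any weight of ordinary size.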

The sigmoid activation function is a classic culprit. Its derivative peaks at 0.25 (at an input of zero) and collapses toward zero as the input grows large in either direction. ReLU (which passes positive values unchanged and clamps negatives to zero) has a derivative of one for positive inputs, so an active ReLU contributes no shrinkage to the product. This single observation explains much of why ReLU largely replaced sigmoid in hidden layers. It is not that ReLU is philosophically superior; it is that its gradient does not multiply by a fraction at every step.
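A small sketch comparing the two activations' derivatives. It looks only at the activation's contribution and ignores the weight factors that also enter the product in a real network; the depth of 20 is arbitrary.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)


def relu_derivative(z):
    return 1.0 if z > 0 else 0.0


peak = sigmoid_derivative(0.0)          # the maximum, reached at z = 0
saturated = sigmoid_derivative(10.0)    # far from zero the derivative collapses

# Best case through 20 layers: every sigmoid sits exactly at its peak.
sigmoid_chain = peak ** 20
relu_chain = relu_derivative(1.0) ** 20
```

Even in the best case the sigmoid chain shrinks by a factor of four per layer, while a chain of active ReLUs leaves the magnitude unchanged.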

Residual connections — the skip connections in ResNet and every transformer — address the same problem from another angle. By adding a layer’s input directly to its output, they create a gradient superhighway: the gradient can flow backward through the addition without being multiplied by anything, bypassing the layer entirely. The layer’s weights still receive a gradient through the normal chain, but the early layers never see a gradient that has been multiplied through hundreds of functions without relief.
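One way to see the effect: the derivative of x + f(x) is 1 + f'(x), so the skip path contributes a factor of one no matter how small f'(x) is. The sketch below uses a hypothetical scalar layer with a deliberately tiny local derivative of 0.01 and measures both variants by finite differences.

```python
def layer(x, w=0.01):
    """A hypothetical layer whose local derivative (w) is tiny."""
    return w * x


def plain_block(x):
    return layer(x)


def residual_block(x):
    return x + layer(x)       # skip connection: add the input back


def derivative(f, x, eps=1e-6):
    """Central finite difference of f at x."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)


depth = 50
plain_grad = derivative(plain_block, 1.0) ** depth        # about 0.01 ** 50
residual_grad = derivative(residual_block, 1.0) ** depth  # about 1.01 ** 50
```

Through fifty plain blocks the gradient is numerically gone; through fifty residual blocks it stays on the order of one.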

With sigmoid activations, each local derivative is at most 0.25, so gradients shrink exponentially across layers. ReLU’s derivative is 1 for positive inputs, keeping gradient magnitude roughly stable through depth.

The cost of a backward pass

A natural question: how expensive is this compared to the forward pass?

In terms of computation, a backward pass costs roughly two to three times a forward pass, depending on the architecture. The factor of two appears because each matrix multiplication in the forward pass turns into two in the backward pass: one for the gradient with respect to the layer’s input, and one for the gradient with respect to its weights (an outer product of the cached activation and the incoming gradient, summed over the batch). The extra fraction comes from elementwise derivatives and other overhead.
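For a single dense layer, a factor of about two falls out of counting matrix products. This is a back-of-the-envelope sketch that counts multiply-adds only and ignores biases, nonlinearities, and memory traffic; the layer sizes are made up.

```python
def forward_flops(batch, n_in, n_out):
    """Multiply-adds for Y = X @ W with X: (batch, n_in), W: (n_in, n_out)."""
    return batch * n_in * n_out


def backward_flops(batch, n_in, n_out):
    """Two matrix products, each the same size as the forward one."""
    grad_weights = n_in * batch * n_out    # dW = X^T @ dY
    grad_inputs = batch * n_out * n_in     # dX = dY @ W^T
    return grad_weights + grad_inputs


batch, n_in, n_out = 32, 1024, 1024        # made-up layer sizes
ratio = backward_flops(batch, n_in, n_out) / forward_flops(batch, n_in, n_out)
```

The ratio is exactly two for this idealized layer, regardless of the sizes chosen.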

In terms of memory, the backward pass is more expensive than it first appears. You must hold all cached activations in memory simultaneously — because the backward pass, working from last to first, needs each layer’s cached activation when it arrives at that layer. For a deep network with large batch sizes and large activations, this memory footprint dominates GPU memory consumption during training. Gradient checkpointing (a technique where you recompute activations on demand instead of caching all of them, trading compute for memory) exists precisely to manage this.
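A rough estimate of the activation footprint, and of what checkpointing might save. The model here is deliberately crude: one activation of shape (batch_size, width) per layer, 4 bytes per value, and a checkpointing scheme that keeps every eighth activation plus one segment being rebuilt. All the sizes are hypothetical.

```python
def activation_bytes(num_layers, batch_size, width, bytes_per_value=4):
    """Rough memory for caching one (batch_size, width) activation per layer."""
    return num_layers * batch_size * width * bytes_per_value


def checkpointed_bytes(num_layers, batch_size, width, every=8, bytes_per_value=4):
    """Keep only every `every`-th activation; the rest are recomputed on demand.

    Counts the stored checkpoints plus one segment being rebuilt at a time.
    """
    checkpoints = -(-num_layers // every)          # ceiling division
    live_layers = checkpoints + every
    return live_layers * batch_size * width * bytes_per_value


full = activation_bytes(num_layers=96, batch_size=64, width=4096)
saved = checkpointed_bytes(num_layers=96, batch_size=64, width=4096, every=8)
```

The full cache scales linearly with depth and with batch size, which is why both are the usual levers when training runs out of memory.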

This is a large part of why inference is cheaper than training. Inference runs only the forward pass and can throw away each activation as soon as the next layer has consumed it. Training must keep each one until the backward pass reaches that layer.

What backprop is not

Backpropagation does not explain how the network learns in any deep sense — it only computes gradients. Gradient descent (or Adam, or any other optimizer — the optimizers are the algorithms that use gradients to update weights) is the thing that actually moves the weights. Backprop is input to the optimizer, not the optimizer itself.
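The division of labor can be sketched with a toy loss. Here `gradients_of_loss` is a stand-in for backprop (the toy loss has a closed-form gradient), and `sgd_step` is plain gradient descent with a made-up learning rate of 0.1; both names are hypothetical.

```python
def sgd_step(weights, gradients, learning_rate=0.1):
    """Plain gradient descent: move each weight a small step against its gradient."""
    return [w - learning_rate * g for w, g in zip(weights, gradients)]


def loss(weights):
    return sum((w - 3.0) ** 2 for w in weights)    # toy loss, minimized at w = 3


def gradients_of_loss(weights):
    # Stand-in for backprop: this toy loss has a closed-form gradient.
    return [2.0 * (w - 3.0) for w in weights]


weights = [0.0, 5.0]
before = loss(weights)
for _ in range(20):
    grads = gradients_of_loss(weights)     # backprop's job: produce gradients
    weights = sgd_step(weights, grads)     # optimizer's job: use them
after = loss(weights)
```

Swapping `sgd_step` for a different update rule would change how the weights move without touching the gradient computation at all.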

Backpropagation also does not prove that the network will converge, or converge to a good solution. Loss landscapes for deep networks are complicated. The gradients are exact for the batch they were computed on, up to floating-point rounding, but gradient descent in a high-dimensional, non-convex landscape is guided by local information only. The gradients tell you which way is downhill right now; they do not tell you whether that valley is good or a trap.

And backpropagation did not originate with deep learning. Versions of the algorithm appear in control theory going back to the 1960s. The reason it became foundational to neural networks is Rumelhart, Hinton, and Williams’s 1986 paper, which applied it to multilayer networks and showed it was computationally tractable. That paper was less a mathematical discovery than a demonstration: this thing you already know how to do, it works here, it scales, try it.

Forty years later the core computation has not changed. What changed is that frameworks now ship efficient reverse-mode automatic differentiation engines (PyTorch’s autograd, JAX’s grad) that do the bookkeeping automatically for arbitrary computational graphs. You define the forward pass; the framework builds the backward pass by recording operations and applying the chain rule in reverse. You do not write backprop anymore; you write a forward pass and let the framework backpropagate for you.
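To make "recording operations and applying the chain rule in reverse" concrete, here is a toy scalar engine. It is a teaching sketch, not how PyTorch or JAX are implemented: the `Value` class and its methods are invented for this example and support only addition, multiplication, and ReLU.

```python
class Value:
    """A scalar that remembers how it was computed."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents          # Values this one was computed from
        self.local_grads = local_grads  # d(self)/d(parent), one per parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def relu(self):
        return Value(max(self.data, 0.0), (self,), (1.0 if self.data > 0 else 0.0,))

    def backward(self):
        # Topologically order the graph so each node is processed only after
        # every node that depends on it.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0                 # d(loss)/d(loss)
        for node in reversed(order):
            for parent, local in zip(node.parents, node.local_grads):
                parent.grad += node.grad * local   # chain rule, accumulated


w, x, b = Value(2.0), Value(3.0), Value(-1.0)
out = (w * x + b).relu()      # forward pass records the graph
out.backward()                # backward pass is derived from the recording
```

The local derivatives stored at construction time play the role of the cache: each operation saves exactly what its backward step will need.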

But the framework is doing exactly what this essay described. A chain of functions. Cached intermediate values. Local derivatives multiplied together from right to left. The chain rule, systematically applied, with good bookkeeping.

That is all it ever was.

The practical intuitions that survive

When you debug a training loop, keep three things in your head.

First: along any single path, gradients flow backward by being multiplied, not added. This means a zero anywhere on that path kills everything to its left. Dead neurons, saturated activations, and bad initialization all manifest as zeroed-out or near-zero gradients in early layers.

Second: the gradient with respect to a weight depends on the activation that entered that layer from the left. Large activations amplify gradients; small activations suppress them. Normalization layers (batch norm, layer norm) exist partly to keep activations in a range where gradients stay manageable.
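The scaling is direct for a single scalar weight. In this sketch the layer is z = w * a, so the weight gradient is the incoming gradient times the activation; the three activation magnitudes are arbitrary and the function name is hypothetical.

```python
def weight_gradient(activation_in, incoming_gradient):
    """For z = w * a, dL/dw = (dL/dz) * a."""
    return incoming_gradient * activation_in


incoming = 0.5                       # same backward signal in every case
small = weight_gradient(0.001, incoming)
typical = weight_gradient(1.0, incoming)
large = weight_gradient(1000.0, incoming)
```

With the backward signal held fixed, the weight gradient spans six orders of magnitude purely because of the activation scale.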

Third: the forward pass is not just inference, it is also memory allocation for the backward pass. If you run out of memory during training, cached activations are a prime suspect. Gradient checkpointing trades recomputation for memory. Smaller batch sizes reduce activation memory at the cost of noisier gradients.

These are not framework-specific tricks. They follow directly from what backpropagation actually is: the chain rule, run backward, using stored values from the forward pass. Once that picture is clear, the rest follows from first principles.