What is backpropagation and how does the chain rule make it work?

Backpropagation is the algorithm that computes the gradient of the loss with respect to every parameter by applying the chain rule layer by layer in reverse. It turns a single backward pass through the computation graph into exact gradients for all weights simultaneously.

Walk me through how backpropagation works.

Backpropagation computes the gradient of the loss with respect to every parameter by applying the chain rule backward through the network, reusing intermediate results from the forward pass. These gradients are then used by an optimizer to update the weights via gradient descent.

Walk me through the forward pass of a neural network end-to-end.

The forward pass feeds an input through every layer in sequence: each layer computes a linear transform followed by an activation, caching the intermediate values needed later for backpropagation. The final layer produces a prediction, which is compared to the label via a loss function.

What is the vanishing gradient problem and how do you fix it?

Vanishing gradients occur when repeated multiplication of small derivatives during backpropagation drives gradients toward zero, starving early layers of learning signal. The main fixes are better activations (ReLU/GELU), residual connections, batch normalization, and careful weight initialization.

Backpropagation (One Step) — GATE DA

What you'll learn

Backprop is the chain rule applied over a computation graph, right to left

For one path, ∂L/∂w is the product of local derivatives along that path

ReLU's local derivative is 1 when its input was positive, else 0

Computing one partial derivative through a tiny 2-layer net by hand

Last lesson left the obstacle that froze neural networks for twenty years: an MLP carries thousands of weights, and gradient descent needs ∂L/∂w for every one of them — yet the loss reaches each weight only after threading through layer upon layer of mixing and squashing. Computing those gradients one weight at a time, from scratch, is hopeless. The escape is the chain rule from calculus, run backward, and it is called backpropagation.

The idea is to treat the network as a computation graph and notice that the gradient flows like blame down an assembly line. If you know how much the loss blames a layer’s output, the chain rule tells you cheaply how much to blame that layer’s inputs — and you hand that blame back to the previous layer, and repeat, sweeping right to left until every weight has its gradient. One backward pass, every gradient at once. GATE asks you to do this for a single weight through a tiny net, so the whole machine fits on a page.

The chain rule along a path

The forward pass sends values left to right. Backprop sends gradients right to left: each edge carries a local derivative — the derivative of one node with respect to its immediate input — and the gradient of the loss with respect to any quantity is the product of those local derivatives along the path back to it.

Forward arrows build values; the gradient to a weight is the product of local derivatives back along its path.

For a weight w on one path to the loss:

∂L/∂w = (∂L/∂output) · (∂output/∂hidden) · … · (∂·/∂w)

Each factor is a local derivative. The only non-obvious one here is the activation: ReLU’s local derivative is 1 if its input was positive, else 0. It behaves as a gate — passing the upstream gradient through unchanged, or blocking it entirely.

Play with the mechanism on a single neuron — run the forward pass to fill each value, then the backward pass to watch every gradient form as downstream gradient × the local derivative on the edge:

TryBackprop · the chain rule

Backprop is the chain rule, walked backward through the graph

One neuron: z = w·x + b, a = σ(z), L = (a − y)². Edit the inputs, run the forward pass to fill in each value, then the backward pass — watch each gradient form as downstream gradient × the local derivative on the edge. That product, node by node, is backprop.

forward · valuesbackward · gradients

wweight0.6

xinput1.5

bbias-0.3

ytarget1

chain rule

Press Run forward to compute each node's value, then Run backward to watch gradients flow right-to-left. Or Step one node at a time.

loss L0.1256

∂L/∂w—

∂L/∂b—

This is what the optimizer uses: it nudges w ← w − η·∂L/∂w and b ← b − η·∂L/∂b to push the loss downhill.

ready

How GATE asks this

A NAT or MCQ: a tiny network is specified with concrete weights and an input, and you compute one partial derivative (often ∂y/∂w or ∂L/∂w) by the chain rule. The graph is small enough to trace by hand — the skill being tested is identifying the path and multiplying the local derivatives, plus remembering the ReLU gate. This pattern appeared on GATE DA 2025.

Worked example — one chain-rule step

A tiny net: input x = 2. Hidden unit h = ReLU(w₁·x) with w₁ = 1, so h = ReLU(2) = 2. Output y = w₂·h with w₂ = 3, so y = 6. Compute ∂y/∂w₁ by the chain rule.

The path from w₁ to y is w₁ → (w₁x) → h → y. Multiply the local derivative on each edge:

∂y/∂h        = w2 = 3                 (since y = w2·h)
∂h/∂(w1·x)   = 1                      (ReLU input is 2 > 0, so gate = 1)
∂(w1·x)/∂w1  = x = 2                  (linear in w1)

∂y/∂w1 = (∂y/∂h) · (∂h/∂(w1·x)) · (∂(w1·x)/∂w1)
       = 3 · 1 · 2
       = 6

So ∂y/∂w₁ = 6. The ReLU gate was open (input 2 > 0), so it contributed a factor of 1 and simply passed the gradient through — had the ReLU input been negative, the gate would be 0 and the whole gradient would vanish. The same three multiplications in code:

x, w1, w2 = 2.0, 1.0, 3.0

z = w1 * x                       # 2.0  (pre-activation)
h = max(0.0, z)                  # ReLU -> 2.0
y = w2 * h                       # 6.0

dy_dh   = w2                     # 3
dh_dz   = 1.0 if z > 0 else 0.0  # ReLU gate: input 2 > 0 -> 1
dz_dw1  = x                      # 2
dy_dw1  = dy_dh * dh_dz * dz_dw1
print("y      =", y)
print("dy/dw1 =", dy_dw1)

y      = 6.0
dy/dw1 = 6.0

In one breath

Backpropagation is the chain rule walked backward over the network’s computation graph: it sweeps gradients right to left, and the gradient to any weight on a single path is the product (never the sum) of the local derivatives along that path — ∂L/∂w = (∂L/∂out)·(∂out/∂h)·…·(∂·/∂w) — where the only tricky factor is the activation, since ReLU’s local derivative is a gate, 1 when its input was positive and 0 otherwise, so a unit whose pre-activation was ≤ 0 kills the gradient flowing through it.

Practice

Quick check

0/6

Q1Recall — Which statements about a single backprop step are TRUE? (select all that apply)select all that apply

Q2Recall — In the net x → (w₁·x) → ReLU → h → (w₂·h) → y, why can ∂y/∂w₁ be exactly 0 even when w₂ ≠ 0 and x ≠ 0?

Q3Trace — For y = w₂·h with w₂ = 7, what is the local derivative ∂y/∂h?numerical answer — type a number

Q4Trace — A net: x = 3, h = ReLU(w₁·x) with w₁ = 2 (so h = ReLU(6) = 6), y = w₂·h with w₂ = 4. Compute ∂y/∂w₁.numerical answer — type a number

Q5Trace — Same architecture, but now the ReLU input is negative: x = −1, w₁ = 2 (pre-activation = −2), y = w₂·h with w₂ = 5. Compute ∂y/∂w₁.numerical answer — type a number

Q6Apply — A net: x = 1, h = ReLU(w₁·x) with w₁ = 4 (h = 4), y = w₂·h with w₂ = 2 (y = 8). Compute ∂y/∂w₂.numerical answer — type a number

A question to carry forward

With backprop, the whole supervised machine is complete: features and labels go in, gradients flow back, weights settle, a trained predictor comes out. Look back over every model since the start of this chapter — regression, the classifiers, the neural net — and one ingredient was always present, quietly indispensable: the labels. The answer key. Every method learned by comparing its guess against a known truth.

Now take the answer key away. You are handed a pile of points — customers, pixels, documents — and nothing tells you which belongs with which. Yet you suspect they fall into natural groups: a few customer types, a handful of topics. Here is the thread onward into unsupervised learning: with no labels at all to guide you, how could an algorithm discover those groups on its own — what would it even mean for a grouping to be “good,” and how do you find one by alternately guessing the groups and refining them?

Backpropagation (One Step)

What you'll learn

Before you start

The chain rule along a path

Backprop is the chain rule, walked backward through the graph

How GATE asks this

Worked example — one chain-rule step

In one breath

Practice

Quick check

A question to carry forward

Sign in to track your progress

Practice this in an interview

Related lessons

Explore further