Backpropagation (One Step)
Backprop is the chain rule walked backward over a computation graph. GATE asks for one partial derivative through a small net — here is the exact recipe.
What you'll learn
- Backprop is the chain rule applied over a computation graph, right to left
- For one path, ∂L/∂w is the product of local derivatives along that path
- ReLU's local derivative is 1 when its input was positive, else 0
- Computing one partial derivative through a tiny 2-layer net by hand
Before you start
Training a network means nudging every weight in the direction that lowers the
loss — which needs ∂L/∂w for each weight w. Backpropagation computes
those gradients efficiently by treating the network as a computation graph
and applying the chain rule backward, layer by layer, from the loss toward
the inputs.
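To make the "nudging" concrete, here is a minimal sketch of a single update on one weight. The model (y = w·x), the squared-error loss, the target value and the learning rate are all assumptions chosen for illustration; none of them come from the lesson itself.

```python
# Hypothetical one-weight model: y = w * x, with squared-error loss.
def loss(w, x, target):
    return (w * x - target) ** 2

def dloss_dw(w, x, target):
    # Chain rule: dL/dy = 2 * (y - target), dy/dw = x
    return 2.0 * (w * x - target) * x

w, x, target = 1.0, 2.0, 6.0   # illustrative values only
learning_rate = 0.1            # assumed step size

before = loss(w, x, target)
grad = dloss_dw(w, x, target)
w_new = w - learning_rate * grad   # step against the gradient
after = loss(w_new, x, target)

print("loss before:", before, "loss after:", after)
```

The gradient is negative here, so stepping against it increases w and the loss drops. Backprop's job is only to supply `grad`; the update rule is a separate choice.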
The chain rule along a path
The forward pass sends values left to right. Backprop sends gradients right to left: each edge carries a local derivative, and the gradient of the loss with respect to any quantity is the product of those local derivatives along the path back to it. When several paths lead back to the same quantity, the products for the individual paths are added together.
For a weight w on one path to the loss:
∂L/∂w = (∂L/∂output) · (∂output/∂hidden) · … · (∂·/∂w)
Each factor is a local derivative — the derivative of one node with respect to its immediate input. The only non-obvious one here is the activation: ReLU’s local derivative is 1 if its input was positive, else 0. It acts as a gate that either passes the upstream gradient through unchanged or blocks it.
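The gate behaviour can be sketched in a few lines of plain Python. The helper names (`relu_grad`, `path_gradient`) and the numbers fed into them are hypothetical, picked only to show one open gate and one closed gate; the sketch also assumes the common convention that the derivative at exactly 0 is 0.

```python
def relu(z):
    """ReLU activation: max(0, z)."""
    return max(0.0, z)

def relu_grad(z):
    """Local derivative of ReLU: 1 if the input was positive, else 0."""
    return 1.0 if z > 0 else 0.0

def path_gradient(local_derivatives):
    """Chain rule along one path: multiply the local derivatives."""
    grad = 1.0
    for d in local_derivatives:
        grad *= d
    return grad

# Gate open (ReLU input 0.7 > 0): the other factors pass through unchanged.
open_gate = path_gradient([5.0, relu_grad(0.7), 2.0])     # 5 * 1 * 2

# Gate closed (ReLU input -0.7 < 0): a single zero factor blocks the path.
closed_gate = path_gradient([5.0, relu_grad(-0.7), 2.0])  # 5 * 0 * 2

print(open_gate, closed_gate)  # 10.0 0.0
```

Because the path gradient is a product, one closed gate anywhere on the path is enough to zero it out.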
How GATE asks this
A NAT or MCQ: a tiny network is specified with concrete weights and an
input, and you compute one partial derivative (often ∂y/∂w or ∂L/∂w) by
the chain rule. The graph is small enough to trace by hand — the skill being
tested is identifying the path and multiplying the local derivatives, plus
remembering the ReLU gate. This pattern appeared on GATE DA 2025.
Worked example — one chain-rule step
A tiny net: input x = 2. Hidden unit h = ReLU(w₁·x) with w₁ = 1, so h = ReLU(2) = 2. Output y = w₂·h with w₂ = 3, so y = 6. Compute ∂y/∂w₁ by the chain rule.
The path from w₁ to y is w₁ → (w₁x) → h → y. Multiply the local
derivative on each edge:
∂y/∂h = w₂ = 3 (since y = w₂·h)
∂h/∂(w₁·x) = 1 (ReLU input is 2 > 0, so gate = 1)
∂(w₁·x)/∂w₁ = x = 2 (linear in w₁)
∂y/∂w₁ = (∂y/∂h) · (∂h/∂(w₁·x)) · (∂(w₁·x)/∂w₁)
       = 3 · 1 · 2
       = 6
So ∂y/∂w₁ = 6. Notice the ReLU gate was open (input 2 > 0), so it
contributed a factor of 1 and simply passed the gradient through. Had the
ReLU input been negative, the gate would be 0 and the whole gradient would
vanish.
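The hand calculation can be double-checked numerically. The sketch below recomputes ∂y/∂w₁ for the same net (x = 2, w₁ = 1, w₂ = 3) by the chain rule and compares it with a central finite-difference estimate. The function names and the step size `eps` are assumptions made for this check, not part of the original example.

```python
def forward(w1, w2, x):
    """Forward pass of the tiny net; returns every intermediate value."""
    z = w1 * x          # pre-activation
    h = max(0.0, z)     # ReLU
    y = w2 * h          # output
    return z, h, y

def dy_dw1(w1, w2, x):
    """Chain rule along w1 -> z -> h -> y."""
    z, _, _ = forward(w1, w2, x)
    dy_dh = w2                      # since y = w2 * h
    dh_dz = 1.0 if z > 0 else 0.0   # ReLU gate
    dz_dw1 = x                      # since z = w1 * x
    return dy_dh * dh_dz * dz_dw1

x, w1, w2 = 2.0, 1.0, 3.0
analytic = dy_dw1(w1, w2, x)        # 3 * 1 * 2

# Central finite difference as an independent check (eps is an assumed step).
eps = 1e-6
y_plus = forward(w1 + eps, w2, x)[2]
y_minus = forward(w1 - eps, w2, x)[2]
numeric = (y_plus - y_minus) / (2 * eps)

# Same net with w1 = -1: the ReLU input is negative, so the gate closes.
blocked = dy_dw1(-1.0, w2, x)

print(analytic, blocked)  # 6.0 0.0
```

The finite-difference value `numeric` should agree with 6 up to floating-point error, which is a handy sanity check whenever a hand-derived gradient is in doubt.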
Quick check
Practice this in an interview
Backpropagation is the algorithm that computes the gradient of the loss with respect to every parameter by applying the chain rule layer by layer in reverse. It turns a single backward pass through the computation graph into exact gradients for all weights simultaneously.
The forward pass feeds an input through every layer in sequence: each layer computes a linear transform followed by an activation, caching the intermediate values needed later for backpropagation. The final layer produces a prediction, which is compared to the label via a loss function.
Vanishing gradients occur when repeated multiplication of small derivatives during backpropagation drives gradients toward zero, starving early layers of learning signal. The main fixes are better activations (ReLU/GELU), residual connections, batch normalization, and careful weight initialization.
An LSTM maintains a cell state that flows through time via additive updates controlled by learned gates, giving gradients a near-linear path across many steps. The forget, input, and output gates let the network selectively retain, write, and expose information rather than crushing every signal through a squashing non-linearity at every step.