How do LSTM gates solve the vanishing gradient problem?

An LSTM maintains a cell state that flows through time via additive updates controlled by learned gates, giving gradients a near-linear path across many steps. The forget, input, and output gates let the network selectively retain, write, and expose information rather than crushing every signal through a squashing non-linearity at every step.

Walk me through how backpropagation works.

Backpropagation computes the gradient of the loss with respect to every parameter by applying the chain rule backward through the network, reusing intermediate results from the forward pass. These gradients are then used by an optimizer to update the weights via gradient descent.

What is the vanishing gradient problem and how do you fix it?

Vanishing gradients occur when repeated multiplication of small derivatives during backpropagation drives gradients toward zero, starving early layers of learning signal. The main fixes are better activations (ReLU/GELU), residual connections, batch normalization, and careful weight initialization.

What is gradient accumulation and when do you need it?

Gradient accumulation sums gradients over multiple small forward-backward passes before calling the optimizer, simulating a larger effective batch size without requiring the memory to hold it all at once. It is the standard workaround when the desired batch size does not fit in GPU memory.

Gradient Descent (One Step) — GATE DA

What you'll learn

The update rule: w ← w − η·(∂L/∂w) — step against the gradient

What the learning rate η controls, and why too large diverges while too small crawls

For squared loss L = ½(wx − y)², the gradient is (wx − y)·x

Performing one gradient-descent update by hand to a clean number

Last lesson left us needing a second way down to the best weights — one that never inverts a matrix, never even builds XᵀX, but simply creeps toward the bottom of the cost bowl. Picture exactly that situation literally. You are standing somewhere on a smooth valley in thick fog. You cannot see the bottom, but you can feel which way the ground tilts under your feet. So you take a small step in the steepest downhill direction, feel the slope again, and step again. Keep going and the valley floor pulls you in.

That fog-walk is gradient descent, and it answers all three of last lesson’s questions at once: the slope under your feet is the gradient, which tells you which way is downhill; the size of your step is a number you choose; and the rule that turns “step downhill” into arithmetic is a single line. The gradient ∂L/∂w points uphill, toward steeper loss — so to go down you step the opposite way, against it. The same single move, repeated millions of times, is what trains every neural network you have ever used.

The update rule

New weight = old weight, nudged downhill by a fraction η of the slope.

Three pieces, and nothing more at exam level.

Gradient ∂L/∂w — the slope of the loss with respect to the weight. A positive slope means loss rises as w rises, so we should decrease w — and the minus sign does that for us automatically.
Learning rate η (eta) — a small positive number, the step size. It scales how far you move on each update.
Subtraction — descent moves against the gradient. Step with its sign and you climb the loss instead (gradient ascent); that flipped sign is the classic error.

Drop a ball onto this loss bowl, change the learning rate, and step it downhill by hand. Watch a small η crawl, and a large one bounce clear across the valley:

Trygradient descent

Click to drop a ball — watch it roll downhill

Click anywhere to drop the ball there. target (2, −1)

x-4.000

y3.000

loss100.000

steps0

learning rate0.100

0.0010.5

The gradient for squared loss

The most common loss in GATE problems is the squared error of a single point. For a prediction ŷ = w·x and target y,

L = ½ (w·x − y)²

Differentiating with respect to w (chain rule — the ½ cancels the 2 that comes down):

∂L/∂w = (w·x − y) · x      ←  this is "(prediction − target) × input"

So the update for one data point becomes

w ← w − η · (w·x − y) · x

That residual (w·x − y) is just the prediction error; the gradient is that error re-weighted by the input. Memorise the shape — GATE drops numbers straight into it.

How GATE asks this

Almost always a NAT: you are handed a current weight, a gradient (or enough to compute it), and a learning rate, and asked for the weight after one update. GATE DA 2026 (Q29) gave exactly this — one SGD step on f(x) = w·x, where the gradient works out to the input value — and wanted the new weight to two decimals. The whole task is one substitution into w − η·g. Occasionally it appears as an MCQ on what happens to η when it is too large or too small.

Worked example

The current weight is w = 10. The gradient of the loss at this point is 10. With learning rate η = 0.1, perform one gradient-descent update. What is w_new?

Substitute straight into the rule:

w_new = w − η · (∂L/∂w)
      = 10 − 0.1 × 10
      = 10 − 1
      = 9.00

The same three lines in Python land on the same number:

w, grad, eta = 10.0, 10.0, 0.1

step  = eta * grad        # 0.1 * 10 = 1.0
w_new = w - step          # 10 - 1 = 9.0

print(f"step  = {step}")
print(f"w_new = {w_new:.2f}")

step  = 1.0
w_new = 9.00

So w_new = 9.00. One step moved the weight down by η × gradient = 1. With a smaller η the step would be smaller and slower; with η large enough that η × gradient overshoots the minimum, the weight would jump clean past it and the loss could grow instead of shrink.

In one breath

Gradient descent reaches the bottom of a loss surface without ever inverting anything: it reads the slope ∂L/∂w (which points uphill), then steps the opposite way by the one-line rule w ← w − η·(∂L/∂w), where the learning rate η sets the step size — too small crawls, too large overshoots and diverges; for the squared loss L = ½(w·x − y)² the gradient is simply (w·x − y)·x, the prediction error times the input, so one update by hand is a single clean substitution.

Practice

Quick check

0/6

Q1Recall — In the rule w ← w − η·(∂L/∂w), why is the gradient SUBTRACTED?

Q2Recall — Which statements about the learning rate η are TRUE? (select all that apply)select all that apply

Q3Trace — Current weight w = 10, gradient ∂L/∂w = 10, learning rate η = 0.1. After one gradient-descent update, what is w_new? (2 decimals)numerical answer — type a number

Q4Trace — For L = ½(w·x − y)² with w = 2, x = 3, y = 5 and η = 0.1, what is w after one step? (2 decimals)numerical answer — type a number

Q5Trace — A model predicts ŷ = w·x = 8 while the true value is y = 8 (zero error). With squared loss L = ½(w·x − y)², what is the gradient ∂L/∂w at this point?numerical answer — type a number

Q6Apply — Starting from w = 5 with a constant gradient of 4 and η = 0.5, what is w after TWO updates? (2 decimals)numerical answer — type a number

A question to carry forward

Now you have two ways to the bottom of a loss bowl: the one-shot normal equation, and this patient downhill walk. Both do the same faithful job — they find the weights that make the training loss as small as it can go.

But sit with that word, faithful. Last lesson’s first warning was that a model fitting the training data too perfectly fits its noise too — those huge, wild weights that swing on new data. Gradient descent, done well, walks you straight into that trap, because the trap is at the very bottom of the bowl it is descending. So the cure cannot be a smarter optimizer; the cure is to change what you ask it to minimise — to add a clause that punishes oversized weights. Here is the thread onward: what does that extra clause look like, what single knob controls how hard it bites, and what is the model giving up in exchange for steadier predictions?

Gradient Descent (One Step)

What you'll learn

Before you start

The update rule

Click to drop a ball — watch it roll downhill

The gradient for squared loss

How GATE asks this

Worked example

In one breath

Practice

Quick check

A question to carry forward

Sign in to track your progress

Practice this in an interview

Related lessons

Explore further