Gradient Descent (One Step)
The workhorse of model training, reduced to a single line: w ← w − η·(∂L/∂w). GATE asks you to perform exactly one update by hand.
What you'll learn
- The update rule: w ← w − η·(∂L/∂w) — step against the gradient
- What the learning rate η controls, and why too large diverges while too small crawls
- For squared loss L = ½(wx − y)², the gradient is (wx − y)·x
- Performing one gradient-descent update by hand to a clean number
Before you start
Fitting a model means finding the weights that make a loss as small as possible.
Gradient descent is the simplest way to get there: stand on the loss surface,
look which way is downhill, and take a small step that direction. Repeat. The
gradient ∂L/∂w points uphill (the direction in which the loss rises fastest), so to
go down you step the opposite way: you subtract it.
That single move, written once, is the whole idea GATE tests — and it is the same move, repeated millions of times, that trains every neural network you have ever used.
The update rule
Three pieces, and nothing more at exam level:
- Gradient ∂L/∂w: the slope of the loss with respect to the weight. Positive slope means loss rises as w rises, so we should decrease w (and the minus sign does that automatically).
- Learning rate η (eta): a small positive number, the step size. It scales how far you move each update.
- Subtraction: descent moves against the gradient. Step with the sign and you climb (gradient ascent); that is the classic sign error.
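The three pieces fit in a one-line function. This is a minimal sketch: the names `gd_step` and `ascent_step` and the numbers used are illustrative assumptions, not taken from any exam question.

```python
def gd_step(w: float, grad: float, lr: float) -> float:
    """Return the weight after one gradient-descent update: w - lr * grad."""
    return w - lr * grad


def ascent_step(w: float, grad: float, lr: float) -> float:
    """The classic sign error: adding the gradient climbs the loss instead."""
    return w + lr * grad


# Illustrative numbers (assumed for this sketch): w = 2, gradient = 4, lr = 0.5.
w_down = gd_step(2.0, 4.0, 0.5)    # positive slope, so w decreases
w_up = ascent_step(2.0, 4.0, 0.5)  # wrong sign, so w increases
print(w_down, w_up)
```

With a positive gradient, the correct rule lowers the weight while the sign-flipped version raises it, which is the quickest way to catch the error in a hand calculation.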
[Interactive demo: a ball on a loss bowl, stepped downhill by hand at an adjustable learning rate. A small η crawls toward the minimum; a large η bounces across the valley.]
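The crawl-versus-bounce behaviour can be reproduced numerically. The sketch below assumes a hypothetical one-dimensional bowl L(w) = ½w² (gradient equal to w), a start at w = 1, and three hand-picked learning rates; none of these values come from the page itself.

```python
def run_gd(w0: float, lr: float, steps: int) -> list:
    """Run gradient descent on the assumed bowl L(w) = 0.5 * w**2."""
    history = [w0]
    w = w0
    for _ in range(steps):
        grad = w           # dL/dw for the assumed bowl
        w = w - lr * grad  # the update rule
        history.append(w)
    return history


for lr in (0.1, 1.9, 2.1):
    path = run_gd(w0=1.0, lr=lr, steps=5)
    print(f"lr={lr}: " + ", ".join(f"{w:+.3f}" for w in path))
```

On this particular bowl each step multiplies w by (1 − η). So η = 0.1 shrinks w slowly, η = 1.9 flips its sign every step while still shrinking it, and η = 2.1 flips the sign and grows it, which is divergence.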
The gradient for squared loss
The most common loss in GATE problems is the squared error of one point. For a
prediction ŷ = w·x and target y,
L = ½ (w·x − y)²
Differentiating with respect to w (chain rule, the ½ cancels the 2):
∂L/∂w = (w·x − y) · x ← this is "(prediction − target) × input"
So the update for one data point becomes
w ← w − η · (w·x − y) · x
That residual (w·x − y) is just the prediction error; the gradient is that error
re-weighted by the input. Memorise the shape — GATE plugs numbers straight into it.
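One way to convince yourself of the formula is to compare it with a finite-difference slope and then apply a single update. This is a sketch under assumed values (w = 1, x = 2, y = 6, η = 0.1); the function names are hypothetical.

```python
def squared_loss(w: float, x: float, y: float) -> float:
    """L = 0.5 * (w*x - y)**2 for a single data point."""
    return 0.5 * (w * x - y) ** 2


def squared_loss_grad(w: float, x: float, y: float) -> float:
    """Analytic gradient: (prediction - target) * input."""
    return (w * x - y) * x


def numeric_grad(w: float, x: float, y: float, h: float = 1e-6) -> float:
    """Central-difference estimate of dL/dw, used only as a sanity check."""
    return (squared_loss(w + h, x, y) - squared_loss(w - h, x, y)) / (2 * h)


# Assumed example values, chosen so the arithmetic is easy to follow.
w, x, y, lr = 1.0, 2.0, 6.0, 0.1
grad = squared_loss_grad(w, x, y)  # (1*2 - 6) * 2 = -8
w_new = w - lr * grad              # 1 - 0.1 * (-8) = 1.8

print("analytic gradient:", grad)
print("numeric gradient: ", round(numeric_grad(w, x, y), 4))
print("w_new:", round(w_new, 4))
print("loss before:", squared_loss(w, x, y), "loss after:", round(squared_loss(w_new, x, y), 4))
```

The prediction (2) is below the target (6), so the residual is negative, the gradient is negative, and subtracting it raises w. The loss falls from 8 to 2.88 after this one step.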
How GATE asks this
Almost always a NAT: you are handed a current weight, a gradient (or enough to
compute it), and a learning rate, and asked for the weight after one update. GATE
DA 2026 (Q29) gave exactly this — one SGD step on f(x) = w·x, where the gradient
works out to the input value — and wanted the new weight to two decimals. The whole task is one substitution into
w − η·g. Occasionally it appears as an MCQ on what happens when η is too large
or too small.
Worked example
The current weight is w = 10. The gradient of the loss at this point is 10. With learning rate η = 0.1, perform one gradient-descent update. What is w_new?
Substitute straight into the rule:
w_new = w − η · (∂L/∂w)
= 10 − 0.1 × 10
= 10 − 1
= 9.00
So w_new = 9.00. One step moved the weight down by η × gradient = 1. With a
smaller η the step would be smaller (slower); with η large enough that η × gradient
overshoots the minimum, the weight would jump past it and the loss could grow.
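The same substitution can be checked in a few lines and formatted to two decimals, as a NAT answer expects. In this sketch the helper name `one_step` is hypothetical, and the extra learning rates 0.01 and 0.5 are assumptions added for comparison; only η = 0.1 comes from the example.

```python
def one_step(w: float, grad: float, lr: float) -> float:
    """One gradient-descent update: w_new = w - lr * grad."""
    return w - lr * grad


w, grad = 10.0, 10.0  # values from the worked example

results = {}
for lr in (0.01, 0.1, 0.5):  # 0.1 is the example's rate; the others are illustrative
    results[lr] = one_step(w, grad, lr)
    print(f"eta={lr}: step={lr * grad:.2f}, w_new={results[lr]:.2f}")
```

The step size is always η × gradient, so changing η by a factor changes how far the weight moves by the same factor.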