datarekha

Gradient Descent (One Step)

The workhorse of model training, reduced to a single line: w ← w − η·(∂L/∂w). GATE asks you to perform exactly one update by hand.

7 min read Intermediate GATE DA Lesson 80 of 122

What you'll learn

  • The update rule: w ← w − η·(∂L/∂w) — step against the gradient
  • What the learning rate η controls, and why too large diverges while too small crawls
  • For squared loss L = ½(wx − y)², the gradient is (wx − y)·x
  • Performing one gradient-descent update by hand to a clean number

Before you start

Fitting a model means finding the weights that make a loss as small as possible. Gradient descent is the simplest way to get there: stand on the loss surface, look which way is downhill, and take a small step in that direction. Repeat. The gradient ∂L/∂w points uphill (the direction in which the loss increases fastest), so to go down you step the opposite way: you subtract it.

That single move, written once, is the whole idea GATE tests — and it is the same move, repeated millions of times, that trains every neural network you have ever used.

The update rule

w_new = w − η · (∂L/∂w)
(w_new: updated weight; η: learning rate; ∂L/∂w: gradient, i.e. slope)
New weight = old weight, nudged downhill by a fraction η of the slope.

Three pieces, and nothing more at exam level:

  • Gradient ∂L/∂w: the slope of the loss with respect to the weight. A positive slope means the loss rises as w rises, so we should decrease w (and the minus sign does that automatically).
  • Learning rate η (eta): a small positive number, the step size. It scales how far you move on each update.
  • Subtraction: descent moves against the gradient. Add the gradient instead of subtracting it and you climb (gradient ascent); that is the classic sign error.
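
The three pieces above fit in one line of code. This is a minimal sketch, not any library's API: the function name gradient_step and the sample numbers are made up for illustration.

```python
def gradient_step(w, grad, lr):
    """Return the weight after one gradient-descent update: w - lr * grad."""
    return w - lr * grad


# Illustrative numbers (not from the lesson): same weight, opposite slopes.
w_down = gradient_step(w=2.0, grad=4.0, lr=0.25)   # positive slope: w decreases
w_up = gradient_step(w=2.0, grad=-4.0, lr=0.25)    # negative slope: w increases
print(w_down, w_up)
```

The subtraction handles both cases: the weight always moves in the direction that lowers the loss, whatever the sign of the slope.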

[Interactive demo: a ball on a loss bowl, stepped downhill by hand at an adjustable learning rate. A small η crawls; a large one bounces across the valley.]

The gradient for squared loss

The most common loss in GATE problems is the squared error of one point. For a prediction ŷ = w·x and target y,

L = ½ (w·x − y)²

Differentiating with respect to w (chain rule, the ½ cancels the 2):

∂L/∂w = (w·x − y) · x      ←  this is "(prediction − target) × input"

So the update for one data point becomes

w ← w − η · (w·x − y) · x

That residual (w·x − y) is just the prediction error; the gradient is that error re-weighted by the input. Memorise the shape — GATE plugs numbers straight into it.
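
If you want to convince yourself that the formula is right, one option is to compare it against a numerical slope. The sketch below assumes made-up values for w, x and y; the helper names loss, grad and numeric_grad are illustrative only.

```python
def loss(w, x, y):
    """Squared loss for one point: L = 0.5 * (w*x - y)**2."""
    return 0.5 * (w * x - y) ** 2


def grad(w, x, y):
    """Analytic gradient from the lesson: (prediction - target) * input."""
    return (w * x - y) * x


def numeric_grad(w, x, y, h=1e-6):
    """Central-difference estimate of dL/dw, used only as a sanity check."""
    return (loss(w + h, x, y) - loss(w - h, x, y)) / (2 * h)


# Illustrative values: prediction 3.0, target 1.0, so the residual is 2.0.
w, x, y = 1.5, 2.0, 1.0
analytic = grad(w, x, y)
estimate = numeric_grad(w, x, y)
print(analytic, estimate)
```

The two numbers should agree to several decimal places; a large gap would point to a slip in the derivative, most often a dropped factor of x.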

How GATE asks this

Almost always a NAT (numerical answer type): you are handed a current weight, a gradient (or enough to compute it), and a learning rate, and asked for the weight after one update. GATE DA 2026 (Q29) gave exactly this, one SGD step on f(x) = w·x, where the gradient works out to the input value, and wanted the new weight to two decimals. The whole task is one substitution into w − η·g, where g is the gradient. Occasionally it appears as an MCQ on what happens when η is too large or too small.

Worked example

The current weight is w = 10. The gradient of the loss at this point is 10. With learning rate η = 0.1, perform one gradient-descent update. What is w_new?

Substitute straight into the rule:

w_new = w − η · (∂L/∂w)
      = 10 − 0.1 × 10
      = 10 − 1
      = 9.00

So w_new = 9.00. One step moved the weight down by η × gradient = 1. With a smaller η the step would be smaller (slower); with η large enough that η × gradient overshoots the minimum, the weight would jump past it and the loss could grow.
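
To see the learning-rate effect in numbers, here is a small sketch that repeats the update several times. It assumes the loss is L = ½w², which has gradient w and therefore gradient 10 at w = 10; the lesson does not say which loss produced that gradient, so treat this as one possible setting. The function name run and the three η values are illustrative.

```python
def run(w, lr, steps):
    """Apply `steps` gradient-descent updates on L = 0.5 * w**2 (gradient = w)."""
    for _ in range(steps):
        w = w - lr * w
    return w


# Assumed learning rates: tiny, moderate, and too large for this loss.
for lr in (0.01, 0.1, 2.5):
    print(lr, run(10.0, lr, steps=5))
```

With η = 0.01 the weight barely moves from 10 after five steps, with η = 0.1 it heads steadily toward the minimum at 0, and with η = 2.5 each step overshoots so far that the weight flips sign and grows in magnitude.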

Quick check

Q1. Current weight w = 10, gradient ∂L/∂w = 10, learning rate η = 0.1. After one gradient-descent update, what is w_new? (numerical answer, 2 decimals)
Q2. For L = ½(w·x − y)² with w = 2, x = 3, y = 5 and η = 0.1, what is w after one step? (numerical answer, 2 decimals)
Q3. In the rule w ← w − η·(∂L/∂w), why is the gradient SUBTRACTED?
Q4. Which statements about the learning rate η are TRUE? (select all that apply)
Q5. A model predicts ŷ = w·x = 8 while the true value is y = 8 (zero error). With squared loss L = ½(w·x − y)², what is the gradient ∂L/∂w at this point? (numerical answer)
Q6. Starting from w = 5 with a constant gradient of 4 and η = 0.5, what is w after TWO updates? (numerical answer, 2 decimals)

Practice this in an interview

How do LSTM gates solve the vanishing gradient problem?

An LSTM maintains a cell state that flows through time via additive updates controlled by learned gates, giving gradients a near-linear path across many steps. The forget, input, and output gates let the network selectively retain, write, and expose information rather than crushing every signal through a squashing non-linearity at every step.

What is the vanishing gradient problem and how do you fix it?

Vanishing gradients occur when repeated multiplication of small derivatives during backpropagation drives gradients toward zero, starving early layers of learning signal. The main fixes are better activations (ReLU/GELU), residual connections, batch normalization, and careful weight initialization.

What is gradient accumulation and when do you need it?

Gradient accumulation sums gradients over multiple small forward-backward passes before calling the optimizer, simulating a larger effective batch size without requiring the memory to hold it all at once. It is the standard workaround when the desired batch size does not fit in GPU memory.
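
A framework-free sketch of the idea, using the single-weight squared loss from this lesson. The data points, the micro-batch size of 2, and the helper name point_grad are all made up for illustration; real training code would use a deep-learning framework's optimizer instead.

```python
def point_grad(w, x, y):
    """Gradient of 0.5 * (w*x - y)**2 with respect to w for one point."""
    return (w * x - y) * x


# Hypothetical (x, y) pairs and hyperparameters, chosen only for the demo.
data = [(1.0, 2.0), (2.0, 3.0), (3.0, 7.0), (4.0, 9.0)]
w, lr = 0.5, 0.01

# Full batch: average the gradient over all points in one pass.
full_grad = sum(point_grad(w, x, y) for x, y in data) / len(data)

# Accumulation: process micro-batches of 2, summing gradients as we go,
# and only average and update once every point has been seen.
accum = 0.0
for i in range(0, len(data), 2):
    micro_batch = data[i:i + 2]
    accum += sum(point_grad(w, x, y) for x, y in micro_batch)
accum_grad = accum / len(data)

w_full = w - lr * full_grad
w_accum = w - lr * accum_grad
print(w_full, w_accum)
```

Both routes produce the same update because the weight is held fixed while the gradients are being summed; only the peak memory needed per pass differs.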

When should you use gradient descent over the normal equation to fit a linear regression?

The normal equation gives an exact closed-form solution, but forming and solving it costs roughly O(np² + p³) time, so it becomes impractical when the number of features p is large (typically above ~10,000) because the solve is cubic in p. Gradient descent costs O(np) per iteration, making it the practical choice for large feature spaces or online learning.
