Ridge Regression & Regularization
Ordinary least squares can overfit. Ridge adds an L2 penalty that shrinks the weights, trading a little bias for a lot less variance.
What you'll learn
- Why unconstrained least squares overfits, and how a penalty term fixes it
- Ridge = L2 penalty: minimize Σ(yᵢ − ŷᵢ)² + λ·‖w‖²
- Ridge is L2, lasso is L1 — the single distinction GATE checks
- Larger λ shrinks weights toward zero: more bias, less variance
- Computing a regularized loss to a clean number
Before you start
Ordinary least squares picks the weights that minimise the squared error on the training data — and only that. With many features or noisy data, it can fit the noise, producing huge weights that swing wildly on new data. That is overfitting.
Ridge regression tames it by adding a penalty for large weights to the loss. The model now has to balance two goals: fit the data and keep the weights small. The penalty pulls every weight toward zero, smoothing the fit.
The L2 penalty
Ridge minimises the usual squared error plus a penalty on the size of the weight
vector, scaled by a knob λ (lambda):
J(w) = Σ(yᵢ − ŷᵢ)² + λ·‖w‖²
- The first term, Σ(yᵢ − ŷᵢ)², is ordinary least squares: make predictions match.
- ‖w‖² = Σ wⱼ² is the L2 penalty: the sum of the squares of the weights.
- λ ≥ 0 controls the trade-off. At λ = 0 you recover plain least squares. As λ grows, the penalty dominates and all weights are pulled toward zero (but, for ridge, never exactly to zero).
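To make the role of λ concrete, here is a minimal sketch that solves the ridge objective with the standard closed form w = (XᵀX + λI)⁻¹Xᵀy and prints the weight norm for a few values of λ. The data, the helper name ridge_fit, and the choice to omit an intercept are assumptions made for illustration only.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimise ||y - Xw||^2 + lam * ||w||^2 via the regularized normal equations."""
    n_features = X.shape[1]
    A = X.T @ X + lam * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)

# Hypothetical toy data: 20 samples, 5 features, no intercept.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
true_w = np.array([2.0, -1.0, 0.5, 0.0, 3.0])
y = X @ true_w + rng.normal(scale=0.5, size=20)

lambdas = [0.0, 1.0, 10.0, 100.0]
weights = [ridge_fit(X, y, lam) for lam in lambdas]
norms = [float(np.linalg.norm(w)) for w in weights]

for lam, w, norm in zip(lambdas, weights, norms):
    print(f"lambda={lam:6.1f}  ||w||={norm:.3f}  w={np.round(w, 3)}")
```

At λ = 0 the result should coincide with plain least squares, and the norm should fall as λ grows while no weight lands exactly on zero.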
A crucial naming point GATE leans on: ridge uses the L2 norm; lasso uses the L1
norm (Σ|wⱼ|, the sum of absolute values). Lasso can drive weights to exactly
zero (feature selection); ridge only shrinks them. Don’t swap the two.
As λ increases, every coefficient shrinks. With L2 (ridge) the shrinkage is
smooth and nothing hits zero; with L1 (lasso) small weights snap to exactly zero.
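The difference between the two penalties is easiest to see in a special case. The sketch below assumes orthonormal feature columns (XᵀX = I), where each weight can be solved independently from its least-squares value: ridge divides by (1 + λ), while lasso applies a soft threshold at λ/2. The function names and the example weights are hypothetical, and the formulas hold only under that orthonormality assumption.

```python
import numpy as np

def ridge_shrink(w_ols, lam):
    # Per-coordinate argmin of (w - w_ols)^2 + lam * w^2
    return w_ols / (1.0 + lam)

def lasso_shrink(w_ols, lam):
    # Per-coordinate argmin of (w - w_ols)^2 + lam * |w|: soft threshold at lam / 2
    return np.sign(w_ols) * np.maximum(np.abs(w_ols) - lam / 2.0, 0.0)

w_ols = np.array([3.0, -1.0, 0.4, -0.2])   # hypothetical least-squares weights
lam = 1.0

w_ridge = ridge_shrink(w_ols, lam)   # every weight halved, none zero
w_lasso = lasso_shrink(w_ols, lam)   # weights with |w| <= 0.5 become zero

print("ridge:", w_ridge)
print("lasso:", w_lasso)
```

Here ridge keeps all four weights nonzero, while lasso zeroes the two smallest ones and keeps the larger two.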
How GATE asks this
Two recurring shapes. As an MCQ, it asks the direction of the effect: what
happens to bias and variance as you increase λ. GATE DA 2026 posed exactly this
(Q37) and the correct statement was that the regularizer increases bias and
decreases variance. As a NAT, it gives you weights, a data-fit error, and λ,
and asks for the total regularized objective — pure substitution (GATE DA 2026 Q55
did this with an MAE data term).
Worked example
Part A — the direction (GATE DA 2026). As the regularization coefficient λ
increases, the weights are squeezed toward zero, so the model becomes simpler and
stiffer. A stiffer model varies less from one training set to another (variance
decreases) but systematically misses more structure (bias increases). The 2026
answer was exactly this pair: bias ↑, variance ↓.
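The direction in Part A can also be checked numerically. The following is a sketch under stated assumptions: a fixed design matrix X, a linear ground truth w_true, and independent noise with known variance sigma2. Under those assumptions the ridge estimator has mean (XᵀX + λI)⁻¹XᵀX·w_true and covariance σ²(XᵀX + λI)⁻¹XᵀX(XᵀX + λI)⁻¹, so squared bias and total variance of the weight estimate can be computed exactly. All names and data here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))                 # fixed design (hypothetical data)
w_true = np.array([1.5, -2.0, 0.5, 1.0])     # assumed true weights
sigma2 = 1.0                                 # assumed noise variance

def ridge_bias_variance(X, w_true, sigma2, lam):
    """Squared bias and total variance of the ridge weight estimate."""
    G = X.T @ X
    A_inv = np.linalg.inv(G + lam * np.eye(X.shape[1]))
    expected_w = A_inv @ G @ w_true          # E[w_hat]
    bias_sq = float(np.sum((expected_w - w_true) ** 2))
    cov = sigma2 * A_inv @ G @ A_inv         # Cov(w_hat)
    return bias_sq, float(np.trace(cov))

lambdas = [0.0, 1.0, 10.0, 100.0]
results = [ridge_bias_variance(X, w_true, sigma2, lam) for lam in lambdas]

for lam, (bias_sq, variance) in zip(lambdas, results):
    print(f"lambda={lam:6.1f}  bias^2={bias_sq:.4f}  variance={variance:.4f}")
```

Bias and variance are measured on the weight vector here rather than on predictions, but the direction is the same: squared bias rises and variance falls as λ increases.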
Part B — a regularized loss (clean numbers). Take the ridge objective
J = (squared error on the data) + λ · ‖w‖²
with weights w = (3, −1), a data-fit squared error of 4, and λ = 0.5:
‖w‖² = 3² + (−1)² = 9 + 1 = 10
penalty = λ · ‖w‖² = 0.5 · 10 = 5
J = data error + penalty = 4 + 5 = 9
So the total regularized objective is J = 9. Note the penalty (5) is larger
than the data error (4) here — with λ = 0.5 and weights this size, the model is
under real pressure to shrink them.
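The substitution in Part B can be reproduced in a few lines. This is only a sketch of the arithmetic; the helper name regularized_loss is a hypothetical choice, and the data-fit error is passed in as a number, as in the question.

```python
def regularized_loss(data_error, weights, lam):
    """Return data_error + lam * ||w||^2 (ridge-style objective)."""
    l2_squared = sum(wj ** 2 for wj in weights)   # 3^2 + (-1)^2 = 10
    return data_error + lam * l2_squared

w = (3, -1)
J = regularized_loss(data_error=4, weights=w, lam=0.5)
print(J)  # 9.0
```

For a question that uses a different data term, such as MAE, only the data_error value passed in changes; the penalty is computed the same way.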
Quick check
Practice this in an interview
L1 adds the sum of absolute coefficient values to the loss, which drives some coefficients to exactly zero and performs implicit feature selection. L2 adds the sum of squared coefficients, which shrinks all weights toward zero but rarely zeroes any out. Lasso is preferred when you suspect only a few features matter; Ridge is preferred when most features contribute small effects.
Ridge regression is equivalent to maximum a posteriori (MAP) estimation with a zero-mean Gaussian prior on the coefficients. The regularization strength λ corresponds to the ratio of the noise variance to the prior variance — stronger regularization means you believe coefficients are drawn from a tighter distribution around zero.
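The MAP connection can be written out in three lines. In this sketch, σ² (noise variance) and τ² (prior variance) are symbols introduced here for illustration, and the model is assumed to be linear with independent Gaussian noise.

```latex
\begin{aligned}
p(y \mid X, w) &= \mathcal{N}(y \mid Xw,\ \sigma^2 I), \qquad
p(w) = \mathcal{N}(w \mid 0,\ \tau^2 I) \\
-\log p(w \mid X, y) &= \frac{1}{2\sigma^2}\lVert y - Xw \rVert^2
  + \frac{1}{2\tau^2}\lVert w \rVert^2 + \text{const} \\
\hat{w}_{\text{MAP}} &= \arg\min_w \ \lVert y - Xw \rVert^2
  + \frac{\sigma^2}{\tau^2}\lVert w \rVert^2
\quad\Longrightarrow\quad \lambda = \frac{\sigma^2}{\tau^2}
\end{aligned}
```

Multiplying the negative log posterior by 2σ² does not change the minimiser, which is how the ratio σ²/τ² ends up in the position of λ.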
L2 regularisation adds a penalty equal to the sum of squared weights multiplied by a coefficient λ to the loss function, which encourages the optimiser to keep weights small. This penalises large, specialised weights and pushes the model toward simpler solutions that generalise better. In SGD it is equivalent to shrinking weights by a constant factor each step (hence weight decay), though in Adam the two diverge — requiring AdamW for correct decoupled decay.
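The weight-decay equivalence for plain SGD is a one-line algebraic identity, sketched below. The penalty is taken as λ·‖w‖², so its gradient is 2λw and the per-step decay factor is (1 − 2·lr·λ); the random vector grad_data is a stand-in for the data-loss gradient, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=5)
grad_data = rng.normal(size=5)   # stand-in for the gradient of the data loss at w
lr, lam = 0.1, 0.01

# (a) One SGD step on data_loss + lam * ||w||^2 (penalty gradient is 2 * lam * w).
w_l2 = w - lr * (grad_data + 2 * lam * w)

# (b) Shrink the weights by a constant factor, then take a plain SGD step.
w_decay = (1 - 2 * lr * lam) * w - lr * grad_data

print(np.allclose(w_l2, w_decay))  # True
```

The identity relies on the update being a plain gradient step; with an adaptive optimiser such as Adam the penalty gradient is rescaled along with the data gradient, which is why the two forms no longer match there.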
When predictors are highly correlated, Lasso tends to arbitrarily pick one and discard the others, producing unstable feature selection. Ridge retains all correlated features but cannot zero any out. ElasticNet combines both penalties to achieve stable, sparse solutions — it groups correlated features and can shrink the whole group together.
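The ridge half of this answer can be illustrated with an extreme case: two identical feature columns. The sketch below uses the ridge closed form on hypothetical data and shows only the ridge behaviour; it does not demonstrate lasso or ElasticNet.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)
X = np.column_stack([x, x])                      # two perfectly correlated features
y = 2.0 * x + rng.normal(scale=0.1, size=50)     # assumed true effect of 2 on x

lam = 1.0
# Ridge closed form: (X^T X + lam * I)^-1 X^T y. The lam * I term makes the
# system solvable even though X^T X alone is singular here.
w = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)

print("weights:", w, "sum:", w.sum())
```

Because the objective is symmetric in the two columns and has a unique minimiser, ridge splits the effect equally between them, keeping both features with a combined weight close to 2.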