datarekha

Ridge Regression & Regularization

Ordinary least squares can overfit. Ridge adds an L2 penalty that shrinks the weights, trading a little bias for a lot less variance.

8 min read · Intermediate · GATE DA · Lesson 81 of 122

What you'll learn

  • Why unconstrained least squares overfits, and how a penalty term fixes it
  • Ridge = L2 penalty: minimize Σ(yᵢ − ŷᵢ)² + λ·‖w‖²
  • Ridge is L2, lasso is L1 — the single distinction GATE checks
  • Larger λ shrinks weights toward zero: more bias, less variance
  • Computing a regularized loss to a clean number

Before you start

Ordinary least squares picks the weights that minimise the squared error on the training data, and nothing else. With many features or noisy data, it can fit the noise, producing huge weights whose predictions swing wildly on new data. That is overfitting.

Ridge regression tames it by adding a penalty for large weights to the loss. The model now has to balance two goals: fit the data and keep the weights small. The penalty pulls every weight toward zero, smoothing the fit.

The L2 penalty

Ridge minimises the usual squared error plus a penalty on the size of the weight vector, scaled by a knob λ (lambda):

∑(yᵢ − ŷᵢ)² + λ·‖w‖², where ‖w‖² = ∑ wⱼ²

Here ∑(yᵢ − ŷᵢ)² fits the data, λ is the penalty strength, and ‖w‖² is the L2 penalty (sum of squares).
Objective = data-fit term + λ × (sum of squared weights). The penalty is the L2 norm squared.
  • The first term, Σ(yᵢ − ŷᵢ)², is ordinary least squares — make predictions match.
  • ‖w‖² = Σ wⱼ² is the L2 penalty: the sum of the squares of the weights.
  • λ ≥ 0 controls the trade-off. At λ = 0 you recover plain least squares. As λ grows, the penalty dominates and all weights are pulled toward zero (but, for ridge, never exactly to zero).
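The effect of λ is easiest to see in the simplest possible case. The sketch below assumes a one-feature model with no intercept, y ≈ w·x, for which the ridge objective Σ(yᵢ − w·xᵢ)² + λ·w² has the closed-form minimiser w = Σxᵢyᵢ / (Σxᵢ² + λ). The function name `ridge_slope` and the toy data are illustrative, not part of the lesson.

```python
def ridge_slope(xs, ys, lam):
    """Ridge slope for the one-feature, no-intercept model y ~ w * x.

    Minimises sum((y - w*x)^2) + lam * w^2, whose solution is
    w = sum(x*y) / (sum(x^2) + lam).
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


# Hypothetical toy data, roughly y = 2x with a little noise.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

for lam in [0.0, 1.0, 10.0, 100.0]:
    w = ridge_slope(xs, ys, lam)
    print(f"lambda = {lam:6.1f}  ->  w = {w:.4f}")
```

With λ = 0 the function returns the plain least-squares slope. Each larger λ gives a smaller slope, and the slope stays strictly positive for any finite λ, which matches the "toward zero but never exactly zero" behaviour described in the list.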

A crucial naming point GATE leans on: ridge uses the L2 norm; lasso uses the L1 norm (Σ|wⱼ|, the sum of absolute values). Lasso can drive weights to exactly zero (feature selection); ridge only shrinks them. Don’t swap the two.

As λ rises, every coefficient shrinks. Under L2 (ridge) the shrink is smooth and nothing hits zero; under L1 (lasso) small weights snap to zero.
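The contrast between smooth shrinkage and snapping to zero can be sketched for a single coefficient. This assumes the simplified one-dimensional problem of minimising (w − z)² plus a penalty, where z stands for the unpenalised least-squares weight (the same formulas apply per coefficient when the features are orthonormal). The function names and the values of z are illustrative.

```python
def ridge_shrink(z, lam):
    """Minimiser of (w - z)^2 + lam * w^2: a pure rescaling of z."""
    return z / (1.0 + lam)


def lasso_shrink(z, lam):
    """Minimiser of (w - z)^2 + lam * |w|: soft-thresholding at lam / 2."""
    magnitude = abs(z) - lam / 2.0
    if magnitude <= 0.0:
        return 0.0  # small weights are set exactly to zero
    return magnitude if z > 0 else -magnitude


# Hypothetical unpenalised weights: two sizeable, one small.
z_values = [3.0, -1.0, 0.2]

for lam in [0.0, 0.5, 2.0]:
    ridge = [round(ridge_shrink(z, lam), 3) for z in z_values]
    lasso = [round(lasso_shrink(z, lam), 3) for z in z_values]
    print(f"lambda = {lam}: ridge = {ridge}, lasso = {lasso}")
```

Ridge divides every weight by the same factor, so a nonzero weight stays nonzero. Lasso subtracts a fixed amount from each magnitude, so any weight smaller than λ/2 in absolute value lands exactly on zero.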

How GATE asks this

Two recurring shapes. As an MCQ, it asks the direction of the effect: what happens to bias and variance as you increase λ. GATE DA 2026 posed exactly this (Q37) and the correct statement was that the regularizer increases bias and decreases variance. As a NAT, it gives you weights, a data-fit error, and λ, and asks for the total regularized objective — pure substitution (GATE DA 2026 Q55 did this with an MAE data term).

Worked example

Part A — the direction (GATE DA 2026). As the regularization coefficient λ increases, the weights are squeezed toward zero, so the model becomes simpler and stiffer. A stiffer model varies less from one training set to another (variance decreases) but systematically misses more structure (bias increases). The 2026 answer was exactly this pair: bias ↑, variance ↓.
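The direction of the effect can be checked with a small simulation. The sketch below is an illustration under stated assumptions, not part of the GATE question: a one-feature, no-intercept model, a hypothetical true slope of 2, Gaussian noise, and fixed inputs with fresh noise for each simulated training set. The function names and all parameter values are made up for the example.

```python
import random


def ridge_slope(xs, ys, lam):
    """Closed-form ridge slope for y ~ w * x (one feature, no intercept)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def bias_and_variance(lam, true_w=2.0, noise_sd=1.0, n_trials=2000, seed=0):
    """Estimate |bias| and variance of the ridge slope over many training sets."""
    rng = random.Random(seed)  # same seed, so every lambda sees the same noise
    xs = [0.5, 1.0, 1.5, 2.0, 2.5]  # fixed inputs
    estimates = []
    for _ in range(n_trials):
        ys = [true_w * x + rng.gauss(0.0, noise_sd) for x in xs]
        estimates.append(ridge_slope(xs, ys, lam))
    mean_w = sum(estimates) / n_trials
    variance = sum((w - mean_w) ** 2 for w in estimates) / n_trials
    return abs(mean_w - true_w), variance


for lam in [0.0, 5.0, 50.0]:
    bias, variance = bias_and_variance(lam)
    print(f"lambda = {lam:5.1f}  |bias| = {bias:.3f}  variance = {variance:.4f}")
```

As λ grows, the average estimate moves further from the true slope (bias rises) while the spread of the estimates across training sets narrows (variance falls).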

Part B — a regularized loss (clean numbers). Take the ridge objective

J = (squared error on the data)  +  λ · ‖w‖²

with weights w = (3, −1), a data-fit squared error of 4, and λ = 0.5:

‖w‖²    = 3² + (−1)²        = 9 + 1   = 10
penalty = λ · ‖w‖² = 0.5 · 10         = 5
J       = data error + penalty = 4 + 5 = 9

So the total regularized objective is J = 9. Note the penalty (5) is larger than the data error (4) here — with λ = 0.5 and weights this size, the model is under real pressure to shrink them.
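The same substitution can be written as a few lines of Python. The helper name `ridge_objective` is illustrative; it takes the data-fit error as a given number, exactly as the NAT-style question does.

```python
def ridge_objective(weights, data_error, lam):
    """Total ridge objective: data error plus lam times the sum of squared weights."""
    penalty = lam * sum(w * w for w in weights)
    return data_error + penalty


J = ridge_objective([3, -1], data_error=4, lam=0.5)
print(J)  # prints 9.0
```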

Quick check

Q1. Ridge objective J = (data error) + λ·‖w‖². Weights w = (3, −1), data error = 4, λ = 0.5. What is J? (numerical answer)
Q2. Weights w = (2, 0, −4), data-fit squared error = 6, λ = 0.5. What is the total ridge objective J? (numerical answer)
Q3. As the regularization coefficient λ increases in ridge regression, how do bias and variance change?
Q4. Which statements correctly distinguish ridge from lasso? (select all that apply)
Q5. Which of the following are TRUE about the penalty strength λ? (select all that apply)
Q6. Using the L2 penalty, what is ‖w‖² for the weight vector w = (1, −2, 2)? (numerical answer)

Practice this in an interview

What is the fundamental difference between L1 (Lasso) and L2 (Ridge) regularization, and when do you choose each?

L1 adds the sum of absolute coefficient values to the loss, which drives some coefficients to exactly zero and performs implicit feature selection. L2 adds the sum of squared coefficients, which shrinks all weights toward zero but does not set any of them exactly to zero. Lasso is preferred when you suspect only a few features matter; Ridge is preferred when most features contribute small effects.

What is the Bayesian interpretation of Ridge regression, and what prior does it correspond to?

Ridge regression is equivalent to maximum a posteriori (MAP) estimation with a zero-mean Gaussian prior on the coefficients. The regularization strength λ corresponds to the ratio of the noise variance to the prior variance — stronger regularization means you believe coefficients are drawn from a tighter distribution around zero.
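A quick numerical check of this equivalence, assuming a one-feature, no-intercept model y = w·x + noise with noise variance σ² and a prior w ~ N(0, τ²). In that setting the posterior over w is Gaussian, so the MAP estimate equals the posterior mean. The function names, data, and the values of σ and τ are illustrative.

```python
import math


def ridge_slope(xs, ys, lam):
    """Ridge estimate: minimises sum((y - w*x)^2) + lam * w^2."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def map_slope(xs, ys, noise_sd, prior_sd):
    """MAP estimate under Gaussian noise and a zero-mean Gaussian prior on w."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    # Posterior precision = data precision + prior precision.
    precision = sxx / noise_sd**2 + 1.0 / prior_sd**2
    return (sxy / noise_sd**2) / precision


# Hypothetical data and hyperparameters.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]
noise_sd, prior_sd = 1.0, 2.0

lam = noise_sd**2 / prior_sd**2  # ratio of noise variance to prior variance
print(ridge_slope(xs, ys, lam), map_slope(xs, ys, noise_sd, prior_sd))
print(math.isclose(ridge_slope(xs, ys, lam), map_slope(xs, ys, noise_sd, prior_sd)))
```

Shrinking `prior_sd` (a tighter prior around zero) raises λ, which is the "stronger regularization" reading given in the answer.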

What is L2 regularisation (weight decay), and how does it reduce overfitting?

L2 regularisation adds a penalty equal to the sum of squared weights multiplied by a coefficient λ to the loss function, which encourages the optimiser to keep weights small. This penalises large, specialised weights and pushes the model toward simpler solutions that generalise better. In SGD it is equivalent to shrinking weights by a constant factor each step (hence weight decay), though in Adam the two diverge — requiring AdamW for correct decoupled decay.
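The SGD equivalence can be sketched in a few lines. This assumes plain SGD (no momentum, no adaptive scaling) and the penalty λ·‖w‖² used in this lesson, whose gradient is 2λ·w; with that convention the per-step shrink factor is (1 − 2·lr·λ). The function names and numbers are illustrative.

```python
import math


def sgd_step_l2(w, grad, lr, lam):
    """One SGD step on (data loss + lam * ||w||^2); grad is the data-loss gradient."""
    return [wi - lr * (gi + 2.0 * lam * wi) for wi, gi in zip(w, grad)]


def sgd_step_weight_decay(w, grad, lr, lam):
    """Same step written as: shrink the weights, then apply the data gradient."""
    decay = 1.0 - 2.0 * lr * lam
    return [decay * wi - lr * gi for wi, gi in zip(w, grad)]


# Hypothetical weights and data-loss gradient.
w = [3.0, -1.0]
grad = [0.4, -0.2]

a = sgd_step_l2(w, grad, lr=0.1, lam=0.5)
b = sgd_step_weight_decay(w, grad, lr=0.1, lam=0.5)
print(a, b)
print(all(math.isclose(x, y) for x, y in zip(a, b)))
```

The two updates coincide only because plain SGD applies the gradient unscaled. An adaptive optimiser rescales the whole gradient, penalty term included, which is why the two formulations diverge there.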

What problem does ElasticNet solve that neither Lasso nor Ridge can handle alone?

When predictors are highly correlated, Lasso tends to arbitrarily pick one and discard the others, producing unstable feature selection. Ridge retains all correlated features but cannot zero any out. ElasticNet combines both penalties to achieve stable, sparse solutions — it groups correlated features and can shrink the whole group together.
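As a minimal sketch, the combined objective can be written with one strength per penalty. The parameter names `lam_l1` and `lam_l2` are an assumption made for this example; libraries often parameterise the mix differently.

```python
def elastic_net_objective(weights, data_error, lam_l1, lam_l2):
    """Data error plus an L1 penalty (lasso part) and an L2 penalty (ridge part)."""
    l1 = sum(abs(w) for w in weights)  # sum of absolute values
    l2 = sum(w * w for w in weights)   # sum of squares
    return data_error + lam_l1 * l1 + lam_l2 * l2


weights = [3, -1]
print(elastic_net_objective(weights, data_error=4, lam_l1=0.5, lam_l2=0.5))  # prints 11.0
print(elastic_net_objective(weights, data_error=4, lam_l1=0.0, lam_l2=0.5))  # ridge only: 9.0
print(elastic_net_objective(weights, data_error=4, lam_l1=0.5, lam_l2=0.0))  # lasso only: 6.0
```

Setting either strength to zero recovers the pure ridge or pure lasso objective.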
