What is the fundamental difference between L1 (Lasso) and L2 (Ridge) regularization, and when do you choose each?

L1 adds the sum of absolute coefficient values to the loss, which drives some coefficients to exactly zero and performs implicit feature selection. L2 adds the sum of squared coefficients, which shrinks all weights toward zero (by a common factor when the features are orthonormal) but does not, in general, set any of them exactly to zero. Lasso is preferred when you suspect only a few features matter; Ridge is preferred when most features contribute small effects.
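To make the two penalties concrete, here is a minimal sketch that evaluates both on one made-up weight vector (the vector and the function names are illustrative, not from the lesson). It shows that the squared penalty is dominated by the largest weight and charges almost nothing for small ones, which is part of why L2 has little incentive to push a small weight all the way to zero.

```python
def l1_penalty(w):
    """Sum of absolute values, the lasso penalty."""
    return sum(abs(wj) for wj in w)

def l2_penalty(w):
    """Sum of squares, the ridge penalty."""
    return sum(wj ** 2 for wj in w)

w = [4.0, 0.5, -0.25]  # hypothetical weights

l1 = l1_penalty(w)  # 4 + 0.5 + 0.25 = 4.75
l2 = l2_penalty(w)  # 16 + 0.25 + 0.0625 = 16.3125

# Share of each penalty contributed by the smallest weight (-0.25)
small_share_l1 = abs(w[2]) / l1
small_share_l2 = w[2] ** 2 / l2

print(l1, l2)
print(small_share_l1, small_share_l2)
```

The smallest weight accounts for a visibly larger share of the L1 penalty than of the L2 penalty, so removing it entirely pays off more under L1.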

How do L1 and L2 regularization affect bias and variance, and when would you pick one over the other?

Both L1 and L2 add a penalty on coefficient size that increases bias slightly but reduces variance, combating overfitting. L2 (ridge) shrinks all coefficients smoothly and handles correlated features well; L1 (lasso) drives some coefficients exactly to zero, performing feature selection. Choose L1 when you want sparsity and interpretability, L2 when you want stability, and elastic net to get both.

What is the Bayesian interpretation of Ridge regression, and what prior does it correspond to?

Ridge regression is equivalent to maximum a posteriori (MAP) estimation with a zero-mean Gaussian prior on the coefficients. The regularization strength λ corresponds to the ratio of the noise variance to the prior variance — stronger regularization means you believe coefficients are drawn from a tighter distribution around zero.
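The correspondence can be sketched in a few lines, assuming a linear model with independent Gaussian noise of variance σ² and an independent Gaussian prior of variance τ² on each coefficient (the symbols σ², τ², and X are notation introduced here, not taken from the text above).

```latex
\begin{aligned}
y &= Xw + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2 I), \qquad w \sim \mathcal{N}(0, \tau^2 I) \\
-\log p(w \mid y) &= \frac{1}{2\sigma^2}\,\lVert y - Xw \rVert^2 + \frac{1}{2\tau^2}\,\lVert w \rVert^2 + \text{const} \\
\hat{w}_{\text{MAP}} &= \arg\min_w \; \lVert y - Xw \rVert^2 + \lambda \lVert w \rVert^2, \qquad \lambda = \frac{\sigma^2}{\tau^2}
\end{aligned}
```

Multiplying the negative log-posterior by 2σ² does not move the minimiser, which is where the ratio comes from: a smaller prior variance τ² means a larger λ.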

What is L2 regularisation (weight decay), and how does it reduce overfitting?

L2 regularisation adds a penalty equal to the sum of squared weights multiplied by a coefficient λ to the loss function, which encourages the optimiser to keep weights small. This penalises large, specialised weights and pushes the model toward simpler solutions that generalise better. In SGD it is equivalent to shrinking weights by a constant factor each step (hence weight decay), though in Adam the two diverge — requiring AdamW for correct decoupled decay.
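The SGD equivalence is a one-line rearrangement, and a small numeric check makes it visible. The sketch below assumes plain SGD (no momentum) and the penalty convention λ·Σw², so the penalty's gradient is 2λw and the per-step shrink factor is (1 − 2ηλ). The gradient values, learning rate, and variable names are made up for illustration.

```python
lr = 0.1     # learning rate (eta), illustrative value
lam = 0.05   # regularization strength (lambda), illustrative value

w = [1.5, -2.0, 0.3]          # current weights (hypothetical)
grad_data = [0.4, -0.1, 0.2]  # gradient of the data-fit term only (hypothetical)

# Form 1: L2 regularization. Add the penalty gradient 2*lam*w to the
# data gradient, then take an ordinary SGD step.
w_l2 = [wj - lr * (gj + 2 * lam * wj) for wj, gj in zip(w, grad_data)]

# Form 2: weight decay. Shrink each weight by a constant factor, then
# take an SGD step on the data gradient alone.
decay = 1 - 2 * lr * lam
w_decay = [decay * wj - lr * gj for wj, gj in zip(w, grad_data)]

max_gap = max(abs(a - b) for a, b in zip(w_l2, w_decay))
print(max_gap)  # zero up to floating-point rounding
```

With Adam, the penalty gradient would be rescaled by the adaptive per-parameter step size along with everything else, so the two forms no longer match; AdamW applies the shrink step separately.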

Ridge Regression & Regularization — GATE DA

What you'll learn

Why unconstrained least squares overfits, and how a penalty term fixes it

Ridge = L2 penalty: minimize Σ(yᵢ − ŷᵢ)² + λ·‖w‖²

Ridge is L2, lasso is L1 — the single distinction GATE checks

Larger λ shrinks weights toward zero: more bias, less variance

Computing a regularized loss to a clean number

Last lesson closed on an uncomfortable truth: our optimizers are too good. Both the normal equation and gradient descent drive the training loss to its floor — and at that floor, with many features or noisy data, sit huge, jittery weights that fit the noise and swing wildly on anything new. That is overfitting, and you cannot optimize your way out of it, because the trap is the very bottom of the bowl you are descending.

So we change the bowl. Instead of asking the model only to fit the data, we add a second clause to its objective: and keep your weights small. The model now has to balance two pulls — match the data, yet stay modest — and the modesty clause tugs every weight toward zero, smoothing the fit. That added clause is regularization, and its most common form is ridge regression.

The L2 penalty

Ridge minimises the usual squared error plus a penalty on the size of the weight vector, scaled by a knob λ (lambda):

Objective = data-fit term + λ × (sum of squared weights). The penalty is the L2 norm squared.

The first term, Σ(yᵢ − ŷᵢ)², is ordinary least squares — make the predictions match.
‖w‖² = Σ wⱼ² is the L2 penalty: the sum of the squares of the weights.
λ ≥ 0 controls the trade-off. At λ = 0 you recover plain least squares; as λ grows, the penalty dominates and every weight is pulled toward zero (though, for ridge, never exactly to zero).
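For a single feature and no intercept, the ridge minimiser has a closed form that shows the shrinkage directly: setting the derivative of Σ(yᵢ − w·xᵢ)² + λ·w² to zero gives w = Σxᵢyᵢ / (Σxᵢ² + λ). The sketch below uses a small made-up dataset; the function name and the data are illustrative.

```python
def ridge_weight_1d(x, y, lam):
    """Minimiser of sum((y_i - w*x_i)**2) + lam * w**2 for one feature, no intercept."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + lam)

# Hypothetical data lying roughly on y = 2x
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]

lambdas = [0.0, 1.0, 10.0, 100.0]
weights = [ridge_weight_1d(x, y, lam) for lam in lambdas]

for lam, w in zip(lambdas, weights):
    print(f"lambda = {lam:6.1f}  ->  w = {w:.4f}")
```

At λ = 0 the formula reduces to the ordinary least-squares slope; each larger λ inflates the denominator, so the weight moves toward zero but stays non-zero for any finite λ.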

One naming point GATE leans on hard: ridge uses the L2 norm; lasso uses the L1 norm (Σ|wⱼ|, the sum of absolute values). Lasso can drive weights to exactly zero, performing feature selection; ridge only shrinks them. Do not swap the two.

Slide λ up and watch every coefficient shrink. Toggle between L2 (ridge — a smooth shrink, nothing reaching zero) and L1 (lasso — small weights snapping to zero):

[Interactive demo: Regularization path. L1 zeroes out features; L2 just shrinks. A slider sets λ (regularization strength) from 0.01 to 31.6. In the captured state, λ = 0.100 and 5 of 8 features are non-zero (lasso zeroed 3).]
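As a static stand-in for the slider, the contrast can be reproduced in the simplest setting: an orthonormal design, where each coefficient can be solved on its own. Under that assumption, and with the per-coefficient objectives ½(z − w)² + λ|w| for lasso and ½(z − w)² + (λ/2)w² for ridge (z being the unregularized least-squares coefficient), lasso gives the soft-thresholding rule and ridge gives a uniform rescaling. The eight coefficients below are made up and are not the ones behind the widget.

```python
def lasso_coef(z, lam):
    """Soft-thresholding: minimiser of 0.5*(z - w)**2 + lam*abs(w)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def ridge_coef(z, lam):
    """Minimiser of 0.5*(z - w)**2 + 0.5*lam*w**2."""
    return z / (1.0 + lam)

# Hypothetical unregularized coefficients for 8 features
z = [3.0, -2.0, 1.2, 0.8, -0.5, 0.3, -0.2, 0.05]

for lam in [0.0, 0.1, 0.4, 1.0]:
    lasso = [lasso_coef(zj, lam) for zj in z]
    ridge = [ridge_coef(zj, lam) for zj in z]
    lasso_nonzero = sum(1 for c in lasso if c != 0.0)
    ridge_nonzero = sum(1 for c in ridge if c != 0.0)
    print(f"lambda = {lam:.1f}: lasso keeps {lasso_nonzero}/8, ridge keeps {ridge_nonzero}/8")
```

Lasso subtracts a fixed amount from every coefficient's magnitude, so anything smaller than λ lands at exactly zero; ridge divides by a common factor, so nothing does.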

How GATE asks this

Two recurring shapes. As an MCQ, it asks the direction of the effect: what happens to bias and variance as you raise λ. GATE DA 2026 posed exactly this (Q37), and the correct statement was that the regularizer increases bias and decreases variance. As a NAT, it gives you weights, a data-fit error, and λ, and asks for the total regularized objective — pure substitution (GATE DA 2026 Q55 did this with an MAE data term).

Worked example

Part A — the direction (GATE DA 2026). As the regularization coefficient λ rises, the weights are squeezed toward zero, so the model becomes simpler and stiffer. A stiffer model varies less from one training set to the next (variance decreases), but systematically misses more structure (bias increases). The 2026 answer was exactly that pair: bias ↑, variance ↓.
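The direction can be checked exactly in the one-feature, no-intercept case. Assuming fixed inputs xᵢ, a true model yᵢ = w*·xᵢ + noise, and independent zero-mean noise of variance σ², the ridge estimate ŵ = Σxᵢyᵢ / (Σxᵢ² + λ) has bias −w*·λ / (Σxᵢ² + λ) and variance σ²·Σxᵢ² / (Σxᵢ² + λ)². The numbers below (w* = 2, σ² = 1, four inputs) are made up for illustration.

```python
def ridge_bias_variance(x, w_true, noise_var, lam):
    """Exact bias and variance of the 1-D ridge estimate (fixed x, no intercept)."""
    sxx = sum(xi * xi for xi in x)
    bias = -w_true * lam / (sxx + lam)
    variance = noise_var * sxx / (sxx + lam) ** 2
    return bias, variance

x = [1.0, 2.0, 3.0, 4.0]   # hypothetical fixed inputs, sum of squares = 30
w_true = 2.0               # hypothetical true weight
noise_var = 1.0            # hypothetical noise variance

lambdas = [0.0, 1.0, 10.0, 100.0]
results = [ridge_bias_variance(x, w_true, noise_var, lam) for lam in lambdas]

for lam, (bias, variance) in zip(lambdas, results):
    print(f"lambda = {lam:6.1f}  |bias| = {abs(bias):.4f}  variance = {variance:.5f}")
```

At λ = 0 the estimate is unbiased; every increase in λ enlarges the magnitude of the bias and reduces the variance, matching the bias ↑, variance ↓ pair.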

Part B — a regularized loss (clean numbers). Take the ridge objective J = (squared error on the data) + λ · ‖w‖² with weights w = (3, −1), a data-fit squared error of 4, and λ = 0.5:

‖w‖²    = 3² + (−1)²        = 9 + 1   = 10
penalty = λ · ‖w‖² = 0.5 · 10         = 5
J       = data error + penalty = 4 + 5 = 9

The same steps in Python:

w          = [3, -1]
data_error = 4
lam        = 0.5

l2_norm_sq = sum(wi**2 for wi in w)   # 9 + 1 = 10
penalty    = lam * l2_norm_sq          # 0.5 * 10 = 5
J          = data_error + penalty      # 4 + 5 = 9

print(f"||w||^2 = {l2_norm_sq}")
print(f"penalty = {penalty}")
print(f"J       = {J}")

Output:

||w||^2 = 10
penalty = 5.0
J       = 9.0

So the total regularized objective is J = 9. Notice the penalty (5) is larger than the data error (4): with λ = 0.5 and weights this size, the model really is under pressure to shrink them.

In one breath

Ordinary least squares can overfit by driving the weights huge to chase noise, so ridge regression adds an L2 penalty to the objective — minimise Σ(yᵢ − ŷᵢ)² + λ·‖w‖², where ‖w‖² = Σwⱼ² and λ ≥ 0 sets how hard the weights are pulled toward zero (λ = 0 is plain OLS); raising λ shrinks the weights, lowering variance at the cost of higher bias, and the one naming trap is that ridge is L2 while lasso is L1 (Σ|wⱼ|), the version that can zero weights out entirely.

Practice

Quick check

Q1 (Recall): Which statements correctly distinguish ridge from lasso? (select all that apply)

Q2 (Recall): Which of the following are TRUE about the penalty strength λ? (select all that apply)

Q3 (Apply): As the regularization coefficient λ increases in ridge regression, how do bias and variance change?

Q4 (Trace): Using the L2 penalty, what is ‖w‖² for the weight vector w = (1, −2, 2)? (numerical answer)

Q5 (Trace): Ridge objective J = (data error) + λ·‖w‖². Weights w = (3, −1), data error = 4, λ = 0.5. What is J? (numerical answer)

Q6 (Trace): Weights w = (2, 0, −4), data-fit squared error = 6, λ = 0.5. What is the total ridge objective J? (numerical answer)

A question to carry forward

Ridge gave us one precise dial: turn λ up, and variance falls while bias rises. We have been saying those two words as if their meaning were settled — but they are the deepest pair in all of modelling, and ridge only nudges one instance of a tug that every model feels.

Why can you never just drive both to zero? A model simple enough to be steady across datasets (low variance) is too rigid to capture the truth (high bias); a model flexible enough to capture every wrinkle (low bias) reshapes itself wildly from one sample to the next (high variance). Here is the thread onward: what exactly are bias and variance, how do they split the total error of any model, and where does their sum bottom out — the sweet spot every technique in this chapter is secretly chasing?
