Machine Learning Hard Asked at GoogleAsked at DeepMindAsked at Two Sigma

What is the Bayesian interpretation of Ridge regression, and what prior does it correspond to?

For Data Scientist ML Engineer AI / LLM Engineer

The short answer

Ridge regression is equivalent to maximum a posteriori (MAP) estimation with a zero-mean Gaussian prior on the coefficients. The regularization strength λ corresponds to the ratio of the noise variance to the prior variance — stronger regularization means you believe coefficients are drawn from a tighter distribution around zero.

How to think about it

MAP estimation setup:

Bayes’ theorem: P(β | X, y) ∝ P(y | X, β) * P(β)

Assume:

Likelihood: y | X, β ~ N(Xβ, σ²I) (Gaussian noise)
Prior: β ~ N(0, τ²I) (zero-mean Gaussian, isotropic)

Taking the negative log of the posterior:

-log P(β | X, y) ∝ ||y - Xβ||² / σ² + ||β||² / τ²

This is exactly Ridge regression with λ = σ² / τ².

What each prior corresponds to:

Regularizer	Bayesian Prior	Distribution Shape
Ridge (L2)	Gaussian N(0, τ²)	Smooth, exponential decay
Lasso (L1)	Laplace (double-exponential)	Heavy tails, sharp peak at 0 → sparsity
No regularization	Flat (improper uniform)	No preference

The Laplace prior has a sharp spike at zero, which gives L1 its tendency to produce exactly-zero coefficients. The Gaussian prior is smooth at zero, so it shrinks but does not zero.

Implications for practice:

λ → ∞: prior dominates; all β → 0 (heavy regularization, strong prior belief in small effects).
λ → 0: likelihood dominates; MAP → MLE → OLS (flat prior, data speaks entirely).
The ratio σ²/τ² encodes your belief about signal-to-noise: low noise or wide prior → small λ.

from sklearn.linear_model import BayesianRidge

# Full Bayesian treatment — infers alpha (noise precision) and lambda (weight precision)
bayes_ridge = BayesianRidge()
bayes_ridge.fit(X_train, y_train)
print("Estimated alpha (noise):", bayes_ridge.alpha_)
print("Estimated lambda (weight):", bayes_ridge.lambda_)

Learn it properly L1, L2, Elastic Net

What is the Bayesian interpretation of Ridge regression, and what prior does it correspond to?

Keep practising

Explore further