What is the Bayesian interpretation of Ridge regression, and what prior does it correspond to?
Ridge regression is equivalent to maximum a posteriori (MAP) estimation with a zero-mean Gaussian prior on the coefficients. The regularization strength λ corresponds to the ratio of the noise variance to the prior variance. For a fixed noise level, stronger regularization means you believe the coefficients are drawn from a tighter distribution around zero.
How to think about it
MAP estimation setup:
Bayes’ theorem: P(β | X, y) ∝ P(y | X, β) * P(β)
Assume:
- Likelihood: y | X, β ~ N(Xβ, σ²I) (Gaussian noise)
- Prior: β ~ N(0, τ²I) (zero-mean Gaussian, isotropic)
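Taking the negative log of the posterior and collecting everything that does not depend on β into a constant:
-log P(β | X, y) = ||y - Xβ||² / (2σ²) + ||β||² / (2τ²) + const
Multiplying by 2σ² does not change the minimizer and gives ||y - Xβ||² + (σ² / τ²) ||β||², which is exactly the Ridge objective with λ = σ² / τ².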
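The equivalence can be checked numerically. The sketch below is a minimal illustration on assumed synthetic data: the sizes, the values of σ² and τ², and the helper names `ridge_closed_form` and `gaussian_posterior_mean` are all hypothetical choices, not part of the original answer. It compares the closed-form Ridge solution at λ = σ²/τ² with the mean of the Gaussian posterior, which for a Gaussian posterior is also its mode (the MAP estimate).

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy problem: 50 samples, 3 features, known noise and prior scales.
n, p = 50, 3
sigma2 = 0.5   # noise variance σ² (illustrative value)
tau2 = 2.0     # prior variance τ² (illustrative value)
X = rng.normal(size=(n, p))
beta_true = rng.normal(scale=np.sqrt(tau2), size=p)
y = X @ beta_true + rng.normal(scale=np.sqrt(sigma2), size=n)


def ridge_closed_form(X, y, lam):
    """Minimizer of ||y - Xβ||² + lam * ||β||² (no intercept)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)


def gaussian_posterior_mean(X, y, sigma2, tau2):
    """Mean (and mode) of the posterior for y ~ N(Xβ, σ²I), β ~ N(0, τ²I)."""
    p = X.shape[1]
    precision = X.T @ X / sigma2 + np.eye(p) / tau2
    return np.linalg.solve(precision, X.T @ y / sigma2)


lam = sigma2 / tau2
beta_ridge = ridge_closed_form(X, y, lam)
beta_map = gaussian_posterior_mean(X, y, sigma2, tau2)

print("lambda:", lam)
print("ridge :", beta_ridge)
print("MAP   :", beta_map)
print("equal :", np.allclose(beta_ridge, beta_map))
```

Both functions solve the same linear system up to a factor of σ², so the two estimates should agree to floating-point precision. No intercept is fitted here, to keep the comparison with the formulas direct.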
What each prior corresponds to:
| Regularizer | Bayesian Prior | Distribution Shape |
|---|---|---|
| Ridge (L2) | Gaussian N(0, τ²) | Smooth at 0, light (squared-exponential) tails |
| Lasso (L1) | Laplace (double-exponential) | Heavy tails, sharp peak at 0 → sparsity |
| No regularization | Flat (improper uniform) | No preference |
The Laplace prior has a sharp spike at zero, which gives L1 its tendency to produce exactly-zero coefficients. The Gaussian prior is smooth at zero, so it shrinks but does not zero.
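The difference between shrinking and zeroing is easiest to see in one dimension. The sketch below assumes a single coefficient with unregularized estimate `z` and the penalty scalings written in the docstrings; both the scalings and the sample values are illustrative assumptions. Under those assumptions the L2 penalty rescales `z`, while the L1 penalty applies soft thresholding.

```python
import numpy as np


def ridge_shrink(z, lam):
    """Minimizer over b of 0.5*(z - b)**2 + 0.5*lam*b**2: proportional shrinkage."""
    return z / (1.0 + lam)


def lasso_shrink(z, lam):
    """Minimizer over b of 0.5*(z - b)**2 + lam*abs(b): soft thresholding."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)


# Hypothetical unregularized estimates, two of them smaller than lam in magnitude.
z = np.array([-3.0, -0.5, 0.2, 2.0])
lam = 1.0

print("ridge:", ridge_shrink(z, lam))
print("lasso:", lasso_shrink(z, lam))
```

Every Ridge output is a scaled copy of its input and stays nonzero, while the Lasso sets any estimate with |z| ≤ λ to exactly zero and moves the rest toward zero by λ.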
Implications for practice:
- λ → ∞: prior dominates; all β → 0 (heavy regularization, strong prior belief in small effects).
- λ → 0: likelihood dominates; MAP → MLE, which is OLS under Gaussian noise (flat prior, data speaks entirely).
- The ratio σ²/τ² encodes your belief about signal-to-noise: low noise or wide prior → small λ.
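The limiting cases above can be illustrated with the closed-form Ridge estimate. This is a rough sketch on assumed synthetic data; the design matrix, the coefficients, the grid of λ values, and the helper name `ridge_solution` are hypothetical. It prints the norm of the estimate and its distance to the OLS fit as λ grows.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed synthetic data: 100 samples, 4 features, small Gaussian noise.
X = rng.normal(size=(100, 4))
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.3, size=100)


def ridge_solution(X, y, lam):
    """Closed-form ridge estimate without intercept: solves (XᵀX + lam·I) β = Xᵀy."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)


# Ordinary least squares as the reference point for lam -> 0.
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

for lam in [1e-8, 1.0, 1e2, 1e8]:
    beta = ridge_solution(X, y, lam)
    print(f"lambda={lam:g}  ||beta||={np.linalg.norm(beta):.6f}  "
          f"distance to OLS={np.linalg.norm(beta - beta_ols):.6f}")
```

With a tiny λ the estimate is numerically indistinguishable from OLS, and with a very large λ it collapses toward the zero vector, matching the two bullets on prior versus likelihood dominance.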
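scikit-learn's BayesianRidge goes one step further and estimates both precisions from the data instead of fixing λ by hand:
from sklearn.datasets import make_regression
from sklearn.linear_model import BayesianRidge

# Synthetic stand-in data so the snippet runs on its own; substitute your own X_train, y_train
X_train, y_train = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Full Bayesian treatment: infers alpha (noise precision, 1/σ²) and lambda (weight precision, 1/τ²)
bayes_ridge = BayesianRidge()
bayes_ridge.fit(X_train, y_train)
print("Estimated alpha (noise precision):", bayes_ridge.alpha_)
print("Estimated lambda (weight precision):", bayes_ridge.lambda_)
print("Implied ridge λ = σ²/τ²:", bayes_ridge.lambda_ / bayes_ridge.alpha_)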