How do you select the regularization strength λ, and what does it mean to set it too high or too low?
λ is a bias-variance trade-off knob: too low leaves the model overfit (high variance); too high over-regularizes and underfits (high bias). The standard approach is k-fold cross-validation over a logarithmic grid of λ values, minimizing held-out loss.
How to think about it
Effect of λ on bias and variance:
- λ = 0: no regularization → pure OLS; low bias, high variance; prone to overfitting in high dimensions.
- λ → ∞: all coefficients → 0; with an unpenalized intercept the model predicts the training mean everywhere; high bias, near-zero variance.
- Optimal λ: balances the two; found empirically via cross-validation.
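The two extremes above can be checked numerically. The following is a minimal sketch using the closed-form ridge solution on synthetic data; the data, the helper name `ridge_coef`, and the specific λ values are illustrative assumptions, not part of the original answer.

```python
import numpy as np

# Synthetic data (an assumption, for illustration only)
rng = np.random.default_rng(0)
n_samples, n_features = 50, 10
X = rng.normal(size=(n_samples, n_features))
true_coef = rng.normal(size=n_features)
y = X @ true_coef + rng.normal(scale=0.5, size=n_samples)

# Center the data so the intercept (the mean of y) is left unpenalized
X_centered = X - X.mean(axis=0)
y_mean = y.mean()
y_centered = y - y_mean

def ridge_coef(lam):
    """Closed-form ridge solution (X^T X + lam * I)^{-1} X^T y on centered data."""
    gram = X_centered.T @ X_centered
    return np.linalg.solve(gram + lam * np.eye(n_features), X_centered.T @ y_centered)

# The coefficient norm shrinks as lambda grows; lambda = 0 is plain OLS
coef_norms = {}
for lam in [0.0, 1.0, 100.0, 1e8]:
    coef_norms[lam] = np.linalg.norm(ridge_coef(lam))
    print(f"lambda={lam:g}  ||coef||={coef_norms[lam]:.6f}")

# At a huge lambda the fitted values collapse to the training mean of y
predictions_at_large_lambda = X_centered @ ridge_coef(1e8) + y_mean
```

The printed norms fall steadily with λ, and the predictions at λ = 1e8 are practically indistinguishable from the training mean.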
Cross-validation approach:
Evaluate models across a log-spaced grid of λ. Use k-fold (typically k=5 or 10) to estimate generalization loss at each value.
from sklearn.linear_model import RidgeCV, LassoCV
import numpy as np
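# X_train, y_train are assumed to be defined already, with features standardized
# (the penalty is sensitive to feature scale). scikit-learn calls lambda "alpha".
alphas = np.logspace(-4, 4, 100) # 100 log-spaced values from 0.0001 to 10000
# RidgeCV uses efficient leave-one-out CV only when cv=None (the default);
# passing cv=5 switches it to ordinary 5-fold CV
ridge_cv = RidgeCV(alphas=alphas, cv=5)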
ridge_cv.fit(X_train, y_train)
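print("Best ridge lambda:", ridge_cv.alpha_)
# LassoCV fits the regularization path by coordinate descent, warm-starting each alpha.
# scikit-learn scales the Lasso loss by 1/(2n), so its alpha is not on Ridge's scale.
lasso_cv = LassoCV(alphas=alphas, cv=5, max_iter=5000)
lasso_cv.fit(X_train, y_train)
print("Best lasso lambda:", lasso_cv.alpha_)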
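To make explicit what the CV estimators automate, here is a rough sketch of the same search written by hand for ridge. The synthetic data from `make_regression`, the 20-point grid, and the shuffled `KFold` split are assumptions chosen to keep the example self-contained; they are not the internals of `RidgeCV`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

# Synthetic stand-in data (an assumption; substitute your own X_train, y_train)
X_train, y_train = make_regression(
    n_samples=200, n_features=30, n_informative=5, noise=10.0, random_state=0
)

alphas = np.logspace(-4, 4, 20)  # log-spaced grid of candidate lambdas
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

mean_cv_mse = []
for alpha in alphas:
    fold_mse = []
    for train_idx, val_idx in kfold.split(X_train):
        # Fit on k-1 folds, score on the held-out fold
        model = Ridge(alpha=alpha).fit(X_train[train_idx], y_train[train_idx])
        residuals = y_train[val_idx] - model.predict(X_train[val_idx])
        fold_mse.append(np.mean(residuals ** 2))
    mean_cv_mse.append(np.mean(fold_mse))

mean_cv_mse = np.array(mean_cv_mse)
best_alpha = alphas[np.argmin(mean_cv_mse)]
print("Best lambda by manual 5-fold CV:", best_alpha)
```

Plotting `mean_cv_mse` against `alphas` on a log axis usually gives a U-shaped curve: held-out error rises toward both ends of the grid.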
The regularization path:
As λ increases from 0:
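- Ridge: coefficients shrink smoothly toward zero but reach exactly zero only in the limit λ → ∞; with correlated features, individual coefficients can change rank or even sign along the way.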
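- Lasso: coefficients hit exactly zero one after another; features that survive to larger λ tend to be the stronger predictors, though with correlated features the drop-out order is unstable and is a rough guide rather than a definitive ranking of importance.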
from sklearn.linear_model import lasso_path
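# lasso_path fits no intercept, so center X_train and y_train beforehand
path_alphas, coefs, _ = lasso_path(X_train, y_train)
# path_alphas is sorted from largest to smallest; coefs has shape (n_features, n_alphas),
# so reading its columns right to left shows which features survive as lambda grows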
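As an illustration of reading the path, the sketch below records the largest λ at which each feature has a nonzero coefficient. The synthetic data and the variable names (`entry_alpha`, `first_feature`) are assumptions for the example; on real, correlated data the ordering is less clean.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import lasso_path

# Synthetic data with 3 informative features out of 10 (an assumption)
X, y, true_coef = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=5.0,
    coef=True, random_state=0,
)

# lasso_path does not fit an intercept, so center the data first
X_centered = X - X.mean(axis=0)
y_centered = y - y.mean()

# path_alphas runs from largest to smallest; path_coefs is (n_features, n_alphas)
path_alphas, path_coefs, _ = lasso_path(X_centered, y_centered)

# For each feature, the first (largest) alpha at which its coefficient is nonzero
active = path_coefs != 0
entry_alpha = np.where(
    active.any(axis=1), path_alphas[active.argmax(axis=1)], np.nan
)

order = np.argsort(-entry_alpha)  # features never active (NaN) sort last
for j in order:
    print(f"feature {j}: enters at alpha={entry_alpha[j]:.3f}, true coef={true_coef[j]:.1f}")

first_feature = int(order[0])
```

Read in the direction of growing λ, the same numbers give the drop-out order: the feature that enters at the largest λ is the last to be zeroed out.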
1-standard-error rule: rather than selecting the λ with minimum CV error, many practitioners choose the largest λ whose mean CV error is within one standard error (estimated across folds) of that minimum. This prefers a simpler, more regularized model when the difference in performance is within the noise of the CV estimate.
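scikit-learn does not apply this rule itself, but it can be computed from the per-fold errors that `LassoCV` stores in `mse_path_`. The sketch below is one simple way to do it, assuming synthetic data and a plain "largest λ under the threshold" selection; the name `alpha_1se` is illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

# Synthetic stand-in data (an assumption; substitute your own X_train, y_train)
X_train, y_train = make_regression(
    n_samples=200, n_features=30, n_informative=5, noise=10.0, random_state=0
)

n_folds = 5
lasso_cv = LassoCV(alphas=np.logspace(-3, 3, 50), cv=n_folds, max_iter=5000)
lasso_cv.fit(X_train, y_train)

# mse_path_ has shape (n_alphas, n_folds), aligned with lasso_cv.alphas_
mean_mse = lasso_cv.mse_path_.mean(axis=1)
se_mse = lasso_cv.mse_path_.std(axis=1, ddof=1) / np.sqrt(n_folds)

# Threshold: minimum mean CV error plus one standard error at that minimum
best_idx = np.argmin(mean_mse)
threshold = mean_mse[best_idx] + se_mse[best_idx]

# Largest lambda whose mean CV error stays at or below the threshold
alpha_1se = lasso_cv.alphas_[mean_mse <= threshold].max()

print("Minimum-error lambda:", lasso_cv.alpha_)
print("1-SE lambda:         ", alpha_1se)
```

Refitting `Lasso(alpha=alpha_1se)` on the training data then typically gives a sparser model than the minimum-error choice, at a cost in CV error that is within the noise of the estimate.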