datarekha
Machine Learning · Medium · Asked at Amazon, Apple, Spotify

How do you select the regularization strength λ, and what does it mean to set it too high or too low?

The short answer

λ is a bias-variance trade-off knob: too low leaves the model overfit (high variance); too high over-regularizes and underfits (high bias). The standard approach is k-fold cross-validation over a logarithmic grid of λ values, minimizing held-out loss.

How to think about it

Effect of λ on bias and variance:

  • λ = 0: no regularization → pure OLS; low bias, high variance; prone to overfitting in high dimensions.
  • λ → ∞: all coefficients → 0; the model predicts the training mean everywhere (the intercept is not penalized); high bias, near-zero variance.
  • Optimal λ: balances the two; found empirically via cross-validation.
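The trade-off above can be seen on a small experiment. This is a minimal sketch on synthetic data; the dataset (80 samples, 60 features), the split, and the chosen λ values are assumptions made for illustration, not part of the original answer. scikit-learn calls λ `alpha`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data (assumed for illustration): more features than training samples
X, y = make_regression(n_samples=80, n_features=60, n_informative=10,
                       noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

sweep_alphas = [1e-4, 1e-2, 1.0, 1e2, 1e4, 1e6]
train_mse, test_mse, coef_norms = [], [], []
for alpha in sweep_alphas:
    model = Ridge(alpha=alpha).fit(X_tr, y_tr)
    train_mse.append(mean_squared_error(y_tr, model.predict(X_tr)))
    test_mse.append(mean_squared_error(y_te, model.predict(X_te)))
    coef_norms.append(np.linalg.norm(model.coef_))

for alpha, tr, te, norm in zip(sweep_alphas, train_mse, test_mse, coef_norms):
    print(f"alpha={alpha:>9g}  train MSE={tr:12.2f}  test MSE={te:12.2f}  ||coef||={norm:10.3f}")
```

Training error rises and the coefficient norm falls as λ grows. Held-out error is typically lowest somewhere in between, which is the value cross-validation tries to locate.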

Cross-validation approach:

Evaluate models across a log-spaced grid of λ. Use k-fold (typically k=5 or 10) to estimate generalization loss at each value.

from sklearn.linear_model import RidgeCV, LassoCV
import numpy as np

# X_train, y_train are assumed to be defined already; scikit-learn calls lambda "alpha"
alphas = np.logspace(-4, 4, 100)  # 100 values from 0.0001 to 10000

# With cv=None (the default) RidgeCV uses efficient leave-one-out CV;
# passing cv=5 switches it to 5-fold CV
ridge_cv = RidgeCV(alphas=alphas, cv=5)
ridge_cv.fit(X_train, y_train)
print("Best lambda (ridge):", ridge_cv.alpha_)

# LassoCV fits by coordinate descent along the regularization path in each fold
lasso_cv = LassoCV(alphas=alphas, cv=5, max_iter=5000)
lasso_cv.fit(X_train, y_train)
print("Best lambda (lasso):", lasso_cv.alpha_)
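To make explicit what the CV estimators do, the same search can be written as a manual loop. This is a sketch under assumptions: the synthetic dataset, the 20-point grid, and the shuffled `KFold` splitter are illustrative choices, not the original author's setup.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data, assumed for illustration only
X, y = make_regression(n_samples=200, n_features=30, noise=15.0, random_state=0)

grid = np.logspace(-4, 4, 20)                      # log-spaced lambda grid
cv = KFold(n_splits=5, shuffle=True, random_state=0)

mean_mse = []
for alpha in grid:
    # cross_val_score returns negated MSE, so flip the sign back
    scores = cross_val_score(Ridge(alpha=alpha), X, y, cv=cv,
                             scoring="neg_mean_squared_error")
    mean_mse.append(-scores.mean())
mean_mse = np.array(mean_mse)

best_alpha = grid[np.argmin(mean_mse)]
print("Best lambda by 5-fold CV:", best_alpha)
```

If the selected value sits at either end of the grid, the grid is probably too narrow and is worth widening before trusting the result.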

The regularization path:

As λ increases from 0:

  • Ridge: coefficients shrink smoothly toward zero but are generally never exactly zero; their relative ordering is usually, though not always, preserved.
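  • Lasso: coefficients hit exactly zero one after another; the order in which they drop out gives a rough indication of feature importance, though among correlated features the choice of which one survives can be arbitrary.

from sklearn.linear_model import lasso_path

# lasso_path does not fit an intercept, so center y (and ideally standardize X) first
path_alphas, coefs, _ = lasso_path(X_train, y_train)
# path_alphas is returned in decreasing order
# coefs shape: (n_features, n_alphas); column j holds the coefficients at path_alphas[j],
# showing which features survive as lambda grows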
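One way to read the path is to count how many coefficients are nonzero at each λ. The sketch below assumes synthetic, standardized data and a centered target; the dataset and the names `path_alphas` and `n_active` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import lasso_path
from sklearn.preprocessing import StandardScaler

# Synthetic data, assumed for illustration: 5 informative features out of 20
X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)
X = StandardScaler().fit_transform(X)   # put features on a common scale
y = y - y.mean()                        # lasso_path does not fit an intercept

path_alphas, coefs, _ = lasso_path(X, y)   # alphas come back largest first
n_active = (coefs != 0).sum(axis=0)        # nonzero coefficients at each alpha

# Print every 10th point on the path
for alpha, k in zip(path_alphas[::10], n_active[::10]):
    print(f"alpha={alpha:10.4f}  nonzero coefficients={k}")
```

Reading the output from top to bottom corresponds to decreasing λ, so features enter the model in that direction; reversed, it shows the order in which they drop out as λ grows.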

1-standard-error rule: rather than selecting the λ with minimum CV error, many practitioners choose the largest λ whose CV error is within one standard error of the minimum. This prefers a simpler, more regularized model when the performance difference is negligible.
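The rule can be sketched from the per-fold errors that `LassoCV` stores in `mse_path_`. The synthetic dataset and the variable names (`alpha_1se`, `threshold`) are assumptions for illustration; scikit-learn does not apply this rule itself, so the selection step here is hand-written.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

# Synthetic data, assumed for illustration only
X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       noise=20.0, random_state=0)

lasso_cv = LassoCV(cv=5).fit(X, y)

# mse_path_ has shape (n_alphas, n_folds): one held-out MSE per alpha and fold
n_folds = lasso_cv.mse_path_.shape[1]
mean_mse = lasso_cv.mse_path_.mean(axis=1)
se_mse = lasso_cv.mse_path_.std(axis=1, ddof=1) / np.sqrt(n_folds)

i_min = np.argmin(mean_mse)
threshold = mean_mse[i_min] + se_mse[i_min]

# Largest alpha whose mean CV error is within one standard error of the minimum
alpha_1se = lasso_cv.alphas_[mean_mse <= threshold].max()

print("alpha at minimum CV error:", lasso_cv.alphas_[i_min])
print("alpha by 1-SE rule:       ", alpha_1se)
```

The 1-SE value is never smaller than the minimum-error value, so refitting a `Lasso` with it gives an equally or more heavily regularized model.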
