datarekha

How do L1 and L2 regularization affect bias and variance, and when would you pick one over the other?

The short answer

Both L1 and L2 add a penalty on coefficient size that increases bias slightly but reduces variance, combating overfitting. L2 (ridge) shrinks all coefficients smoothly and handles correlated features well; L1 (lasso) drives some coefficients exactly to zero, performing feature selection. Choose L1 when you want sparsity and interpretability, L2 when you want stability, and elastic net to get both.

How to think about it

The crisp answer

Regularization adds a penalty term to the loss that discourages large coefficients. This deliberately introduces a little bias in exchange for a meaningful drop in variance, which usually improves generalization. L2 penalizes the sum of squared weights; L1 penalizes the sum of absolute weights.
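Written out as equations, the two penalties look like the sketch below. The notation is an assumption made for illustration and does not come from the source: \mathcal{L}(w) stands for the training loss, w_j for the j-th of p coefficients, and \lambda \ge 0 for the penalty strength.

```latex
\hat{w}_{\text{ridge}} = \arg\min_{w} \; \mathcal{L}(w) + \lambda \sum_{j=1}^{p} w_j^{2}
\qquad
\hat{w}_{\text{lasso}} = \arg\min_{w} \; \mathcal{L}(w) + \lambda \sum_{j=1}^{p} \lvert w_j \rvert
```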

Why they differ

The shapes of the penalty regions differ. L1’s diamond-shaped constraint has corners on the axes, so the optimum often lands on a corner where some weights are exactly zero; that is automatic feature selection. L2’s circular constraint (a sphere in higher dimensions) has no corners, so it shrinks weights toward zero smoothly but rarely to exactly zero. As the Analytics Vidhya bias-variance material frames it, regularization is the main knob for controlling the tradeoff.
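The difference is easiest to see in a one-coefficient toy problem, where both penalized minimizers have a closed form. The sketch below is a simplification that treats each coefficient independently (as if the features were orthonormal). The names soft_threshold, ridge_shrink and the sample values are made up for illustration.

```python
import math

def soft_threshold(z, lam):
    """L1 case: minimizer of 0.5*(w - z)**2 + lam*abs(w)."""
    return math.copysign(max(abs(z) - lam, 0.0), z)

def ridge_shrink(z, lam):
    """L2 case: minimizer of 0.5*(w - z)**2 + 0.5*lam*w**2."""
    return z / (1.0 + lam)

# Hypothetical unpenalized coefficients, two of them small.
unpenalized = [3.0, -1.2, 0.4, -0.1]
lam = 0.5

lasso_w = [soft_threshold(z, lam) for z in unpenalized]
ridge_w = [ridge_shrink(z, lam) for z in unpenalized]

print("lasso:", lasso_w)  # small coefficients are set to zero
print("ridge:", ridge_w)  # every coefficient shrinks, none reaches zero
```

With L1, any coefficient whose magnitude is below lam is set to zero and the rest move toward zero by lam. With L2, every coefficient is scaled by the same factor and stays nonzero.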

The key formula in words

Minimize (training loss) + λ × (penalty). The hyperparameter λ controls strength: λ = 0 recovers the unregularized model (high variance); large λ forces weights toward zero (high bias). You tune λ by cross-validation.
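Tuning λ by cross-validation can be sketched with a one-feature ridge model, where the penalized slope has a closed form. This is a toy setup under stated assumptions: no intercept, synthetic data, a hand-picked grid, and hypothetical helper names (ridge_slope, cv_error, pick_lambda).

```python
import random

def ridge_slope(x, y, lam):
    """Minimizer of sum((y - w*x)**2) + lam*w**2 for a single feature."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / (sxx + lam)

def cv_error(x, y, lam, k=5):
    """Mean squared error on held-out points across k interleaved folds."""
    n = len(x)
    total = 0.0
    for fold in range(k):
        test_idx = set(range(fold, n, k))
        train_x = [x[i] for i in range(n) if i not in test_idx]
        train_y = [y[i] for i in range(n) if i not in test_idx]
        w = ridge_slope(train_x, train_y, lam)
        total += sum((y[i] - w * x[i]) ** 2 for i in test_idx)
    return total / n

def pick_lambda(x, y, grid, k=5):
    """Return the grid value with the lowest cross-validated error."""
    return min(grid, key=lambda lam: cv_error(x, y, lam, k))

# Synthetic data: true slope 2.0 plus Gaussian noise (illustrative only).
rng = random.Random(0)
x = [rng.uniform(-1, 1) for _ in range(40)]
y = [2.0 * xi + rng.gauss(0, 0.5) for xi in x]

grid = [0.0, 0.1, 1.0, 10.0, 100.0]
slopes = [ridge_slope(x, y, lam) for lam in grid]
best_lam = pick_lambda(x, y, grid)

print("slopes by lambda:", slopes)  # magnitude never grows as lambda grows
print("chosen lambda:", best_lam)
```

The slope printed for λ = 0 is the ordinary least-squares fit. Larger values pull it toward zero, and the held-out error decides how much shrinkage is worth it.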

When to pick which

  • L1 (lasso): many irrelevant features, you want a sparse, interpretable model, or you want built-in feature selection.
  • L2 (ridge): features are correlated, you want stable coefficients and smooth shrinkage.
  • Elastic net: a mix of both penalties, giving sparsity plus stability under correlation.
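The three options above differ only in the penalty they add to the loss. A minimal sketch of those penalties follows. The l1_ratio mixing parameter is one common convention, assumed here for illustration; libraries scale the two terms differently.

```python
def l1_penalty(weights):
    """Lasso penalty: sum of absolute weights."""
    return sum(abs(w) for w in weights)

def l2_penalty(weights):
    """Ridge penalty: sum of squared weights."""
    return sum(w * w for w in weights)

def elastic_net_penalty(weights, l1_ratio):
    """Blend of the two; l1_ratio=1 is pure L1, l1_ratio=0 is pure L2."""
    return l1_ratio * l1_penalty(weights) + (1.0 - l1_ratio) * l2_penalty(weights)

weights = [3.0, -4.0, 0.0]  # hypothetical coefficient vector
print(l1_penalty(weights))                # 7.0
print(l2_penalty(weights))                # 25.0
print(elastic_net_penalty(weights, 0.5))  # 16.0
```

Each value is multiplied by λ before being added to the training loss, so elastic net has two settings to tune: the overall strength and the mix.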

The common trap

Forgetting to standardize features first: the penalty is scale-dependent, so a feature measured in small units (which needs a large coefficient) is penalized more heavily than the same feature measured in large units. Also, given a group of correlated features, L1 tends to keep one of them somewhat arbitrarily and zero out the rest, which can be misleading for interpretation. Follow-up: “Does increasing λ increase bias or variance?” It increases bias and reduces variance.
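The scale dependence can be checked on a toy example: the same feature expressed in metres and in kilometres gets very different ridge fits at the same λ, unless it is standardized first. The data, units and helper names below are hypothetical, and the model is a one-feature ridge fit with no intercept on centered data.

```python
import statistics

def center(values):
    """Subtract the mean."""
    m = statistics.fmean(values)
    return [v - m for v in values]

def standardize(values):
    """Subtract the mean and divide by the population standard deviation."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return [(v - m) / s for v in values]

def ridge_predictions(x, y, lam):
    """Fit y ~ w*x with penalty lam*w**2 and return the fitted values."""
    w = sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + lam)
    return [w * a for a in x]

x_m = [1000.0, 2000.0, 3000.0, 4000.0, 5000.0]  # feature in metres
x_km = [v / 1000.0 for v in x_m]                # same feature in kilometres
y = center([1.1, 1.9, 3.2, 3.9, 5.1])
lam = 10.0

# Centered only: the kilometre version is shrunk far more than the metre one.
raw_m = ridge_predictions(center(x_m), y, lam)
raw_km = ridge_predictions(center(x_km), y, lam)
raw_gap = max(abs(a - b) for a, b in zip(raw_m, raw_km))

# Standardized: the unit no longer matters, so the fits agree.
std_m = ridge_predictions(standardize(x_m), y, lam)
std_km = ridge_predictions(standardize(x_km), y, lam)
std_gap = max(abs(a - b) for a, b in zip(std_m, std_km))

print("gap without standardizing:", raw_gap)
print("gap after standardizing:", std_gap)
```

Standardizing puts every feature on the same footing, so λ means the same thing for each coefficient.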
