Machine Learning | Medium | Asked at Google, Amazon, Meta, Netflix

What is the fundamental difference between L1 (Lasso) and L2 (Ridge) regularization, and when do you choose each?

For: Data Scientist, ML Engineer, AI / LLM Engineer

The short answer

L1 adds the sum of absolute coefficient values to the loss, which can drive some coefficients to exactly zero and so performs implicit feature selection. L2 adds the sum of squared coefficients, which shrinks all weights toward zero but almost never sets any of them exactly to zero. Lasso is preferred when you suspect only a few features matter; Ridge is preferred when most features contribute small effects.

How to think about it

Both methods add a penalty term to the ordinary least squares (OLS) loss to reduce overfitting, but their constraint regions have fundamentally different geometry.

Ridge (L2): Loss = ||y - Xβ||² + λ||β||²

The penalty is the sum of squared weights. The constraint region ||β||² ≤ t is a ball (a disk in 2D) with a smooth boundary. It has no corners, so the point where the loss contours first touch it almost never has a coordinate exactly at zero.
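As a minimal sketch of the Ridge objective above, the NumPy snippet below solves it in closed form, β = (XᵀX + λI)⁻¹Xᵀy, on synthetic data. The helper name `ridge_closed_form`, the random data, and the λ values are illustrative assumptions, and no intercept is fitted.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Minimise ||y - X b||^2 + lam * ||b||^2 (no intercept)."""
    n_features = X.shape[1]
    # Solve (X^T X + lam * I) b = X^T y rather than forming an explicit inverse
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Synthetic data (assumption): the second true coefficient is zero
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_beta = np.array([2.0, 0.0, -1.0])
y = X @ true_beta + 0.1 * rng.normal(size=50)

for lam in (0.0, 10.0, 100.0):
    beta = ridge_closed_form(X, y, lam)
    print(lam, np.round(beta, 3), "exact zeros:", int(np.sum(beta == 0)))
```

As λ grows the coefficient norm should shrink, while the count of exact zeros would be expected to stay at 0, even for the feature whose true coefficient is zero.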

Lasso (L1): Loss = ||y - Xβ||² + λ||β||₁

The penalty is the sum of absolute weights. The constraint region ||β||₁ ≤ t is a diamond (in 2D) with sharp corners on the axes. The elliptical contours of the OLS loss often first touch the diamond at a corner (or, in higher dimensions, along an edge or face), where one or more coordinates are exactly zero. This is why Lasso yields sparse solutions.

L1 diamond corners lie on axes → sparse solutions. L2 circle has no corners → all weights stay nonzero.
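One way to see the same contrast algebraically is the special case of an orthonormal design (XᵀX = I), where both problems decouple per coordinate. Writing z = Xᵀy for the OLS solution, minimizing (z − b)² + λ|b| gives soft-thresholding at λ/2, while minimizing (z − b)² + λb² gives b = z / (1 + λ). The sketch below assumes that special case only; the function names and example values are illustrative.

```python
import numpy as np

def lasso_orthonormal(z, lam):
    """Per-coordinate Lasso solution when X^T X = I: soft-threshold at lam / 2."""
    return np.sign(z) * np.maximum(np.abs(z) - lam / 2.0, 0.0)

def ridge_orthonormal(z, lam):
    """Per-coordinate Ridge solution when X^T X = I: rescale by 1 / (1 + lam)."""
    return z / (1.0 + lam)

z = np.array([3.0, 0.4, -0.2, -1.5])  # hypothetical OLS coefficients
lam = 1.0

print("lasso:", lasso_orthonormal(z, lam))
print("ridge:", ridge_orthonormal(z, lam))
```

Any entry with |z| ≤ λ/2 is set exactly to zero by the L1 rule, whereas the L2 rule only rescales every entry by the same factor and leaves small coefficients small but nonzero.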

ElasticNet combines both: λ₁||β||₁ + λ₂||β||², gaining Lasso’s sparsity plus Ridge’s stability when features are correlated.
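If you fit this with scikit-learn, note that its ElasticNet is parameterized by alpha and l1_ratio rather than λ₁ and λ₂. As its documentation describes the objective, the squared-error term is scaled by 1/(2n) and the L2 term carries a factor of 0.5. The helper below is a hypothetical convenience, not part of scikit-learn, that sketches the conversion under that assumption.

```python
def to_sklearn_elasticnet(lam1, lam2, n_samples):
    """Map ||y - Xb||^2 + lam1*||b||_1 + lam2*||b||_2^2 to (alpha, l1_ratio).

    Assumes scikit-learn's documented ElasticNet objective:
    1/(2n)*||y - Xb||^2 + alpha*l1_ratio*||b||_1
                        + 0.5*alpha*(1 - l1_ratio)*||b||_2^2
    """
    if lam1 < 0 or lam2 < 0 or lam1 + lam2 == 0:
        raise ValueError("need non-negative penalties, at least one positive")
    a = lam1 / (2.0 * n_samples)  # effective L1 weight after dividing by 2n
    b = lam2 / n_samples          # sklearn applies 0.5 * b to ||b||_2^2
    alpha = a + b
    return alpha, a / alpha

# Illustrative values only
alpha, l1_ratio = to_sklearn_elasticnet(lam1=20.0, lam2=10.0, n_samples=100)
print(alpha, l1_ratio)
```

In practice alpha and l1_ratio are usually tuned directly, so a conversion like this matters mainly when comparing against a formula written in the λ₁, λ₂ form.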

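import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

# X_train, y_train are assumed to be defined already, ideally with
# standardized features so the penalty treats every coefficient equally.
ridge = Ridge(alpha=1.0).fit(X_train, y_train)
lasso = Lasso(alpha=0.1).fit(X_train, y_train)
enet  = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X_train, y_train)

# Check sparsity: count the coefficients Lasso set exactly to zero
print("Lasso zeros:", np.sum(lasso.coef_ == 0))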
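To try the sparsity check without a dataset at hand, here is a self-contained NumPy sketch of Lasso via cyclic coordinate descent on synthetic data where only two of ten features carry signal. It is an illustrative implementation, not scikit-learn's. The function names, the data, and alpha=0.1 are assumptions, and it uses the (1/(2n))||y − Xw||² + alpha||w||₁ scaling with no intercept.

```python
import numpy as np

def soft_threshold(rho, alpha):
    """Shrink rho toward zero by alpha; exactly 0 when |rho| <= alpha."""
    return np.sign(rho) * max(abs(rho) - alpha, 0.0)

def lasso_coordinate_descent(X, y, alpha, n_sweeps=200):
    """Minimise (1/(2n))*||y - Xw||^2 + alpha*||w||_1 by cyclic coordinate descent."""
    n, p = X.shape
    w = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n  # per-feature curvature
    for _ in range(n_sweeps):
        for j in range(p):
            # Residual with feature j's current contribution added back
            r_j = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ r_j / n
            w[j] = soft_threshold(rho, alpha) / col_sq[j]
    return w

# Synthetic data (assumption): only the first two features matter
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
true_w = np.zeros(10)
true_w[:2] = [3.0, -2.0]
y = X @ true_w + 0.1 * rng.normal(size=200)

w = lasso_coordinate_descent(X, y, alpha=0.1)
print("Lasso zeros:", int(np.sum(w == 0)))
```

On data like this, the eight noise features would be expected to come out exactly zero, while the two informative coefficients stay nonzero but are shrunk slightly toward zero.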

Summary: Use Ridge when you expect many small effects (genomics, NLP bag-of-words). Use Lasso when you expect true sparsity — few features matter. Use ElasticNet when features are correlated and you still want sparsity.
