What is the fundamental difference between L1 (Lasso) and L2 (Ridge) regularization, and when do you choose each?
L1 adds the sum of absolute coefficient values to the loss, which drives some coefficients to exactly zero and so performs implicit feature selection. L2 adds the sum of squared coefficients, which shrinks all weights toward zero (proportionally only when the features are uncorrelated) but almost never sets any exactly to zero. Lasso is preferred when you suspect only a few features matter; Ridge is preferred when most features contribute small effects.
How to think about it
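Both methods add a penalty term to the OLS loss to reduce overfitting, but the geometry of their constraint regions is fundamentally different. Each penalized problem is equivalent to minimizing the OLS loss subject to a budget t on the size of β, and the shape of that budget region determines where the solution lands.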
Ridge (L2): Loss = ||y - Xβ||² + λ||β||²
The penalty is the sum of squared weights. The constraint region ||β||² ≤ t is a ball (a disk in 2D) whose boundary is smooth, with no corners. The point where the loss contours first touch it can lie anywhere on that boundary, so a coordinate is almost never exactly zero.
Lasso (L1): Loss = ||y - Xβ||² + λ||β||₁
The penalty is the sum of absolute weights. The constraint region ||β||₁ ≤ t is a diamond (in 2D) with sharp corners on the axes. The elliptical contours of the OLS loss often first touch the diamond at a corner (or, in higher dimensions, on a lower-dimensional edge or face), where one or more coordinates are exactly zero. This is why Lasso yields sparse solutions.
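The corner effect can be made concrete in the simplest case. The following is a minimal sketch, assuming an orthonormal design (XᵀX = I), where both losses above separate per coordinate and have closed-form minimizers; the function names `ridge_shrink` and `lasso_shrink` and the example values are illustrative, not part of any library.

```python
import numpy as np

def ridge_shrink(b_ols, lam):
    # Per-coordinate minimizer of (b - b_ols)^2 + lam * b^2:
    # every coefficient is scaled by the same factor 1 / (1 + lam).
    return b_ols / (1.0 + lam)

def lasso_shrink(b_ols, lam):
    # Per-coordinate minimizer of (b - b_ols)^2 + lam * |b|:
    # soft-thresholding, which moves each coefficient toward zero by lam / 2
    # and clamps it to exactly zero if it would cross zero.
    return np.sign(b_ols) * np.maximum(np.abs(b_ols) - lam / 2.0, 0.0)

# Hypothetical OLS estimates under the orthonormal-design assumption
b_ols = np.array([3.0, 0.2, -1.0])
ridge_coefs = ridge_shrink(b_ols, lam=1.0)
lasso_coefs = lasso_shrink(b_ols, lam=1.0)
print("ridge:", ridge_coefs)
print("lasso:", lasso_coefs)
```

With λ = 1, Ridge halves every coefficient, so the small one goes from 0.2 to 0.1 but stays nonzero. Lasso subtracts 0.5 in magnitude from each, so 0.2 is set to exactly zero while the large coefficients survive.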
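ElasticNet combines both penalties: Loss = ||y - Xβ||² + λ₁||β||₁ + λ₂||β||². It keeps Lasso’s sparsity while gaining Ridge’s stability when features are correlated, a setting where Lasso alone tends to keep one feature from a correlated group and drop the rest somewhat arbitrarily.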
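The (λ₁, λ₂) form above does not map one-to-one onto scikit-learn's arguments. As a rough sketch, assuming the objective documented for scikit-learn's ElasticNet, (1 / (2n))·||y - Xw||² + alpha·l1_ratio·||w||₁ + 0.5·alpha·(1 - l1_ratio)·||w||², dividing the loss above by 2n gives the conversion below. The helper name `to_sklearn_params` is hypothetical.

```python
def to_sklearn_params(lam1, lam2, n_samples):
    # Dividing ||y - Xb||^2 + lam1*||b||_1 + lam2*||b||^2 by 2n and matching
    # terms with the assumed scikit-learn objective gives:
    #   alpha * l1_ratio       = lam1 / (2n)
    #   alpha * (1 - l1_ratio) = lam2 / n
    l1_part = lam1 / (2.0 * n_samples)
    l2_part = lam2 / n_samples
    alpha = l1_part + l2_part
    if alpha == 0.0:
        raise ValueError("lam1 and lam2 are both zero: no penalty to convert")
    l1_ratio = l1_part / alpha
    return alpha, l1_ratio

# Illustrative values only
alpha, l1_ratio = to_sklearn_params(lam1=20.0, lam2=10.0, n_samples=100)
print("alpha:", alpha, "l1_ratio:", l1_ratio)
```

The practical point is that alpha sets the overall penalty strength and l1_ratio sets the mix, so the same alpha value does not mean the same amount of regularization across Ridge, Lasso, and ElasticNet.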
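import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

# X_train and y_train are assumed to be defined already, ideally with standardized features.
# alpha plays the role of the penalty strength λ.
ridge = Ridge(alpha=1.0).fit(X_train, y_train)
lasso = Lasso(alpha=0.1).fit(X_train, y_train)
# l1_ratio mixes the two penalties; l1_ratio=1.0 would be a pure L1 penalty.
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X_train, y_train)

# Check sparsity: count the coefficients Lasso set exactly to zero
print("Lasso zeros:", np.sum(lasso.coef_ == 0))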
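To see the sparsity difference end to end, here is a self-contained sketch on synthetic data. The dataset (50 features, only 5 informative) and the alpha values are assumptions chosen for illustration, not tuned settings.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic regression problem in which only 5 of 50 features carry signal
X, y = make_regression(
    n_samples=200, n_features=50, n_informative=5, noise=5.0, random_state=0
)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)

# Count coefficients that are exactly zero under each penalty
ridge_zeros = int(np.sum(ridge.coef_ == 0))
lasso_zeros = int(np.sum(lasso.coef_ == 0))
print("Ridge zeros:", ridge_zeros)
print("Lasso zeros:", lasso_zeros)
```

On data like this, Ridge should keep every coefficient nonzero (small, but not zero), while Lasso should zero out a large share of the uninformative features. In practice alpha would be chosen by cross-validation rather than fixed by hand.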
Summary: Use Ridge when you expect many small effects (genomics, NLP bag-of-words). Use Lasso when you expect true sparsity — few features matter. Use ElasticNet when features are correlated and you still want sparsity.