Why does regularization require feature scaling, and what happens if you skip it?
Regularization penalizes large coefficient magnitudes uniformly. If features are on different scales, a feature measured in thousands will naturally have a small coefficient while one measured in fractions will have a large one, so the penalty disproportionately shrinks some features and nearly ignores others. Standardization ensures the penalty is applied equally across all features.
How to think about it
Consider two features: income in dollars (range 30,000–200,000) and age in years (range 20–80). A one-unit change means very different things for each. OLS might fit income with a coefficient near 0.0001 and age with a coefficient near 5. Both coefficients are reasonable given the units, but:
- L2 penalty λ(β_income² + β_age²) barely shrinks β_income (it is tiny) and strongly shrinks β_age (it is large), even if age is actually less important.
- L1 penalty will tend to drive β_age to zero before β_income, so the wrong feature may be eliminated.
The regularizer implicitly treats features as if their units were comparable. That assumption only holds after standardization.
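The unit dependence can be checked with a small sketch. It assumes a single feature and an unpenalized intercept, in which case the ridge slope has the closed form Sxy / (Sxx + λ). The helper `ridge_slope` and the data are hypothetical and only meant to illustrate the effect; this is not how a library solves the general problem.

```python
from statistics import fmean

def ridge_slope(x, y, lam):
    """Slope of single-feature ridge with an unpenalized intercept.

    Minimizes sum((y - b0 - b1*x)^2) + lam * b1^2, which gives
    b1 = Sxy / (Sxx + lam) with centered sums Sxy and Sxx.
    """
    x_bar, y_bar = fmean(x), fmean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / (sxx + lam)

# Hypothetical data: income in dollars, target rising 0.0001 per dollar.
income_dollars = [30_000, 60_000, 90_000, 120_000, 200_000]
target = [0.0001 * v for v in income_dollars]

# The same feature, re-expressed in units of $100,000.
income_100k = [v / 100_000 for v in income_dollars]

lam = 1.0
ols_dollars = ridge_slope(income_dollars, target, 0.0)  # lam=0 is plain OLS
ols_100k = ridge_slope(income_100k, target, 0.0)

# Fraction of the OLS slope that survives the penalty in each unit system.
kept_dollars = ridge_slope(income_dollars, target, lam) / ols_dollars
kept_100k = ridge_slope(income_100k, target, lam) / ols_100k

print(f"kept (dollars): {kept_dollars:.6f}")
print(f"kept ($100k):   {kept_100k:.6f}")
```

The data and λ are identical in both runs; only the unit changes. In dollars the slope is left almost untouched, while in units of $100,000 a substantial part of it is shrunk away, which is the arbitrary behavior standardization is meant to remove.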
Standard approach — z-score standardization:
x_scaled = (x - mean(x)) / std(x)
After scaling, each feature has mean 0 and std 1, so a coefficient of magnitude 2 means the same thing for every feature.
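As a minimal sketch of the formula above, using only the standard library: the statistics are computed once and then applied, so the same mean and std can be reused on other data. The function names and the age values are hypothetical. The population standard deviation (ddof=0) is used here, which matches what StandardScaler documents as its estimator.

```python
from statistics import fmean, pstdev

def fit_standardizer(column):
    """Return (mean, std) of a column, using the population std (ddof=0)."""
    mean = fmean(column)
    std = pstdev(column)
    if std == 0:
        # Design choice for this sketch: refuse constant columns outright.
        raise ValueError("constant column: z-score is undefined")
    return mean, std

def apply_standardizer(column, mean, std):
    """Apply x_scaled = (x - mean) / std with previously computed statistics."""
    return [(v - mean) / std for v in column]

ages = [20, 35, 50, 65, 80]  # hypothetical ages in years
mean, std = fit_standardizer(ages)
ages_scaled = apply_standardizer(ages, mean, std)

# Mean and std of the scaled column should be 0 and 1 up to rounding.
print(fmean(ages_scaled), pstdev(ages_scaled))
```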
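from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge, Lasso

# Chain scaling and the model so both are fitted together.
# Lasso(alpha=...) can replace Ridge for an L1 penalty.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", Ridge(alpha=1.0)),
])

# X_train, y_train are assumed to be defined already.
# fit() fits StandardScaler on the training data only, so no test data leaks in.
pipe.fit(X_train, y_train)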
Important notes:
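- Always fit the scaler on the training data only, then apply that same fitted transform to both the training and test sets; refitting on the test data (or on the combined data) leaks information.
- Tree-based models (Random Forest, XGBoost) do not require scaling: they split on thresholds of individual features, so only the ordering of values matters, not their magnitude.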
- The intercept is typically excluded from regularization in sklearn by default (fit_intercept=True does not penalize the bias term).
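The note on tree-based models can be illustrated with a toy regression stump. This is a simplified sketch, not the split routine of Random Forest or XGBoost: `best_split` is a hypothetical helper that tries every cut between consecutive sorted values and keeps the one with the lowest squared error. The data is made up.

```python
def best_split(x, y):
    """Return the set of sample indices sent left by the best single split.

    Tries every cut between consecutive sorted x values and keeps the one
    with the lowest total squared error around the two group means.
    """
    order = sorted(range(len(x)), key=lambda i: x[i])
    best_sse, best_left = float("inf"), None
    for k in range(1, len(order)):
        if x[order[k - 1]] == x[order[k]]:
            continue  # no threshold can separate equal values
        left, right = order[:k], order[k:]
        sse = 0.0
        for group in (left, right):
            mean = sum(y[i] for i in group) / len(group)
            sse += sum((y[i] - mean) ** 2 for i in group)
        if sse < best_sse:
            best_sse, best_left = sse, frozenset(left)
    return best_left

income = [30_000, 45_000, 80_000, 120_000, 200_000]  # dollars
y = [1.0, 1.2, 0.9, 3.1, 3.0]                        # hypothetical target

left_raw = best_split(income, y)
left_rescaled = best_split([v / 1_000 for v in income], y)  # thousands of dollars

# The same samples end up on the left side regardless of the unit.
print(left_raw == left_rescaled)
```

Rescaling changes the numeric threshold but not which samples fall on each side, so the fitted partition, and with it the predictions, stay the same.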