How do you handle skewed features in a machine learning dataset, and why does skew matter?
Right-skewed features (long tail on the right) bunch most values at the low end of the range while a few extreme values pull the mean above the median. This distorts distance-based models and gives a handful of points outsized leverage in linear regression. Common fixes are log, square-root, or Box-Cox transformations, which compress the tail and bring the distribution closer to normal. That usually helps gradient-based training converge and reduces the undue influence of large values.
How to think about it
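Skew matters most for linear models (OLS inference assumes roughly normal residuals, and extreme feature values act as high-leverage points), neural networks (large inputs can saturate activations and slow training), and distance-based models such as k-NN or k-means (extreme values dominate Euclidean distances). Tree-based models split on thresholds, so they depend only on the ordering of feature values and are largely unaffected by monotonic transformations of the features.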
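To make the distance-based case concrete, here is a minimal sketch with three made-up customers described by (revenue, page_views). The values and the helper `euclid` are illustrative assumptions, not data from any real source. It compares Euclidean distances before and after a log1p transform.

```python
import numpy as np

# Three hypothetical customers: (revenue, page_views). Values are made up.
a = np.array([100.0, 5.0])
b = np.array([120.0, 50.0])     # similar revenue, very different page views
c = np.array([50_000.0, 5.0])   # extreme revenue, identical page views

def euclid(u: np.ndarray, v: np.ndarray) -> float:
    """Plain Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(u - v))

# Distances on the raw scale: the revenue outlier swamps everything else.
raw_ab, raw_ac = euclid(a, b), euclid(a, c)

# Distances after log1p: both features contribute on a comparable scale.
log_ab = euclid(np.log1p(a), np.log1p(b))
log_ac = euclid(np.log1p(a), np.log1p(c))

print(f"raw: d(a,b)={raw_ab:.1f}, d(a,c)={raw_ac:.1f}")
print(f"log: d(a,b)={log_ab:.2f}, d(a,c)={log_ac:.2f}")
```

On the raw scale, the single extreme revenue value makes c look roughly a thousand times farther from a than b is. After the transform the two distances are within the same order of magnitude, so page views influence the comparison again.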
Diagnosing skew
Use the skewness statistic or a visual check:
import pandas as pd
import matplotlib.pyplot as plt
skew = X_train["revenue"].skew() # > 1 or < -1 is notable
X_train["revenue"].hist(bins=50)
plt.show()
As a rule of thumb, an absolute skewness above 1 indicates strong skew worth addressing; values between 0.5 and 1 are usually considered moderate.
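When a dataset has many columns, the same check can be applied to every numeric feature at once. The sketch below is one possible way to do that. The helper name `flag_skewed`, the threshold default, and the synthetic data are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only: one right-skewed and one symmetric column.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "revenue": rng.lognormal(mean=3.0, sigma=1.0, size=5_000),
    "age": rng.normal(loc=40.0, scale=10.0, size=5_000),
})

def flag_skewed(frame: pd.DataFrame, threshold: float = 1.0) -> pd.Series:
    """Return the skewness of numeric columns whose |skew| exceeds threshold."""
    skews = frame.select_dtypes(include="number").skew()
    return skews[skews.abs() > threshold]

skewed = flag_skewed(df)
print(skewed)  # only the lognormal "revenue" column should be listed
```

The threshold is a parameter rather than a constant because the cutoff of 1 is a convention, not a hard rule; a stricter or looser value may suit a particular model.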
Transformation options
Log transform: log(x + 1) is the most common fix for right-skewed, non-negative features (income, counts, transaction amounts). The +1 shift avoids log(0).
Square-root transform: gentler compression than the log; defined at zero without a shift, so it suits non-negative features with mild skew.
Box-Cox transform: estimates the power parameter λ by maximum likelihood so that the transformed data is as close to normal as possible. Requires strictly positive values.
Yeo-Johnson transform: like Box-Cox but also handles zero and negative values, which makes it a safer default.
from sklearn.preprocessing import PowerTransformer
import numpy as np
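# Yeo-Johnson works on any sign; Box-Cox requires strictly positive values.
# By default PowerTransformer also standardizes its output (standardize=True).
pt = PowerTransformer(method="yeo-johnson")
X_train_transformed = pt.fit_transform(X_train[["revenue", "page_views"]])
# Apply the same fitted transformer to validation/test data with pt.transform(...)
# Manual log for a single non-negative column: log1p(x) = log(1 + x)
X_train["log_revenue"] = np.log1p(X_train["revenue"])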
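One way to check whether a transformation actually helped is to compare skewness before and after. The sketch below uses synthetic lognormal data in place of a real revenue column (the generated values and the `X_test` split are assumptions for illustration). It fits Yeo-Johnson on the training split only and reuses the fitted transformer on the test split.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

# Synthetic stand-ins for X_train / X_test (illustration only).
rng = np.random.default_rng(42)
X_train = pd.DataFrame({"revenue": rng.lognormal(3.0, 1.0, size=4_000)})
X_test = pd.DataFrame({"revenue": rng.lognormal(3.0, 1.0, size=1_000)})

# Estimate lambda on the training split only, then reuse it on the test split
# so no information from the test data leaks into preprocessing.
pt = PowerTransformer(method="yeo-johnson")  # standardize=True by default
train_t = pt.fit_transform(X_train[["revenue"]])
test_t = pt.transform(X_test[["revenue"]])

skew_before = X_train["revenue"].skew()
skew_after = pd.Series(train_t[:, 0]).skew()
skew_log = np.log1p(X_train["revenue"]).skew()

print(f"raw: {skew_before:.2f}, yeo-johnson: {skew_after:.2f}, log1p: {skew_log:.2f}")
print("fitted lambda:", pt.lambdas_[0])
```

If the skewness after transformation is still far from zero, a power transform alone may not be enough, and it can be worth inspecting the histogram for multimodality or outliers.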
Which to choose
| Feature type | Transformation |
|---|---|
| Non-negative, strong right skew | log1p |
| Non-negative, mild skew | Square root |
| Strictly positive, want a fitted λ | Box-Cox |
| Any sign, want a fitted λ | Yeo-Johnson |
| Target variable (regression) | Log or Yeo-Johnson; remember to invert predictions |
Skewed target variable
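Transforming the target in regression is a separate decision from transforming features. Training on log(y) often stabilizes variance, but predictions must be back-transformed with exp() (or expm1() if the model was trained on log1p(y)). For final evaluation, compute metrics such as mean_squared_error on the original scale, not the log scale, so the error is expressed in the target's own units.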
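scikit-learn's `TransformedTargetRegressor` can handle the forward and inverse transform together, which makes it harder to forget the back-transformation. The following is a minimal sketch on synthetic data, assuming a strictly positive target so that a plain log is valid; the feature matrix, coefficients, and noise level are made up for illustration.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic, strictly positive target whose log is linear in the features.
rng = np.random.default_rng(0)
X = rng.normal(size=(2_000, 3))
y = np.exp(1.0 + X @ np.array([0.5, -0.3, 0.2]) + rng.normal(scale=0.3, size=2_000))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# The wrapper fits the regressor on log(y) and applies exp() inside predict().
model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log,
    inverse_func=np.exp,
)
model.fit(X_tr, y_tr)
y_pred = model.predict(X_te)  # already back on the original scale

# Evaluate on the original scale, not on log(y).
rmse = float(np.sqrt(mean_squared_error(y_te, y_pred)))
print(f"RMSE on original scale: {rmse:.3f}")
```

For targets that can be zero, `func=np.log1p` with `inverse_func=np.expm1` is the analogous pairing.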