datarekha
Machine Learning | Easy | Asked at Amazon | Asked at Flipkart | Asked at Walmart

How do you handle skewed features in a machine learning dataset, and why does skew matter?

The short answer

Right-skewed features (long tail on the right) concentrate most values at the low end of the range while a few extreme values pull the mean above the median. Those extremes can dominate distance-based models and act as high-leverage points in linear regression. Common fixes are log, square-root, or Box-Cox transformations, which compress the tail and make the distribution closer to normal; this often improves convergence for gradient-based training and reduces the undue influence of large values.

How to think about it

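Skew matters most for linear models (extreme feature values act as high-leverage points, and a skewed target often produces the non-normal, unequal-variance residuals that OLS inference assumes away), neural networks (large unscaled inputs can saturate activations and slow training), and distance-based models such as k-NN and k-means (extreme values dominate Euclidean distances). Tree-based models split on the ordering of values rather than their magnitudes, so they are largely unaffected by monotonic transformations of a feature.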
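To illustrate the point about trees, here is a minimal sketch that fits the same decision tree on a raw feature and on its log1p transform. The data (cubes of 1..200 with a noisy log target) is synthetic and chosen only for illustration. Because a monotonic transform preserves the ordering of the training values, the two trees should partition the training rows identically.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical right-skewed feature: cubes of 1..200 (a few very large values)
x = (np.arange(1, 201, dtype=float) ** 3).reshape(-1, 1)
rng = np.random.default_rng(0)
y = np.log(x.ravel()) + rng.normal(scale=0.1, size=x.shape[0])

# Same tree settings, once on the raw feature and once on log1p(feature)
tree_raw = DecisionTreeRegressor(max_depth=4, random_state=0).fit(x, y)
tree_log = DecisionTreeRegressor(max_depth=4, random_state=0).fit(np.log1p(x), y)

pred_raw = tree_raw.predict(x)
pred_log = tree_log.predict(np.log1p(x))

# The ordering of x is unchanged, so the training rows land in the same leaves
same_fit = np.allclose(pred_raw, pred_log)
print("identical training predictions:", same_fit)
```

Predictions on unseen points can still differ slightly, since split thresholds sit between neighbouring training values and those midpoints move under a nonlinear transform.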

Diagnosing skew

Use the skewness statistic or a visual check:

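import pandas as pd
import matplotlib.pyplot as plt

# X_train is assumed to be a pandas DataFrame with a numeric "revenue" column
skew = X_train["revenue"].skew()   # > 1 or < -1 is notable
print(f"revenue skewness: {skew:.2f}")

X_train["revenue"].hist(bins=50)   # visual check: look for a long right tail
plt.show()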

As a rule of thumb, a skewness magnitude above 1 indicates a strongly skewed feature and is a practical threshold for action.
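A self-contained version of this check can be sketched on synthetic data. The "revenue" series below is hypothetical (drawn from a log-normal distribution, which is strongly right-skewed); it only serves to show how the skewness statistic and the gap between mean and median react to a log1p transform.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical "revenue" column: log-normal data has a long right tail
revenue = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=5_000), name="revenue")

raw_skew = revenue.skew()
log_skew = np.log1p(revenue).skew()

print(f"raw skew:   {raw_skew:.2f}")
print(f"log1p skew: {log_skew:.2f}")
# In right-skewed data the mean sits above the median
print(f"mean {revenue.mean():.1f} vs median {revenue.median():.1f}")
```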

Transformation options

Log transform: log(x + 1) is the most common fix for right-skewed, non-negative features (income, counts, transaction amounts). The +1 shift keeps zeros valid, since log(0) is undefined; in NumPy this is np.log1p.

Square-root transform: gentler compression than the log, and defined at zero without a shift; useful for mildly skewed features such as counts.

Box-Cox transform: estimates the power parameter λ (by maximum likelihood) that makes the transformed data as close to normal as possible. Requires strictly positive values.

Yeo-Johnson transform: like Box-Cox but also works with zero and negative values, which makes it a safer default.

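from sklearn.preprocessing import PowerTransformer
import numpy as np

# Yeo-Johnson works on any sign; Box-Cox requires strictly positive values.
# PowerTransformer also standardizes its output by default (standardize=True).
pt = PowerTransformer(method="yeo-johnson")
# Fit on the training data only; reuse pt.transform(...) for validation/test data
X_train_transformed = pt.fit_transform(X_train[["revenue", "page_views"]])

# Manual log for a single column (log1p requires values greater than -1)
X_train["log_revenue"] = np.log1p(X_train["revenue"])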
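The snippet above assumes an existing X_train. A runnable sketch of the same idea, including the train/test handling, might look like the following. The data frame and its two columns are synthetic stand-ins, and the split sizes are arbitrary; the point is that λ is estimated on the training rows and then reused unchanged on held-out rows.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
# Hypothetical data: two right-skewed, non-negative columns
df = pd.DataFrame({
    "revenue": rng.lognormal(mean=3.0, sigma=1.0, size=2_000),
    "page_views": rng.poisson(lam=3.0, size=2_000).astype(float) ** 2,
})
X_train, X_test = train_test_split(df, test_size=0.25, random_state=0)

pt = PowerTransformer(method="yeo-johnson")  # standardize=True by default
train_t = pt.fit_transform(X_train)          # lambdas estimated on training rows only
test_t = pt.transform(X_test)                # held-out rows reuse the fitted lambdas

print("fitted lambdas:", pt.lambdas_)
train_skew = pd.DataFrame(train_t, columns=df.columns).skew()
print("train skew after transform:", train_skew.round(2).to_dict())
```

Fitting the transformer on the full data set before splitting would leak information from the test rows into the estimated λ values.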

Which to choose

Feature type | Transformation
Non-negative, right skew | Log1p
Non-negative, mild skew | Square root
Positive only, want optimal | Box-Cox
Any sign, want optimal | Yeo-Johnson
Target variable (regression) | Log or Yeo-Johnson; remember to invert predictions
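One way to compare these options is to apply each to the same column and look at the resulting skewness. The sketch below uses a synthetic, strictly positive log-normal sample (an assumption made so that Box-Cox is applicable) and the SciPy functions stats.boxcox and stats.yeojohnson, which return the transformed values together with the fitted λ when no λ is supplied.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical strictly positive, right-skewed feature
x = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=5_000))

# With no lambda given, SciPy fits it by maximum likelihood
boxcox_values, boxcox_lambda = stats.boxcox(x.to_numpy())
yj_values, yj_lambda = stats.yeojohnson(x.to_numpy())

skews = {
    "raw": x.skew(),
    "sqrt": np.sqrt(x).skew(),
    "log1p": np.log1p(x).skew(),
    "box-cox": pd.Series(boxcox_values).skew(),
    "yeo-johnson": pd.Series(yj_values).skew(),
}
for name, value in skews.items():
    print(f"{name:12s} skew = {value:6.2f}")
print(f"fitted Box-Cox lambda: {boxcox_lambda:.3f}")
```

On data like this, the square root should reduce skew only partially, while the log and the fitted power transforms should bring it close to zero; results on real features will vary.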

Skewed target variable

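Transforming the target in regression is a separate decision from transforming features. Training on log(y) often stabilizes variance, but predictions must be back-transformed with exp() (or expm1() if the model was trained on log1p(y)). By Jensen's inequality, exp of the predicted mean of log(y) is at most the mean of y, so back-transformed predictions tend to underestimate the conditional mean. Use mean_squared_error on the original scale, not the log scale, for final evaluation.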
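A minimal sketch of this workflow uses scikit-learn's TransformedTargetRegressor, which applies the forward function to y before fitting and the inverse to predictions. The data below is synthetic and built to be log-linear (an assumption that favours the log model); the plain linear regression is included only as a baseline, and both are scored on the original scale.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Hypothetical data: log(y) is linear in X, so y itself is right-skewed and positive
X = rng.normal(size=(2_000, 3))
y = np.exp(1.0 + X @ np.array([0.8, -0.5, 0.3]) + rng.normal(scale=0.3, size=2_000))

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log,          # applied to y before fitting (y must be > 0)
    inverse_func=np.exp,  # applied to predictions automatically
)
model.fit(X_train, y_train)
baseline = LinearRegression().fit(X_train, y_train)  # trained on raw y

# Both scores are on the original scale of y
mse_log = mean_squared_error(y_test, model.predict(X_test))
mse_raw = mean_squared_error(y_test, baseline.predict(X_test))
print(f"MSE (log target, back-transformed): {mse_log:.2f}")
print(f"MSE (raw target):                   {mse_raw:.2f}")
```

Wrapping the transform this way keeps the back-transformation from being forgotten at prediction time. For targets that contain zeros, np.log1p and np.expm1 would be the matching pair.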
