How do you detect and handle outliers in a machine learning dataset?
Outliers are detected via statistical rules (IQR, Z-score), visualization, or isolation-based algorithms. Handling options are removal, capping (Winsorization), transformation, or using robust algorithms. The right action depends on whether the outlier is a measurement error or a genuine extreme value — genuine extremes carry signal and should not be blindly removed.
How to think about it
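Outliers hurt linear regression, KNN, K-means, and PCA heavily, because these methods rely on squared errors, distances, or variances, all of which grow with magnitude. Tree-based models are largely robust to outliers in the features because splits depend on the ordering of values, not their magnitude. Extreme target values can still pull the leaf averages of a regression tree.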
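To make the contrast concrete, here is a minimal sketch on toy data. The data, the helper names fit_slope and tree_train_predictions, and the choice of max_depth=2 are assumptions made for illustration only. It corrupts a single feature value and compares how a linear model and a shallow decision tree react.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Toy data (assumed for illustration): y = 2x + 1 exactly
x_clean = np.arange(10, dtype=float)
y = 2.0 * x_clean + 1.0

# Same rows, but the largest feature value is recorded as 1000 instead of 9.
# The ordering of the rows by x is unchanged.
x_outlier = x_clean.copy()
x_outlier[-1] = 1000.0


def fit_slope(x, y):
    """Fit ordinary least squares on one feature and return the slope."""
    return LinearRegression().fit(x.reshape(-1, 1), y).coef_[0]


def tree_train_predictions(x, y):
    """Fit a shallow regression tree and return its predictions on the training rows."""
    X = x.reshape(-1, 1)
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    return tree.fit(X, y).predict(X)


slope_clean = fit_slope(x_clean, y)
slope_outlier = fit_slope(x_outlier, y)
tree_clean = tree_train_predictions(x_clean, y)
tree_outlier = tree_train_predictions(x_outlier, y)

print("OLS slope, clean feature:     ", slope_clean)
print("OLS slope, corrupted feature: ", slope_outlier)
print("Tree predictions unchanged:   ", np.allclose(tree_clean, tree_outlier))
```

The linear slope collapses toward zero because one distant point dominates the fit, while the tree partitions the training rows identically in both cases since only the rank of the feature value matters.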
Detection methods
IQR rule: flag values below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR. Simple, non-parametric, does not assume normality.
Z-score: flag values with |z| > 3. Assumes roughly normal distribution; breaks down when the outliers themselves shift the mean and std.
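Isolation Forest: an ensemble of random trees that isolates points by repeatedly picking a random feature and a random split value. Anomalies tend to be separated from the rest in fewer splits, so a short average path length marks a point as an outlier. Efficient on high-dimensional data.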
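from sklearn.ensemble import IsolationForest
import numpy as np
# Placeholder data for illustration; substitute your own training features
X_train = np.random.default_rng(42).normal(size=(500, 4))
# contamination is the expected fraction of outliers (0.02 is a guess; tune per dataset)
iso = IsolationForest(contamination=0.02, random_state=42)
labels = iso.fit_predict(X_train)  # -1 = outlier, 1 = inlier
X_clean = X_train[labels == 1]  # keep only the rows labeled as inliers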
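The IQR and Z-score rules described above can be sketched with NumPy alone. The function names iqr_outlier_mask and zscore_outlier_mask and the ten-point sample are hypothetical, chosen only to show how the two rules can disagree on a small sample.

```python
import numpy as np


def iqr_outlier_mask(values, k=1.5):
    """Boolean mask: True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


def zscore_outlier_mask(values, threshold=3.0):
    """Boolean mask: True where |z| exceeds the threshold (population std)."""
    values = np.asarray(values, dtype=float)
    std = values.std()
    if std == 0:
        # All values identical: nothing can be an outlier
        return np.zeros(values.shape, dtype=bool)
    z = (values - values.mean()) / std
    return np.abs(z) > threshold


# Small illustrative sample with one obvious extreme value
values = np.array([10, 12, 11, 13, 12, 11, 10, 12, 13, 100])

iqr_flags = iqr_outlier_mask(values)
z_flags = zscore_outlier_mask(values)

print("Flagged by IQR rule:    ", values[iqr_flags])
print("Flagged by Z-score rule:", values[z_flags])
```

On this sample the IQR rule flags 100, while the Z-score rule flags nothing: the single extreme value inflates the mean and standard deviation enough to keep its own z-score just under 3. This is the masking effect mentioned for the Z-score method, and it is most pronounced on small samples.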
Handling options
Remove: justified only for confirmed measurement errors (sensor malfunction, data entry mistake). Removing genuine extreme events biases the model.
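Winsorize (cap): clip values at chosen percentiles, for example the 1st and 99th. This preserves the row but limits the influence of extremes. Compute the bounds on the training set only and reuse them on validation and test data. It is a reasonable default for heavy-tailed features and regression targets, though capping a target also caps what the model can learn to predict.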
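Transform: a log, square-root, or Box-Cox transformation compresses the scale, reducing outlier influence while keeping the row. The log and Box-Cox transforms require strictly positive values; log1p handles zeros, and Yeo-Johnson handles negative values as well.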
Robust algorithms: HuberRegressor for regression, median-based statistics, or tree-based models that are naturally insensitive to magnitude.
from sklearn.preprocessing import RobustScaler
# RobustScaler uses median and IQR instead of mean and std
scaler = RobustScaler()
X_scaled = scaler.fit_transform(X_train)
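Winsorization and a log transform can be sketched in a few lines of NumPy. The helper names fit_winsor_bounds and winsorize, the synthetic lognormal data, and the 1st/99th percentile defaults are assumptions for illustration, not a prescribed implementation.

```python
import numpy as np


def fit_winsor_bounds(train_values, lower_pct=1.0, upper_pct=99.0):
    """Learn clipping bounds from training data only, to avoid leakage."""
    low, high = np.percentile(train_values, [lower_pct, upper_pct])
    return low, high


def winsorize(values, low, high):
    """Cap values to [low, high]; the number of rows is unchanged."""
    return np.clip(values, low, high)


# Synthetic right-skewed feature (assumed data, for illustration)
rng = np.random.default_rng(0)
train = rng.lognormal(mean=3.0, sigma=1.0, size=1000)

low, high = fit_winsor_bounds(train)
train_capped = winsorize(train, low, high)

# New data is capped with the bounds learned on the training set
new_values = np.array([0.5, 20.0, 1e6])
new_capped = winsorize(new_values, low, high)

# Alternative: compress the scale instead of capping.
# log1p computes log(1 + x), so it is defined at x = 0.
train_logged = np.log1p(train)

print("Bounds:", low, high)
print("New values after capping:", new_capped)
print("Max before/after log1p:", train.max(), train_logged.max())
```

Both approaches keep every row. Capping discards information beyond the bounds, whereas the log transform keeps the ordering of all values and can be inverted with np.expm1.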
Decision framework
- Investigate: is this a data quality issue or a real event?
- If data error: remove or correct.
- If real but extreme: try Winsorization or a log transform first.
- If the model must handle real extremes at inference time: use a robust algorithm rather than removing training outliers.
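As a rough illustration of the last point, the sketch below fits ordinary least squares and scikit-learn's HuberRegressor on the same synthetic data, where a few genuine extremes are left in the training set. The data-generating line (slope 3, intercept 2) and the size of the extremes are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

# Synthetic data (assumed): y = 3x + 2 plus small Gaussian noise
rng = np.random.default_rng(0)
X = np.linspace(0, 10, 100).reshape(-1, 1)
y = 3.0 * X.ravel() + 2.0 + rng.normal(scale=0.5, size=100)

# Five real but extreme target values, kept in the training set
y[-5:] += 100.0

ols = LinearRegression().fit(X, y)
huber = HuberRegressor().fit(X, y)  # default epsilon=1.35

print("True slope:  3.0")
print("OLS slope:  ", ols.coef_[0])
print("Huber slope:", huber.coef_[0])
```

The Huber loss is quadratic for small residuals and linear for large ones, so the extreme rows still contribute but cannot dominate the fit the way they do under squared error.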