datarekha
Machine Learning | Medium | Asked at Amazon, Goldman Sachs, Walmart

How do you detect and handle outliers in a machine learning dataset?

The short answer

Outliers are detected via statistical rules (IQR, Z-score), visualization, or isolation-based algorithms. Handling options are removal, capping (Winsorization), transformation, or using robust algorithms. The right action depends on whether the outlier is a measurement error or a genuine extreme value — genuine extremes carry signal and should not be blindly removed.

How to think about it

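Outliers hurt linear regression, KNN, K-means, and PCA heavily, because these methods rely on squared errors, distances, or variances that a single extreme value can dominate. Tree-based models are largely robust to outliers in the features because splits care about rank, not magnitude, although extreme target values can still skew a regression tree's leaf predictions.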

Detection methods

IQR rule: flag values below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR. Simple, non-parametric, does not assume normality.
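As a minimal sketch of the IQR rule, the snippet below computes the two fences with NumPy and flags anything outside them. The function name `iqr_outlier_mask`, the parameter `k`, and the sample values are made up for illustration.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Return a boolean mask that is True where a value lies outside the IQR fences."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)

# Hypothetical sample: nine ordinary values and one extreme value
sample = np.array([10, 12, 11, 13, 12, 11, 14, 12, 13, 95])
mask = iqr_outlier_mask(sample)
print(sample[mask])  # prints [95]
```

Because the fences are built from quartiles, the single extreme value barely moves them, which is why the rule still catches it.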

Z-score: flag values with |z| > 3. Assumes a roughly normal distribution; it breaks down when the outliers themselves inflate the mean and standard deviation, which shrinks their own z-scores.
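The weakness of the Z-score rule is easiest to see on a small sample. In the sketch below (function name and data are hypothetical), the extreme value 95 inflates the standard deviation so much that its own z-score stays just under 3, so nothing is flagged.

```python
import numpy as np

def zscore_outlier_mask(values, threshold=3.0):
    """Return True where |z| exceeds the threshold, using the sample's own mean and std."""
    values = np.asarray(values, dtype=float)
    std = values.std()  # population standard deviation (ddof=0)
    if std == 0:
        # All values identical: nothing can be an outlier
        return np.zeros(values.shape, dtype=bool)
    z = (values - values.mean()) / std
    return np.abs(z) > threshold

# Hypothetical sample: nine ordinary values and one extreme value
sample = np.array([10, 12, 11, 13, 12, 11, 14, 12, 13, 95])
z_mask = zscore_outlier_mask(sample)
print(z_mask.any())  # prints False: the extreme value masks itself
```

With many more inliers the same value would be flagged, so the rule is better suited to larger, roughly normal samples.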

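Isolation Forest: an algorithm that isolates anomalies by building trees that split on a randomly chosen feature at a random threshold. Anomalies tend to be separated in fewer splits than normal points, so a short average path length signals an outlier. Efficient on high-dimensional data.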

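from sklearn.ensemble import IsolationForest

# X_train is assumed to be an existing 2-D feature matrix (n_samples, n_features)
# contamination = expected share of outliers; 0.02 flags roughly 2% of rows
iso = IsolationForest(contamination=0.02, random_state=42)
labels = iso.fit_predict(X_train)   # -1 = outlier, 1 = inlier
X_clean = X_train[labels == 1]      # keep only the rows labelled as inliers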

Handling options

Remove: justified only for confirmed measurement errors (sensor malfunction, data entry mistake). Removing genuine extreme events biases the model.

Winsorize (cap): clip values at fixed percentiles, typically the 1st and 99th. This preserves the row but limits the influence of extremes, and is a good default for regression targets.
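Capping can be sketched with `np.percentile` and `np.clip`. The helper name `winsorize`, its parameters, and the synthetic data are assumptions for illustration; computing the bounds from the training data only is a common practice assumed here, not something stated above.

```python
import numpy as np

def winsorize(train_values, values, lower_pct=1, upper_pct=99):
    """Clip `values` to percentile bounds computed from `train_values`."""
    lo, hi = np.percentile(train_values, [lower_pct, upper_pct])
    return np.clip(values, lo, hi)

# Synthetic data: 1000 ordinary values plus two injected extremes
rng = np.random.default_rng(0)
train = np.append(rng.normal(50, 5, size=1000), [500.0, -300.0])

capped = winsorize(train, train)
print(len(capped) == len(train))   # no rows are dropped
print(capped.min(), capped.max())  # extremes are pulled in to the percentile bounds
```

The same bounds can then be reused for validation or test values by passing them as the second argument.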

Transform: a log, square-root, or Box-Cox transformation compresses the scale, reducing outlier influence while keeping the row. Log and Box-Cox require strictly positive values.
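A small sketch of the log option, using made-up right-skewed values. It uses `np.log1p` (log of 1 + x) rather than a plain log so that zeros would also be handled; that choice is an assumption of this example.

```python
import numpy as np

# Hypothetical right-skewed values with one very large entry
amounts = np.array([5.0, 8.0, 12.0, 20.0, 35.0, 10_000.0])

log_amounts = np.log1p(amounts)  # log(1 + x), defined for x >= 0

# How far the largest value sits from the median, before and after
ratio_raw = amounts.max() / np.median(amounts)
ratio_log = log_amounts.max() / np.median(log_amounts)
print(ratio_raw > ratio_log)  # prints True: the extreme is compressed

# The transform is invertible, so predictions can be mapped back
recovered = np.expm1(log_amounts)
```

Every row is kept; only the scale changes, so the extreme value still carries its rank information.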

Robust algorithms: HuberRegressor for regression, median-based statistics, or tree-based models that are naturally insensitive to magnitude.

from sklearn.preprocessing import RobustScaler

# RobustScaler uses median and IQR instead of mean and std
scaler = RobustScaler()
X_scaled = scaler.fit_transform(X_train)
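The scaler above only rescales features. For the HuberRegressor option, a rough comparison against ordinary least squares might look like the sketch below. The data is synthetic (true slope 2.0, with five corrupted targets) and all variable names are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

# Synthetic linear data: y = 2x + 1 plus small noise
rng = np.random.default_rng(0)
X = np.linspace(0, 10, 100).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0 + rng.normal(0, 0.5, size=100)

# Corrupt the last five targets to simulate outliers
y[-5:] += 100.0

ols = LinearRegression().fit(X, y)
huber = HuberRegressor().fit(X, y)

print("OLS slope:  ", ols.coef_[0])
print("Huber slope:", huber.coef_[0])
```

On data like this the Huber fit should stay noticeably closer to the true slope, because its loss grows linearly rather than quadratically for large residuals.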

Decision framework

  1. Investigate: is this a data quality issue or a real event?
  2. If data error: remove or correct.
  3. If real but extreme: try Winsorization or a log transform first.
  4. If the model must handle real extremes at inference time: use a robust algorithm rather than removing training outliers.
