How do you handle outliers statistically, and how do you decide whether to remove them?
Handling outliers starts with understanding whether they are errors, rare genuine observations, or leverage points that reveal real signal. The appropriate response — removal, transformation, robust estimation, or explicit modelling — depends entirely on their cause, not on how extreme they look.
How to think about it
Reflexively dropping outliers is a form of data manipulation. The right approach is to diagnose first, then choose the statistically defensible response.
Step 1 — Diagnose the outlier
Data error: sensor malfunction, typo (age = 999), unit mismatch (dollars vs thousands). Fix or remove after documenting.
Genuine extreme value: a customer who spent $50 000 in a single transaction is real and potentially the most important record in the dataset. Removing it distorts your model of customer lifetime value.
Structural outlier / change point: an observation from a different underlying process (a product recall, a system outage). May need to be modelled separately or flagged as a stratum.
Detection methods
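- IQR rule: flag values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR (Tukey fences). Works best for roughly symmetric distributions; on strongly skewed data it flags much of the long tail.
- Z-score: flag |z| > 3. Problematic because the mean and SD are themselves inflated by the outliers, so extreme points can hide themselves and each other (masking).
- Modified Z-score: replaces the mean and SD with the median and MAD (median absolute deviation), M = 0.6745 × (x - median) / MAD, with |M| > 3.5 as a common cut-off. Much more resistant to masking.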
- Mahalanobis distance: multivariate generalisation; detects multivariate outliers invisible in individual dimensions.
- Isolation Forest / LOF: algorithmic methods for high-dimensional data.
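As a rough illustration of the detection rules above, the sketch below implements the IQR rule, the plain and modified Z-scores, and Mahalanobis distance with NumPy. The function names, the 3.5 cut-off for the modified Z-score, the handling of a zero MAD, and the sample data are all assumptions chosen for illustration.

```python
import numpy as np

def iqr_flags(x, k=1.5):
    """Tukey fences: flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)

def zscore_flags(x, threshold=3.0):
    """Classic Z-score using the (non-robust) mean and sample SD."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std(ddof=1)
    return np.abs(z) > threshold

def modified_zscore_flags(x, threshold=3.5):
    """Modified Z-score using the median and MAD."""
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    if mad == 0:
        # Score is undefined when MAD is zero; this sketch flags nothing.
        return np.zeros(x.shape, dtype=bool)
    return np.abs(0.6745 * (x - med) / mad) > threshold

def mahalanobis_distances(X):
    """Distance of each row from the sample mean, scaled by the covariance."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
    return np.sqrt(np.einsum("ij,jk,ik->i", centered, cov_inv, centered))

# Hypothetical univariate sample with two large values at the end.
values = np.array([10, 12, 11, 13, 12, 11, 10, 12, 13, 95, 110], dtype=float)
print("z-score flags:      ", np.where(zscore_flags(values))[0])
print("IQR flags:          ", np.where(iqr_flags(values))[0])
print("modified-z flags:   ", np.where(modified_zscore_flags(values))[0])

# Hypothetical bivariate sample: y tracks x closely, and one appended point
# breaks that relationship while staying unremarkable on each axis.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = x + 0.1 * rng.normal(size=200)
X = np.vstack([np.column_stack([x, y]), [1.5, -1.5]])
distances = mahalanobis_distances(X)
print("largest Mahalanobis distance at row", int(np.argmax(distances)))
```

On this sample the plain Z-score flags nothing, because the two large values inflate the SD enough to hide themselves, while the IQR rule and the modified Z-score both flag them. In the bivariate case the appended point is ordinary on each axis but has the largest Mahalanobis distance, since it contradicts the correlation between the two variables.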
Statistical responses
| Response | When appropriate |
|---|---|
| Remove | Confirmed data error; document the decision |
| Winsorise / cap | Retain the observation but limit its influence; common in financial data |
| Log or Box-Cox transform | Compresses the scale; appropriate when the distribution is right-skewed and the values are strictly positive |
| Robust estimators | Use median instead of mean; use Huber loss or quantile regression instead of OLS |
| Model explicitly | Mixture model or heavy-tailed distribution (t-distribution for regression errors) |
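A minimal sketch of three of the responses in the table, assuming a small hypothetical spend sample and a percentile-based cap. The `winsorise` helper, its 10th/90th percentile defaults, and the data are illustrative choices, not a recommended setting.

```python
import numpy as np

def winsorise(x, lower_pct=10, upper_pct=90):
    """Cap values at the given percentiles instead of dropping them."""
    x = np.asarray(x, dtype=float)
    lo, hi = np.percentile(x, [lower_pct, upper_pct])
    return np.clip(x, lo, hi)

# Hypothetical transaction amounts with one genuine extreme value.
spend = np.array([20, 25, 30, 35, 40, 45, 50, 55, 60, 50000], dtype=float)

# Winsorise / cap: every record is kept, but the extremes are pulled in.
# With so few points the caps are interpolated between neighbouring values.
capped = winsorise(spend)

# Log transform: compresses the right tail (valid here because spend > 0).
log_spend = np.log(spend)

# Robust estimator: the median barely reacts to the extreme value.
print("mean:          ", spend.mean())
print("median:        ", np.median(spend))
print("capped mean:   ", capped.mean())
print("log-scale mean:", log_spend.mean())
```

Capping and transforming both keep all ten records, which matters when the extreme value is genuine; only the removal row of the table discards information.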
Influence vs leverage
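Leverage measures how extreme an observation is in the predictor space (the hat matrix diagonal h_ii). Influence measures how much the fitted model changes if the point is removed; Cook’s Distance quantifies it by combining leverage and residual size. High leverage is not automatically harmful: a high-leverage point that sits on the fitted trend barely changes the fit. High leverage combined with a large residual typically is harmful, because that combination is what makes a point influential.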
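Cook's D_i = e_i² · h_ii / (p · s² · (1 - h_ii)²), where e_i is the raw residual, s² is the residual mean square, and p is the number of fitted coefficients. Written with the internally studentised residual r_i instead, this is D_i = r_i² · h_ii / (p · (1 - h_ii)).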
Values of Cook’s D > 4/n or > 1 (depending on convention) warrant investigation.
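The leverage and influence diagnostics can be sketched directly from the hat matrix. The example below uses the raw-residual form of Cook's distance for an ordinary least squares fit; the `cooks_distance` helper and the toy data are assumptions for illustration, and a statistics library would normally be used instead.

```python
import numpy as np

def cooks_distance(X, y):
    """Return (Cook's D, leverage) for an OLS fit.

    X is the (n, p) design matrix, including the intercept column.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    hat = X @ np.linalg.pinv(X)          # hat matrix H = X (X'X)^-1 X'
    h = np.diag(hat)                     # leverage h_ii
    resid = y - hat @ y                  # raw residuals e_i
    s2 = resid @ resid / (n - p)         # residual mean square
    d = resid**2 * h / (p * s2 * (1 - h) ** 2)
    return d, h

# Hypothetical data: ten points on the line y = 2x + 1, plus one point far
# out in x whose response is well below that line.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 30], dtype=float)
y = 2 * x + 1
y[-1] = 20.0

X = np.column_stack([np.ones_like(x), x])
D, h = cooks_distance(X, y)

threshold = 4 / len(y)
print("leverage of last point:", h[-1])
print("Cook's D of last point:", D[-1])
print("rows above 4/n:", np.where(D > threshold)[0])
```

The last point has both high leverage and a large deviation from the trend of the other ten, so it dominates the fit and is the only row above the 4/n cut-off in this sample.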