datarekha
Statistics & Probability · Medium · Asked at Amazon, Uber, Google, Stripe

How do you handle outliers statistically, and how do you decide whether to remove them?

The short answer

Handling outliers starts with understanding whether they are errors, rare genuine observations, or leverage points that reveal real signal. The appropriate response — removal, transformation, robust estimation, or explicit modelling — depends entirely on their cause, not on how extreme they look.

How to think about it

Reflexively dropping outliers is a form of data manipulation. The right approach is to diagnose first, then choose the statistically defensible response.

Step 1 — Diagnose the outlier

Data error: sensor malfunction, typo (age = 999), unit mismatch (dollars vs thousands of dollars). Fix or remove after documenting.

Genuine extreme value: a customer who spent $50 000 in a single transaction is real and potentially the most important record in the dataset. Removing it distorts your model of customer lifetime value.

Structural outlier / change point: an observation from a different underlying process (a product recall, a system outage). May need to be modelled separately or flagged as a stratum.

Detection methods

  • IQR rule: flag values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR (Tukey fences). Appropriate for roughly symmetric distributions.
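  • Z-score: flag |z| > 3. Problematic because the mean and SD are themselves inflated by the outliers, so a few extreme values can pull every z-score below the cutoff and hide one another (masking).
  • Modified Z-score: uses the median and MAD (median absolute deviation) instead of the mean and SD, M_i = 0.6745 · (x_i - median) / MAD, with |M_i| > 3.5 as a common cutoff. Robust to masking.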
  • Mahalanobis distance: multivariate generalisation; detects multivariate outliers invisible in individual dimensions.
  • Isolation Forest / LOF: algorithmic methods for high-dimensional data.
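The first three univariate rules above can be compared with a small sketch. It uses only the Python standard library; the function names and the sample values are made up for illustration, and the 3.5 cutoff for the modified Z-score is a common convention rather than something fixed by the method.

```python
import statistics

def iqr_flags(xs, k=1.5):
    """Flag values outside the Tukey fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [x < lo or x > hi for x in xs]

def z_flags(xs, cutoff=3.0):
    """Flag |z| > cutoff using the (non-robust) mean and sample SD."""
    mean = statistics.mean(xs)
    sd = statistics.stdev(xs)
    return [abs(x - mean) / sd > cutoff for x in xs]

def modified_z_flags(xs, cutoff=3.5):
    """Flag |M| > cutoff using the median and MAD instead of mean and SD."""
    med = statistics.median(xs)
    mad = statistics.median([abs(x - med) for x in xs])
    if mad == 0:
        raise ValueError("MAD is zero; the modified Z-score is undefined")
    return [0.6745 * abs(x - med) / mad > cutoff for x in xs]

# Illustrative data: ten ordinary values plus two extreme ones.
values = [10, 11, 12, 11, 10, 12, 11, 13, 10, 12, 95, 100]

def flagged(flags):
    """Return the values whose flag is True."""
    return [v for v, f in zip(values, flags) if f]

print("IQR rule:        ", flagged(iqr_flags(values)))
print("Z-score:         ", flagged(z_flags(values)))
print("Modified Z-score:", flagged(modified_z_flags(values)))
```

On this sample the two extreme values inflate the standard deviation enough that the plain Z-score flags nothing, while the IQR rule and the modified Z-score both flag 95 and 100. That is the masking effect in miniature.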

Statistical responses

Response                  When appropriate
Remove                    Confirmed data error; document the decision
Winsorise / cap           Retain the observation but limit its influence; common in financial data
Log or Box-Cox transform  Compresses the scale; appropriate when the distribution is right-skewed
Robust estimators         Use median instead of mean; use Huber loss or quantile regression instead of OLS
Model explicitly          Mixture model or heavy-tailed distribution (t-distribution for regression errors)
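A minimal sketch of three of these responses (winsorising, a log transform, and robust alternatives to the mean and to squared-error loss), again using only the standard library. The helper names, the 5th/95th percentile caps, the Huber threshold `delta=1.0`, and the spend data are all assumptions chosen for illustration.

```python
import math
import statistics

def winsorise(xs, lower_pct=5, upper_pct=95):
    """Clamp values to percentile bounds instead of dropping them."""
    cuts = statistics.quantiles(xs, n=100, method="inclusive")  # 99 cut points
    lo, hi = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [min(max(x, lo), hi) for x in xs]

def huber_loss(residual, delta=1.0):
    """Quadratic for small residuals, linear beyond delta."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * a * a
    return delta * (a - 0.5 * delta)

# Made-up spend data: 99 ordinary values and one very large transaction.
spend = list(range(1, 100)) + [50_000]

capped = winsorise(spend)
logged = [math.log1p(x) for x in spend]  # log(1 + x); needs x >= 0

print("mean / median, raw:   ", statistics.mean(spend), statistics.median(spend))
print("mean / median, capped:", statistics.mean(capped), statistics.median(capped))
print("max / median, raw:    ", max(spend) / statistics.median(spend))
print("max / median, logged: ", max(logged) / statistics.median(logged))
print("loss at residual 10, squared vs Huber:", 0.5 * 10.0 ** 2, huber_loss(10.0))
```

The median is the same before and after capping, whereas the mean moves substantially. This is why the choice of response should follow from the diagnosis: capping the large transaction is a modelling decision, not a neutral clean-up.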

Influence vs leverage

Leverage measures how extreme an observation is in the predictor space (hat matrix diagonal h_ii). Influence measures how much the fitted model changes if the point is removed (Cook's distance combines leverage and residual size). High leverage is not automatically harmful: a high-leverage point that lies on the trend barely changes the fit. A point with both high leverage and a large residual is typically highly influential.

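Cook's D_i = h_ii · r_i² / (p · (1 - h_ii)²)

Here r_i = e_i / s is the raw residual e_i divided by the residual standard error s, and p is the number of fitted coefficients (including the intercept). If r_i is instead the internally studentised residual e_i / (s · √(1 - h_ii)), the same quantity is h_ii · r_i² / (p · (1 - h_ii)).

Values of Cook's D > 4/n (n is the number of observations) or > 1, depending on convention, warrant investigation.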
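To make leverage and Cook's distance concrete, here is a sketch for simple linear regression (one predictor plus an intercept, so p = 2), where the hat-matrix diagonal has the closed form h_ii = 1/n + (x_i - x̄)² / Σ(x_j - x̄)². The function name and the data are hypothetical; for multiple regression a statistics library would normally do this work.

```python
def cooks_distance(xs, ys):
    """Leverage and Cook's D for y = a + b*x fitted by ordinary least squares."""
    n = len(xs)
    p = 2  # intercept and slope
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    s2 = sum(e * e for e in residuals) / (n - p)  # residual variance estimate

    leverage = [1 / n + (x - x_bar) ** 2 / sxx for x in xs]
    # D_i = e_i^2 / (p * s^2) * h_ii / (1 - h_ii)^2
    cooks = [
        (e * e / (p * s2)) * h / (1 - h) ** 2
        for e, h in zip(residuals, leverage)
    ]
    return leverage, cooks

# Made-up data: eight points near y = 2x, plus one far-out x that is off trend.
x = [1, 2, 3, 4, 5, 6, 7, 8, 20]
y = [2.1, 3.9, 6.2, 8.0, 9.8, 12.1, 14.0, 16.1, 20.0]

leverage, cooks = cooks_distance(x, y)
threshold = 4 / len(x)
flagged = [i for i, d in enumerate(cooks) if d > threshold]

for i, (h, d) in enumerate(zip(leverage, cooks)):
    print(f"i={i}  x={x[i]:>2}  leverage={h:.3f}  cooks_d={d:.3f}")
print("flagged at D > 4/n:", flagged)
```

The last observation has both the highest leverage and a sizeable residual, so it is the only one flagged. Whether to act on it still depends on the diagnosis: a flag is a prompt to investigate, not a removal rule.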
