What are long-tailed (heavy-tailed) distributions and why do they matter in practice?
A long-tailed distribution has tail probabilities that decay much more slowly than the exponential — meaning extreme events are far more common than a Gaussian model would predict. They appear in internet traffic, wealth, natural language word frequency, and insurance claims, and they invalidate many standard statistical techniques that assume thin tails.
How to think about it
Define formally, give concrete examples, then explain which modelling and operational decisions change because of heavy tails — this is the applied payoff that interviewers care about.
Formal definition
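A distribution is heavy-tailed if its tail decays more slowly than any exponential, that is, if P(X > x) e^(λx) is unbounded as x → ∞ for every λ > 0. The most common special case, and the one used throughout this answer, is the power-law tail:
P(X > x) ~ C x^(-α) as x → ∞, for some α > 0 and constant C > 0
The tail index α controls which moments exist. For α > 2 the variance is finite; for 1 < α ≤ 2 the variance is infinite but the mean exists; for α ≤ 1 even the mean is infinite (e.g., Pareto with α ≤ 1). Not every heavy-tailed distribution is a power law: the lognormal is heavy-tailed even though all of its moments are finite. Compare to the Gaussian, whose tails decay roughly as exp(-x^2/2), which is vanishingly fast.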
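The moment thresholds quoted above follow from a single integral. As a sketch, assume an exact Pareto law with scale x_m > 0, so that P(X > x) = (x_m / x)^α for x ≥ x_m; the symbols x_m and k are introduced here for illustration only. The k-th moment is then:

```latex
\mathbb{E}[X^k]
  = \int_{x_m}^{\infty} x^{k}\,\frac{\alpha\, x_m^{\alpha}}{x^{\alpha+1}}\,dx
  = \alpha\, x_m^{\alpha} \int_{x_m}^{\infty} x^{k-\alpha-1}\,dx
  = \begin{cases}
      \dfrac{\alpha\, x_m^{k}}{\alpha - k}, & k < \alpha,\\[1ex]
      \infty, & k \ge \alpha.
    \end{cases}
```

Setting k = 1 shows the mean is finite only for α > 1, and setting k = 2 shows the second moment (and hence the variance) is finite only for α > 2.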
The Pareto (80-20) rule
The Pareto distribution is the canonical power-law distribution. The 80-20 rule (20% of inputs generate 80% of outputs) corresponds to a Pareto distribution with α ≈ 1.16; a smaller α concentrates outcomes even more heavily in the top few. In practice this shows up as:
- Top 1% of customers driving 50%+ of revenue
- Top 0.01% of web pages receiving 99% of traffic
- A handful of words accounting for half of all text (Zipf’s law)
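To make the link between the tail index and concentration concrete, here is a minimal sketch that assumes an exact Pareto law with α > 1. Under that assumption the top fraction p of items holds a share p^(1 - 1/α) of the total. The function names `top_share` and `alpha_for_share` are illustrative, not from any library.

```python
import math


def top_share(p: float, alpha: float) -> float:
    """Share of the total held by the top fraction p under an exact Pareto law."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    if alpha <= 1.0:
        raise ValueError("the mean is infinite for alpha <= 1, so shares are undefined")
    return p ** (1.0 - 1.0 / alpha)


def alpha_for_share(p: float, share: float) -> float:
    """Tail index at which the top fraction p holds `share` of the total."""
    return 1.0 / (1.0 - math.log(share) / math.log(p))


# Tail index implied by the 80-20 rule.
alpha_80_20 = alpha_for_share(0.2, 0.8)
print(f"alpha for 80-20: {alpha_80_20:.3f}")

# Share held by the top 1% under that same tail index.
print(f"top 1% share: {top_share(0.01, alpha_80_20):.3f}")
```

Under this idealised model the 80-20 tail index already implies that the top 1% holds a little over half of the total, which is consistent with the first bullet above. Real data rarely follows an exact Pareto law, so treat the numbers as orders of magnitude.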
Why thin-tail models fail
When you assume Gaussian tails (or any other thin-tailed model) on heavy-tailed data:
- Risk underestimation: The one-sided probability of a 5-sigma event is about 3 × 10⁻⁷ under Gaussian assumptions. In a Pareto-tailed distribution it can be 1,000 to 1,000,000 times higher, depending on the tail index.
- Mean as a summary: The mean may be a poor descriptor — or not even exist — when the tail is heavy enough. Median or percentiles are safer.
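- CLT convergence: The classical CLT requires finite variance. For α < 2 the variance is infinite, suitably normalised sums converge to a non-Gaussian stable law rather than a Gaussian, and sample variance estimates keep growing with n. For α only slightly above 2 the Gaussian limit still applies, but convergence is very slow.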
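The risk-underestimation point can be checked with closed-form tail probabilities. The sketch below compares the one-sided Gaussian 5-sigma probability with the probability that a Pareto variable exceeds its own mean plus five standard deviations. The choice α = 3 and scale x_m = 1 is an assumption made purely for illustration, and the helper names are hypothetical.

```python
import math


def gaussian_tail(k: float) -> float:
    """One-sided P(Z > k) for a standard normal, via the complementary error function."""
    return 0.5 * math.erfc(k / math.sqrt(2.0))


def pareto_k_sigma_tail(k: float, alpha: float, x_m: float = 1.0) -> float:
    """P(X > mean + k * sd) for an exact Pareto(alpha, x_m); needs alpha > 2."""
    if alpha <= 2.0:
        raise ValueError("the standard deviation is infinite for alpha <= 2")
    mean = alpha * x_m / (alpha - 1.0)
    sd = x_m / (alpha - 1.0) * math.sqrt(alpha / (alpha - 2.0))
    threshold = mean + k * sd
    # Pareto survival function: P(X > x) = (x_m / x)^alpha for x >= x_m.
    return (x_m / threshold) ** alpha


p_gauss = gaussian_tail(5.0)
p_pareto = pareto_k_sigma_tail(5.0, alpha=3.0)
ratio = p_pareto / p_gauss
print(f"Gaussian: {p_gauss:.2e}  Pareto: {p_pareto:.2e}  ratio: {ratio:.0f}")
```

With these assumed parameters the ratio comes out on the order of 10⁴, inside the range quoted in the first bullet. The exact figure moves with α, so it is an illustration rather than a general constant.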
Practical implications for data science
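Log-transform before modelling: For revenue, session lengths, and file sizes, a log transform compresses the right tail. The log of a lognormal variable is exactly Gaussian, and the log of a Pareto variable is exponentially distributed, so the transformed data are thin-tailed and their variance is far more stable for linear models. Remember that a model fitted on the log scale predicts log-scale quantities, so back-transform with care.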
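A small simulation illustrates what the log transform does to a power-law sample. This is a sketch under stated assumptions: the data are drawn from an exact Pareto law with tail index `ALPHA = 1.5` and scale 1 via inverse-transform sampling, and the seed, sample size, and helper name `max_over_median` are arbitrary choices for the example. For such data the logged values should follow an exponential distribution with mean and standard deviation both equal to 1/α.

```python
import math
import random
import statistics

random.seed(0)
ALPHA = 1.5   # assumed tail index (infinite variance on the raw scale)
N = 100_000

# Inverse-transform sampling: if U is uniform on (0, 1], U^(-1/alpha) is Pareto(alpha, 1).
raw = [(1.0 - random.random()) ** (-1.0 / ALPHA) for _ in range(N)]
logged = [math.log(x) for x in raw]


def max_over_median(values):
    """Crude tail-heaviness indicator: how far the largest value sits from the median."""
    return max(values) / statistics.median(values)


log_mean = statistics.fmean(logged)
log_sd = statistics.pstdev(logged)

print(f"raw    max/median: {max_over_median(raw):.1f}")
print(f"logged max/median: {max_over_median(logged):.1f}")
print(f"logged mean: {log_mean:.3f}  sd: {log_sd:.3f}  (theory: {1 / ALPHA:.3f})")
```

The logged sample has stable first and second moments even though the raw sample has infinite variance, which is the property linear models rely on.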
Use robust or quantile-based metrics: Report median, 95th/99th percentiles for latency SLOs rather than means. A mean latency can look healthy while p99 is on fire.
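The gap between mean and tail percentiles is easy to reproduce with a toy data set. The latencies below are hypothetical (980 requests at 20 ms and 20 requests at 2000 ms), and `percentile` is a simple nearest-rank helper written for this example rather than a library function.

```python
import math
import statistics

# Hypothetical latencies in milliseconds: 980 fast requests and 20 slow ones.
latencies_ms = [20.0] * 980 + [2000.0] * 20


def percentile(values, q):
    """Nearest-rank percentile: smallest value with at least q% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


mean_ms = statistics.fmean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p95_ms = percentile(latencies_ms, 95)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms:.1f} p50={p50_ms} p95={p95_ms} p99={p99_ms}")
# prints: mean=59.6 p50=20.0 p95=20.0 p99=2000.0
```

A mean of about 60 ms would pass most latency targets, yet one request in fifty takes two seconds. Only the p99 exposes that.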
Adjust A/B test duration: Heavy-tailed revenue metrics inflate variance, requiring more samples for adequate power. Winsorising (capping extreme values) reduces variance but introduces bias — document the trade-off.
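The variance and bias trade-off of winsorising can be measured directly. The sketch below uses simulated per-user revenue from an assumed Pareto law (tail index 2.5, minimum value 10) and caps values at the 99th percentile. The parameters, the seed, and the helper `winsorise_upper` are all illustrative assumptions. Because the sample size needed for a given effect size and power scales with the metric's variance, the variance ratio approximates the relative sample size required.

```python
import math
import random
import statistics

random.seed(1)
ALPHA = 2.5  # assumed tail index for the simulated revenue metric

# Simulated per-user revenue: Pareto(ALPHA) with minimum value 10.
revenue = [10.0 * (1.0 - random.random()) ** (-1.0 / ALPHA) for _ in range(20_000)]


def winsorise_upper(values, q):
    """Cap values at the nearest-rank q-th percentile; returns (capped_values, cap)."""
    ordered = sorted(values)
    cap = ordered[math.ceil(q * len(ordered) / 100) - 1]
    return [min(v, cap) for v in values], cap


capped, cap = winsorise_upper(revenue, 99)

raw_var = statistics.pvariance(revenue)
capped_var = statistics.pvariance(capped)
bias = statistics.fmean(capped) - statistics.fmean(revenue)  # negative: capping lowers the mean

print(f"cap at p99: {cap:.2f}")
print(f"variance ratio (capped / raw): {capped_var / raw_var:.3f}")
print(f"bias in the mean: {bias:.3f}")
```

The variance ratio shows how much shorter the test could run, and the bias shows what the capped metric no longer measures. Reporting both is one way to document the trade-off.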
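Model the tail separately: Extreme value theory (EVT) is purpose-built for tail estimation. The standard peaks-over-threshold approach fits a Generalised Pareto Distribution to the exceedances over a high threshold and uses that fit, rather than the model for the bulk of the data, to estimate extreme quantiles.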
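As a lightweight illustration of estimating the tail on its own, the sketch below uses the Hill estimator of the tail index α. This is a simpler stand-in for a full Generalised Pareto fit, not a replacement for it: it uses only the k largest observations and assumes a power-law tail. The simulated data, the seed, and the choice k = 2,000 are assumptions for the example. In practice the estimate is sensitive to k and is usually inspected over a range of values.

```python
import math
import random


def hill_tail_index(values, k):
    """Hill estimate of the tail index alpha from the k largest observations."""
    if not 0 < k < len(values):
        raise ValueError("k must satisfy 0 < k < len(values)")
    ordered = sorted(values, reverse=True)
    threshold = ordered[k]  # the (k+1)-th largest value acts as the tail threshold
    if threshold <= 0:
        raise ValueError("the Hill estimator needs a positive threshold")
    mean_log_excess = sum(math.log(x / threshold) for x in ordered[:k]) / k
    return 1.0 / mean_log_excess


random.seed(2)
TRUE_ALPHA = 1.5  # assumed tail index of the simulated data

# Exact Pareto(TRUE_ALPHA, 1) sample via inverse-transform sampling.
sample = [(1.0 - random.random()) ** (-1.0 / TRUE_ALPHA) for _ in range(50_000)]

alpha_hat = hill_tail_index(sample, k=2_000)
print(f"estimated alpha: {alpha_hat:.2f} (true value {TRUE_ALPHA})")
```

An estimate at or below 2 is a warning that variance-based methods are unreliable for the metric, and an estimate near or below 1 means even the sample mean should not be trusted.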