Statistics & Probability | Medium | Asked at Netflix, Meta, Two Sigma, Stripe

What are long-tailed (heavy-tailed) distributions and why do they matter in practice?

For: Data Scientist, Data Analyst, ML Engineer, AI / LLM Engineer

The short answer

A long-tailed distribution has tail probabilities that decay more slowly than any exponential, so extreme events are far more common than a Gaussian model would predict. Such distributions appear in internet traffic, wealth, natural language word frequency, and insurance claims, and they invalidate many standard statistical techniques that assume thin tails.

How to think about it

Define formally, give concrete examples, then explain which modelling and operational decisions change because of heavy tails — this is the applied payoff that interviewers care about.

Formal definition

A distribution is heavy-tailed if its tail decays more slowly than any exponential: e^(λx) · P(X > x) → ∞ as x → ∞ for every λ > 0. The most common special case is the power-law tail:

P(X > x) ~ C · x^(-α) as x → ∞, for some α > 0

Here α is the tail index: the smaller it is, the heavier the tail. For α > 2 the variance is finite; for 1 < α ≤ 2 the variance is infinite but the mean exists; for α ≤ 1 even the mean is infinite (e.g., Pareto with α ≤ 1). Compare to the Gaussian, whose tails decay roughly as exp(-x^2 / 2), vanishingly fast.
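The three regimes can be checked against the closed-form moments of an exact Pareto distribution. The sketch below is illustrative only: the function names are made up for this example, and it assumes a pure Pareto with scale x_min rather than a general power-law tail.

```python
import math

def pareto_survival(x, alpha, x_min=1.0):
    """P(X > x) for an exact Pareto(alpha, x_min) variable."""
    return 1.0 if x < x_min else (x_min / x) ** alpha

def pareto_mean(alpha, x_min=1.0):
    """Mean of Pareto(alpha, x_min); infinite when alpha <= 1."""
    return alpha * x_min / (alpha - 1) if alpha > 1 else math.inf

def pareto_variance(alpha, x_min=1.0):
    """Variance of Pareto(alpha, x_min); reported as infinite when alpha <= 2."""
    if alpha <= 2:
        return math.inf
    return alpha * x_min ** 2 / ((alpha - 1) ** 2 * (alpha - 2))

# One tail index from each regime described above.
for alpha in (0.8, 1.5, 3.0):
    print(f"alpha={alpha}: mean={pareto_mean(alpha)}, variance={pareto_variance(alpha)}")
```

With α = 0.8 both moments are infinite, with α = 1.5 only the mean is finite, and with α = 3 both are finite.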

The Pareto (80-20) rule

The Pareto distribution is the canonical power-law distribution. The 80-20 rule (20% of inputs generate 80% of outputs) is one specific Pareto case, corresponding to a tail index of α = log 5 / log 4 ≈ 1.16. In practice this shows up as:

Top 1% of customers driving 50%+ of revenue
Top 0.01% of web pages receiving 99% of traffic
A handful of words accounting for half of all text (Zipf’s law)
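For an exact Pareto distribution with α > 1, the share of the total held by the top fraction p of the population is p^(1 - 1/α). The sketch below uses that formula to show how concentration figures like the ones above follow from a single tail index; `top_share` is a name made up for this example, and real data will only approximately follow a Pareto.

```python
import math

def top_share(p, alpha):
    """Share of the total held by the top fraction p under Pareto(alpha), alpha > 1."""
    if alpha <= 1:
        raise ValueError("the total is infinite for alpha <= 1")
    return p ** (1 - 1 / alpha)

# Tail index at which the top 20% hold exactly 80% of the total.
alpha_80_20 = math.log(5) / math.log(4)

for p in (0.20, 0.01):
    print(f"top {p:.0%} hold {top_share(p, alpha_80_20):.1%} of the total")
```

Under the same α, the top 1% hold a little over half of the total, which is in line with the customer-revenue example in the list.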

Why thin-tail models fail

When you assume Gaussian tails (or any other light-tailed model) on heavy-tailed data:

Risk underestimation: The one-sided probability of exceeding 5 sigma is about 3 × 10⁻⁷ under Gaussian assumptions. Under a Pareto tail, exceeding the mean by 5 standard deviations can be 1,000 to 1,000,000 times more likely.
Mean as a summary: The mean may be a poor descriptor — or not even exist — when the tail is heavy enough. Median or percentiles are safer.
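CLT convergence: The classical CLT requires finite variance. For α < 2, normalised sample sums converge to a non-Gaussian stable law instead of a Gaussian, and the sample variance keeps growing with n rather than settling. Even for α slightly above 2, convergence to the Gaussian is very slow.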
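The risk-underestimation point can be made concrete by comparing the exact probability of exceeding "mean plus 5 standard deviations" under the two models. This is a sketch under stated assumptions: an exact Pareto with x_min = 1 and α = 3 (chosen so the standard deviation exists), with helper names invented for this example.

```python
import math

def gaussian_tail(k):
    """One-sided P(Z > k) for a standard normal variable."""
    return 0.5 * math.erfc(k / math.sqrt(2))

def pareto_k_sigma_tail(k, alpha, x_min=1.0):
    """P(X > mean + k * sd) for Pareto(alpha, x_min); requires alpha > 2."""
    if alpha <= 2:
        raise ValueError("standard deviation is infinite for alpha <= 2")
    mean = alpha * x_min / (alpha - 1)
    sd = math.sqrt(alpha * x_min ** 2 / ((alpha - 1) ** 2 * (alpha - 2)))
    threshold = mean + k * sd
    return (x_min / threshold) ** alpha

g = gaussian_tail(5)
p = pareto_k_sigma_tail(5, alpha=3.0)
print(f"Gaussian: {g:.2e}  Pareto(alpha=3): {p:.2e}  ratio: {p / g:,.0f}")
```

For this choice of α the ratio is on the order of ten thousand; other tail indices give different ratios.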

Practical implications for data science

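Log-transform before modelling: Revenue, session lengths, and file sizes are typical candidates. A log transform compresses the tail: log-normal data becomes exactly Gaussian, and a Pareto variable becomes exponential (light-tailed, though still skewed), which stabilises variance for linear models.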
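A small simulation shows what the log transform does to a power-law sample. This is a minimal sketch: the tail index α = 1.5, the seed, and the sample size are arbitrary choices for illustration, and `summary` is a helper defined only here.

```python
import math
import random
import statistics

rng = random.Random(0)
alpha = 1.5  # assumed tail index: finite mean, infinite variance

# random.paretovariate draws from a Pareto distribution with x_min = 1.
raw = [rng.paretovariate(alpha) for _ in range(50_000)]
logged = [math.log(x) for x in raw]

def summary(xs):
    """Return (mean, median, max) of a sample."""
    return statistics.mean(xs), statistics.median(xs), max(xs)

print("raw   : mean=%.2f median=%.2f max=%.1f" % summary(raw))
print("logged: mean=%.2f median=%.2f max=%.1f" % summary(logged))
```

On the raw scale a few observations dwarf the median; after the transform the values follow an exponential distribution with mean 1/α, so the largest point is only a modest multiple of the typical one.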

Use robust or quantile-based metrics: Report median, 95th/99th percentiles for latency SLOs rather than means. A mean latency can look healthy while p99 is on fire.
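The gap between the mean and the tail percentiles is easy to see on simulated data. The latencies below are hypothetical (a 10 ms floor with a Pareto tail, α = 1.8), and `percentile` is a simple nearest-rank helper written for this sketch, not a library function.

```python
import math
import random
import statistics

rng = random.Random(1)
# Hypothetical latencies in ms: 10 ms floor with a Pareto tail (alpha = 1.8).
latencies = [10 * rng.paretovariate(1.8) for _ in range(100_000)]

def percentile(xs, q):
    """Nearest-rank percentile of xs for q in (0, 100]."""
    ordered = sorted(xs)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]

mean = statistics.mean(latencies)
p50, p95, p99 = (percentile(latencies, q) for q in (50, 95, 99))
print(f"mean={mean:.1f} ms  p50={p50:.1f} ms  p95={p95:.1f} ms  p99={p99:.1f} ms")
```

The mean sits close to the median, while p99 is several times larger, which is why a healthy-looking mean says little about the slowest requests.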

Adjust A/B test duration: Heavy-tailed revenue metrics inflate variance, requiring more samples for adequate power. Winsorising (capping extreme values) reduces variance but introduces bias — document the trade-off.
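The variance-versus-bias trade-off of winsorising can be illustrated on simulated revenue. Everything here is an assumption for the sake of the example: the Pareto tail index of 2.2, the 99th-percentile cap, and the `winsorise` helper itself. In a real experiment the cap would normally be fixed in advance and applied identically to both arms.

```python
import math
import random
import statistics

def winsorise(xs, upper_q=0.99):
    """Cap values above the empirical upper_q quantile (one-sided winsorising)."""
    ordered = sorted(xs)
    idx = min(len(ordered) - 1, math.ceil(upper_q * len(ordered)) - 1)
    cap = ordered[idx]
    return [min(x, cap) for x in xs]

rng = random.Random(2)
revenue = [rng.paretovariate(2.2) for _ in range(20_000)]  # hypothetical per-user revenue
capped = winsorise(revenue)

for name, xs in (("raw", revenue), ("winsorised", capped)):
    mean, var = statistics.mean(xs), statistics.variance(xs)
    # Required sample size for a given relative effect scales with var / mean^2.
    print(f"{name:>10}: mean={mean:.3f}  variance={var:.3f}  var/mean^2={var / mean ** 2:.3f}")
```

The capped metric has a much smaller variance, and therefore needs fewer samples, but its mean is biased downward relative to the raw metric.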

Model the tail separately: Extreme value theory (EVT) is purpose-built for tail estimation. The standard approach is peaks over threshold: pick a high threshold and fit a Generalised Pareto Distribution to the exceedances.
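Fitting a Generalised Pareto Distribution usually relies on a statistics library. As a simpler stand-in that needs only the standard library, the sketch below uses the Hill estimator, a different but related EVT tool that estimates the tail index α from the k largest observations. The choice k = 2,000 and the simulated data are assumptions for illustration; on real data the estimate is sensitive to k and should be checked across a range of values.

```python
import math
import random

def hill_estimator(xs, k):
    """Hill estimate of the tail index alpha from the k largest observations."""
    ordered = sorted(xs, reverse=True)
    if not 0 < k < len(ordered):
        raise ValueError("k must satisfy 0 < k < len(xs)")
    threshold = ordered[k]  # the (k + 1)-th largest value
    if threshold <= 0:
        raise ValueError("the threshold must be positive")
    mean_log_excess = sum(math.log(x / threshold) for x in ordered[:k]) / k
    return 1.0 / mean_log_excess

rng = random.Random(3)
true_alpha = 2.5  # assumed tail index of the simulated data
sample = [rng.paretovariate(true_alpha) for _ in range(50_000)]

alpha_hat = hill_estimator(sample, k=2_000)
print(f"true alpha={true_alpha}  Hill estimate={alpha_hat:.2f}")
```

An estimate near or below 2 is a warning that variance-based methods are unreliable for that metric.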
