datarekha
Statistics & Probability · Easy · Asked at Goldman Sachs, Amazon, Meta, Airbnb

What is survivorship bias, and how does it distort statistical conclusions?

The short answer

Survivorship bias occurs when analysis is restricted to the subset of observations that 'survived' some selection process, ignoring the failures that did not. The surviving sample is systematically unrepresentative, inflating estimates of success and hiding the true risk.

How to think about it

Survivorship bias is a form of selection bias that is especially insidious because the missing data — the failures — leave no trace in the dataset. You are literally studying the wrong population.

Classic example — WWII aircraft armour

During World War II, the US military analysed bullet holes on planes that returned from missions. The holes clustered on the wings and fuselage. The naive recommendation was to reinforce those areas.

Abraham Wald pointed out the fatal flaw: the data came only from planes that returned. Planes hit in the engines or cockpit were far less likely to make it back, so their damage rarely appeared in the sample. The correct recommendation was to reinforce the areas with little or no observed damage, because hits there were the ones that brought planes down.
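
The aircraft argument can be checked with a small simulation. This is only a sketch: the section names, the per-hit loss probabilities, and the number of hits per plane are made-up values chosen for illustration, not historical figures. Hits land uniformly across sections, yet the sections that are most dangerous to be hit in are under-represented among the planes that return.

```python
import random

# Hypothetical sections and loss probabilities, chosen only for illustration.
SECTIONS = ["wings", "fuselage", "engine", "cockpit"]
P_LOSS_IF_HIT = {"wings": 0.05, "fuselage": 0.05, "engine": 0.60, "cockpit": 0.50}


def simulate(n_planes=20_000, seed=0):
    """Count hits per section for all planes and for returned planes only."""
    rng = random.Random(seed)
    all_hits = {s: 0 for s in SECTIONS}
    returned_hits = {s: 0 for s in SECTIONS}
    for _ in range(n_planes):
        # Each plane takes 1 to 4 hits, spread uniformly over the sections.
        hits = [rng.choice(SECTIONS) for _ in range(rng.randint(1, 4))]
        # The plane returns only if it survives every one of its hits.
        survived = all(rng.random() > P_LOSS_IF_HIT[s] for s in hits)
        for s in hits:
            all_hits[s] += 1
            if survived:
                returned_hits[s] += 1
    return all_hits, returned_hits


def shares(counts):
    """Convert raw hit counts into each section's share of the total."""
    total = sum(counts.values())
    return {s: counts[s] / total for s in counts}


all_hits, returned_hits = simulate()
true_share = shares(all_hits)          # what actually happened
observed_share = shares(returned_hits)  # what the analyst gets to see
for s in SECTIONS:
    print(f"{s:9s} all hits: {true_share[s]:.2f}   returned planes: {observed_share[s]:.2f}")
```

Every section receives about a quarter of all hits, but in the returned-plane sample the engine and cockpit shares shrink and the wing and fuselage shares grow. That pattern is exactly what misled the naive analysis.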

Financial example — mutual fund performance

A database of mutual fund returns built in 2026 from funds that existed continuously since 2016 will exclude every fund that closed or merged in that period, and those are typically the worst performers. Estimating average fund skill from the survivors therefore overstates industry performance. Published estimates vary with the period and database, but studies of US equity funds have commonly put the inflation at roughly 0.5 to 1.5 percentage points of return per year.
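
A minimal simulation shows the mechanism. It assumes every fund draws its yearly return from the same distribution (so no fund has any skill) and uses a hypothetical closure rule: a fund shuts once it has lost 20% of its starting value. The return parameters and the function name `simulate_funds` are illustrative assumptions, not calibrated to real data.

```python
import random
import statistics


def simulate_funds(n_funds=5_000, n_years=10, seed=1):
    """Return (mean yearly return over all funds, mean over surviving funds)."""
    rng = random.Random(seed)
    all_returns, survivor_returns = [], []
    for _ in range(n_funds):
        yearly, value, alive = [], 1.0, True
        for _ in range(n_years):
            # Assumed: identical return distribution for every fund (no skill).
            r = rng.gauss(0.05, 0.15)
            yearly.append(r)
            value *= 1 + r
            # Assumed closure rule: the fund shuts after losing 20% of its start value.
            if value < 0.8:
                alive = False
                break
        all_returns.extend(yearly)  # every fund-year that actually happened
        if alive:
            survivor_returns.extend(yearly)  # only funds still open at the end
    return statistics.mean(all_returns), statistics.mean(survivor_returns)


all_mean, survivor_mean = simulate_funds()
print(f"all funds:      {all_mean:.3%} per year")
print(f"survivors only: {survivor_mean:.3%} per year")
print(f"inflation:      {survivor_mean - all_mean:.3%} per year")
```

The survivors look better than the industry even though no fund is skilled. The gap comes entirely from dropping the funds that closed.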

How it appears in data science work

  • Training a churn model only on customers with at least 6 months of history excludes customers who churned immediately — the hardest-to-retain group.
  • Benchmarking model performance on products still sold today, ignoring discontinued lines that failed.
  • Scraping job postings to estimate salary ranges: postings that have already closed are missing, and if how quickly a role fills is related to its pay, the postings still open give a skewed picture.

Detection and remedies

  1. Ask: who is absent from this dataset and why? Map the data-collection process end to end.
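  2. Build the sample from historical snapshots that record everyone present at the start of the period, rather than from a single present-day slice that contains only the entities still alive.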
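  3. Apply inverse-probability-of-selection weighting to reweight surviving records toward the full population. This needs covariates that predict survival and are recorded for everyone, including those who dropped out.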
  4. Treat missingness as informative — model the selection mechanism explicitly.
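
The reweighting remedy can be sketched on synthetic data. The example assumes a starting roster is available, that each record carries a segment label observed for everyone, and that survival depends only on that segment. The segment names, outcome means, and survival probabilities are all hypothetical. If survival also depends on something unobserved, the weights will not remove the bias.

```python
import random
import statistics

rng = random.Random(2)

# Hypothetical population: two segments with different outcomes and survival odds.
SEGMENTS = {
    "low_risk": {"mean_outcome": 100.0, "p_survive": 0.9},
    "high_risk": {"mean_outcome": 40.0, "p_survive": 0.3},
}

# Starting roster: the segment is known for everyone, but in practice the
# outcome would only be observed for records that survive.
roster = []
for _ in range(20_000):
    seg = rng.choice(list(SEGMENTS))
    outcome = rng.gauss(SEGMENTS[seg]["mean_outcome"], 10.0)
    survived = rng.random() < SEGMENTS[seg]["p_survive"]
    roster.append((seg, outcome, survived))

# Ground truth, available here only because the data is simulated.
true_mean = statistics.mean(o for _, o, _ in roster)

# Naive estimate: average over the survivors only.
survivors = [(seg, o) for seg, o, s in roster if s]
naive_mean = statistics.mean(o for _, o in survivors)

# Estimate P(survive | segment) from the roster, which includes the dropouts.
p_hat = {}
for seg in SEGMENTS:
    flags = [s for g, _, s in roster if g == seg]
    p_hat[seg] = sum(flags) / len(flags)

# Weight each survivor by the inverse of its estimated survival probability,
# then take the weighted average of the observed outcomes.
weights = [1.0 / p_hat[seg] for seg, _ in survivors]
ipw_mean = sum(w * o for w, (_, o) in zip(weights, survivors)) / sum(weights)

print(f"true mean:  {true_mean:.1f}")
print(f"naive mean: {naive_mean:.1f}")
print(f"IPW mean:   {ipw_mean:.1f}")
```

The naive mean is pulled toward the low-risk segment because its members survive more often. The weighted mean gives each high-risk survivor more influence, standing in for the high-risk records that dropped out.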
