Statistics & Probability | Medium | Asked at Google, Netflix, Airbnb, Stripe

What is bootstrapping, and when should you use resampling methods?

For: Data Scientist, Data Analyst, ML Engineer

The short answer

Bootstrapping estimates the sampling distribution of a statistic by repeatedly drawing samples with replacement from the observed data and computing the statistic on each resample. It is most useful when the analytic sampling distribution is unknown or intractable, or when the sample is too small to trust asymptotic approximations.

How to think about it

Bootstrapping is a simulation-based method for quantifying uncertainty without relying on closed-form distributional assumptions. It exploits the empirical distribution of your data as a proxy for the true population distribution.

The algorithm

Given n observed data points x₁, …, xₙ:

Draw a resample of size n with replacement from {x₁, …, xₙ}.
Compute the statistic of interest θ̂* on the resample.
Repeat the two steps above B times (typically B = 1 000 to 10 000).
The empirical distribution of the B replicates {θ̂*₁, …, θ̂*_B} approximates the sampling distribution of θ̂.
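The steps above can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not a reference implementation: the function name `bootstrap_distribution`, the toy data, and the fixed seed are all assumptions made for the example.

```python
import random
import statistics


def bootstrap_distribution(data, statistic, n_resamples=2000, seed=0):
    """Return n_resamples bootstrap replicates of statistic(data)."""
    rng = random.Random(seed)  # fixed seed only so the sketch is reproducible
    n = len(data)
    replicates = []
    for _ in range(n_resamples):
        # Draw a resample of size n with replacement from the observed data
        resample = rng.choices(data, k=n)
        # Compute the statistic of interest on the resample
        replicates.append(statistic(resample))
    return replicates


# Hypothetical toy data, made up purely for illustration
sample = [2.1, 3.4, 1.9, 5.6, 4.2, 3.3, 2.8, 6.1, 3.9, 4.4]
reps = bootstrap_distribution(sample, statistics.mean, n_resamples=1000)

# The spread of the replicates is the bootstrap estimate of the standard error
boot_se = statistics.stdev(reps)
print(f"mean = {statistics.mean(sample):.2f}, bootstrap SE = {boot_se:.2f}")
```

Passing the statistic as a function keeps the resampling loop unchanged whether you want a mean, a median, or something custom.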

Worked example — median salary confidence interval

You have salary data for 80 employees. The sampling distribution of the median has no simple closed form. With B = 2 000 bootstrap resamples:

Each resample draws 80 values with replacement.
Compute the median of each resample → 2 000 bootstrap medians.
Sort them. The 2.5th and 97.5th percentiles form a 95% percentile bootstrap CI.

If the 2.5th and 97.5th percentiles of the bootstrap medians are $72 000 and $91 000, your 95% CI is [$72 000, $91 000], with no formula for the median’s standard error required.
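A sketch of this worked example follows. The salary values are synthetic (drawn from a log-normal distribution chosen only to look salary-like), so the interval it prints will not match the $72 000 to $91 000 figures quoted above; the variable names and seed are likewise assumptions.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed for reproducibility of the sketch

# Synthetic salaries for 80 employees (NOT real data)
salaries = [round(rng.lognormvariate(11.3, 0.35)) for _ in range(80)]

B = 2000
boot_medians = [
    statistics.median(rng.choices(salaries, k=len(salaries)))  # 80 draws with replacement
    for _ in range(B)
]

# n=40 yields cut points every 2.5%, so the first and last are the
# 2.5th and 97.5th percentiles of the bootstrap medians
cuts = statistics.quantiles(boot_medians, n=40, method="inclusive")
ci_low, ci_high = cuts[0], cuts[-1]

print(f"sample median: {statistics.median(salaries):,.0f}")
print(f"95% percentile bootstrap CI: [{ci_low:,.0f}, {ci_high:,.0f}]")
```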

Types of bootstrap confidence intervals

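Method	How	When to use
Percentile	2.5th and 97.5th quantiles of the bootstrap distribution	Simple; works best when the bootstrap distribution is roughly symmetric and centred on θ̂
Basic (reflected)	Reflects the bootstrap quantiles around θ̂: [2θ̂ - q(97.5%), 2θ̂ - q(2.5%)]	Corrects for a shift between the bootstrap distribution and θ̂
BCa (bias-corrected and accelerated)	Adjusts the percentile levels for bias and skewness	Usually more accurate than percentile or basic; a common general-purpose default
Studentised (bootstrap-t)	Standardises each resample by its own SE	Good finite-sample coverage; computationally expensive because each resample needs an SE estimate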
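To make the first two rows concrete, the sketch below computes a percentile interval and a basic (reflected) interval from the same set of bootstrap replicates. The data are a hypothetical right-skewed sample, and BCa and studentised intervals are deliberately left out because they need extra machinery (jackknife estimates and per-resample standard errors).

```python
import random
import statistics

rng = random.Random(1)  # fixed seed for reproducibility of the sketch

# Hypothetical right-skewed sample (exponential draws), for illustration only
data = [rng.expovariate(1.0) for _ in range(50)]
theta_hat = statistics.mean(data)

# Bootstrap replicates of the mean
reps = [statistics.mean(rng.choices(data, k=len(data))) for _ in range(4000)]

# 2.5th and 97.5th percentiles of the bootstrap distribution
cuts = statistics.quantiles(reps, n=40, method="inclusive")
q_lo, q_hi = cuts[0], cuts[-1]

# Percentile interval: use the bootstrap quantiles directly
percentile_ci = (q_lo, q_hi)

# Basic interval: reflect the quantiles around the point estimate
basic_ci = (2 * theta_hat - q_hi, 2 * theta_hat - q_lo)

print("estimate     :", round(theta_hat, 3))
print("percentile CI:", tuple(round(v, 3) for v in percentile_ci))
print("basic CI     :", tuple(round(v, 3) for v in basic_ci))
```

Both intervals have the same width by construction; they differ only in where they sit relative to θ̂, which matters when the bootstrap distribution is skewed.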

When to use bootstrap vs analytic methods

Use bootstrap when:

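The statistic has no simple analytic sampling distribution (median, Gini coefficient, difference between two correlations).
The sample size is modest and asymptotic normality may not have kicked in yet. With very small samples the bootstrap is unreliable too, because the empirical distribution is then a poor proxy for the population.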
The data are not normally distributed and parametric assumptions are questionable.

Use analytic methods when:

The sampling distribution is well-established (mean with large n, proportion with normal approximation).
Computational cost matters and the analytic formula is accurate.
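As a quick illustration of the second case, the sketch below compares the textbook standard error of the mean, s / √n, with a bootstrap estimate on the same data. The sample is simulated (normal draws with made-up parameters), so the numbers are only indicative; the point is that the two estimates should land close together, which is why the formula is usually enough for a mean with large n.

```python
import math
import random
import statistics

rng = random.Random(7)  # fixed seed for reproducibility of the sketch

# Simulated sample of n = 200 (hypothetical parameters, not real data)
data = [rng.gauss(100, 15) for _ in range(200)]
n = len(data)

# Analytic standard error of the mean: s / sqrt(n)
analytic_se = statistics.stdev(data) / math.sqrt(n)

# Bootstrap standard error: spread of the resampled means
reps = [statistics.mean(rng.choices(data, k=n)) for _ in range(2000)]
bootstrap_se = statistics.stdev(reps)

print(f"analytic SE  = {analytic_se:.3f}")
print(f"bootstrap SE = {bootstrap_se:.3f}")
```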

Permutation testing

Permutation testing is a related resampling method: shuffle the group labels to build the null distribution for a two-sample test. Unlike the bootstrap (typically used for confidence intervals, resampling with replacement), permutation tests target p-values, reassign labels without replacement, and are exact under the exchangeability assumption.
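A minimal sketch of a two-sided permutation test for a difference in means is shown below. It samples random label shuffles rather than enumerating every permutation, so the p-value is a Monte Carlo approximation of the exact one. The function name `permutation_test`, the group values, and the seed are assumptions for illustration.

```python
import random
import statistics


def permutation_test(a, b, n_permutations=5000, seed=0):
    """Two-sided Monte Carlo permutation test for a difference in means."""
    rng = random.Random(seed)  # fixed seed for reproducibility of the sketch
    observed = statistics.mean(a) - statistics.mean(b)
    pooled = list(a) + list(b)
    n_a = len(a)
    extreme = 0
    for _ in range(n_permutations):
        # Shuffling the pooled values is equivalent to shuffling group labels
        rng.shuffle(pooled)
        diff = statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1
    # The +1 terms count the observed labelling itself, so p is never 0
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Hypothetical measurements for two groups (made up for illustration)
control = [12.1, 11.8, 13.0, 12.4, 11.5, 12.9, 12.2, 11.9]
treatment = [13.2, 13.9, 12.8, 14.1, 13.5, 13.0, 14.4, 13.7]

observed, p_value = permutation_test(treatment, control)
print(f"observed difference = {observed:.3f}, p-value = {p_value:.4f}")
```

With groups this small, enumerating all label assignments would also be feasible and would give the exact p-value; random shuffles are the usual fallback when the number of permutations is too large.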
