datarekha
Statistics & Probability · Medium · Asked at Google, Meta, Stripe, Netflix

What does the Central Limit Theorem actually say, and why does it matter?

The short answer

The CLT states that, for independent draws from a population with finite variance, the sampling distribution of the standardized sample mean converges to a normal distribution as the sample size grows, regardless of the shape of the population distribution. It is the theoretical foundation for confidence intervals, hypothesis tests, and many machine-learning approximations, but it describes the distribution of the mean, not the raw data.

How to think about it

Nail the exact statement first — the CLT is about the sample mean, not the data — then give the practical convergence rate and the conditions under which it breaks down.

The precise statement

Let X₁, X₂, …, Xₙ be i.i.d. random variables with mean μ and finite variance σ² > 0. Define the sample mean as X̄ = (1/n) Σᵢ Xᵢ. Then:

√n (X̄ − μ) / σ → N(0, 1) in distribution as n → ∞

Equivalently, X̄ is approximately N(μ, σ²/n) for large n. The standard deviation of this sampling distribution, σ/√n, is the standard error of the mean.
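The statement can be checked with a small simulation. The sketch below is illustrative only: it assumes an Exponential(1) population (mean 1, standard deviation 1), and the sample size, replication count, and seed are arbitrary choices, not values from the text. It standardizes each sample mean and compares the spread of the means with σ/√n.

```python
import math
import random
import statistics


def simulate_sample_means(n, reps, rng):
    """Draw `reps` samples of size `n` from Exponential(1) and return their means."""
    return [
        statistics.fmean(rng.expovariate(1.0) for _ in range(n))
        for _ in range(reps)
    ]


rng = random.Random(0)          # fixed seed so the run is repeatable
mu, sigma = 1.0, 1.0            # mean and sd of the assumed Exponential(1) population
n, reps = 50, 5000              # arbitrary illustrative sizes

means = simulate_sample_means(n, reps, rng)

# Spread of the sample means versus the standard error sigma / sqrt(n).
empirical_se = statistics.stdev(means)
theoretical_se = sigma / math.sqrt(n)

# Standardize each mean; under the CLT these should look roughly N(0, 1).
z = [math.sqrt(n) * (m - mu) / sigma for m in means]
share_within_196 = sum(abs(v) <= 1.96 for v in z) / reps

print(f"empirical SE   = {empirical_se:.4f}")
print(f"theoretical SE = {theoretical_se:.4f}")
print(f"share of |z| <= 1.96 = {share_within_196:.3f} (about 0.95 if normal)")
```

The raw exponential draws stay strongly right-skewed throughout; only the distribution of their means moves toward the bell shape.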

[Figure: a right-skewed population (left) and the sampling distribution of X̄ (right), which approaches N(μ, σ²/n) as n grows.]
However skewed the population, the distribution of the sample mean approaches a normal shape as n increases, provided the variance is finite.

How large does n need to be?

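A common rule of thumb is n ≥ 30, but the n actually required depends on skewness. By the Berry–Esseen theorem, the error of the normal approximation shrinks at rate 1/√n, with a constant proportional to the standardized third absolute moment E|X − μ|³ / σ³, so more skewed or heavier-tailed populations need larger samples. For a nearly symmetric distribution, n = 10 may suffice. For highly skewed data (e.g., income, click-through rates), n may need to be in the hundreds before the approximation is reliable. Bootstrap or permutation methods are often safer when the approximation is in doubt.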
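One way to see the dependence on n is to measure how often a nominal 95% interval actually covers the true mean. The sketch below is a rough illustration: it assumes a Lognormal(0, 1) population as a stand-in for skewed data, and the sample sizes, replication count, and seed are arbitrary. Exact coverage figures will vary with those choices.

```python
import math
import random


def z_interval_coverage(n, reps, rng, z_crit=1.96):
    """Share of intervals (mean +/- z_crit * SE) that contain the true mean."""
    true_mean = math.exp(0.5)  # mean of the assumed Lognormal(mu=0, sigma=1)
    hits = 0
    for _ in range(reps):
        sample = [rng.lognormvariate(0.0, 1.0) for _ in range(n)]
        center = math.fsum(sample) / n
        # Sample standard deviation with the usual n - 1 denominator.
        sd = math.sqrt(math.fsum((x - center) ** 2 for x in sample) / (n - 1))
        se = sd / math.sqrt(n)
        if abs(center - true_mean) <= z_crit * se:
            hits += 1
    return hits / reps


rng = random.Random(1)  # fixed seed so the run is repeatable
coverage = {n: z_interval_coverage(n, reps=2000, rng=rng) for n in (10, 30, 300)}

for n, share in coverage.items():
    print(f"n={n:>3}: coverage of nominal 95% interval = {share:.3f}")
```

For a population this skewed, coverage at small n typically falls well short of 95% and improves only gradually, which is why n ≥ 30 should be treated as a starting point rather than a guarantee.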

Why it matters in practice

  • A/B testing: Even if metric distributions are skewed (e.g., revenue per user), mean differences between large groups are approximately normal — enabling t-tests and z-tests.
  • Confidence intervals: The 95% interval X̄ ± 1.96 × SE assumes approximate normality of the sample mean, not of the data.
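  • Mini-batch gradient descent: A mini-batch gradient is an average of per-example gradients, so with large mini-batches the gradient noise is often modeled as approximately Gaussian. (Full-batch gradient descent has no sampling noise.)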
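The A/B testing and confidence interval cases above can be sketched together as a two-sample z-test on a skewed metric. Everything here is hypothetical: `two_sample_z_test` is a helper written for this example, and the lognormal "revenue per user" data, group sizes, and lift are made up for illustration.

```python
import math
import random
from statistics import NormalDist, fmean, variance


def two_sample_z_test(control, treatment, confidence=0.95):
    """Large-sample z-test and confidence interval for a difference in means."""
    diff = fmean(treatment) - fmean(control)
    # Unpooled standard error: each group's variance divided by its size.
    se = math.sqrt(
        variance(control) / len(control) + variance(treatment) / len(treatment)
    )
    z = diff / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2)
    return {
        "diff": diff,
        "se": se,
        "z": z,
        "p_value": p_value,
        "ci": (diff - z_crit * se, diff + z_crit * se),
    }


rng = random.Random(3)  # fixed seed so the run is repeatable
# Hypothetical skewed revenue-per-user data; the treatment has a built-in lift.
control = [rng.lognormvariate(0.0, 1.0) for _ in range(5000)]
treatment = [rng.lognormvariate(0.2, 1.0) for _ in range(5000)]

result = two_sample_z_test(control, treatment)
print(f"difference in means = {result['diff']:.3f}")
print(f"z = {result['z']:.2f}, p = {result['p_value']:.2g}")
print(f"95% CI = ({result['ci'][0]:.3f}, {result['ci'][1]:.3f})")
```

The test never assumes that revenue itself is normal. It relies only on the two group means being approximately normal, which is the part the CLT supplies at these sample sizes.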

Conditions that can break it

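  • Infinite variance (Cauchy distribution, Pareto with α ≤ 2): the classical CLT does not apply. For the Cauchy distribution the mean itself is undefined, and the sample mean of n observations has the same distribution as a single observation.
  • Dependent observations (time series, clustered data): the i.i.d. assumption is violated, and the naive σ/√n standard error is typically too small when observations are positively correlated.
  • Extremely heavy tails with small samples: the CLT still holds if the variance is finite, but convergence can be too slow for the normal approximation to be useful at practical n.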
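The infinite-variance case is easy to see numerically. The sketch below is illustrative, with arbitrary sample sizes, replication count, and seed. It compares how the spread of sample means changes with n for standard Cauchy draws and for standard normal draws, using the interquartile range as the measure of spread because the Cauchy variance does not exist.

```python
import math
import random
import statistics


def standard_cauchy(rng):
    """Inverse-CDF draw from the standard Cauchy distribution."""
    return math.tan(math.pi * (rng.random() - 0.5))


def standard_normal(rng):
    """Draw from N(0, 1)."""
    return rng.gauss(0.0, 1.0)


def iqr_of_sample_means(draw, n, reps, rng):
    """Interquartile range of `reps` sample means, each over `n` draws."""
    means = [statistics.fmean(draw(rng) for _ in range(n)) for _ in range(reps)]
    q1, _, q3 = statistics.quantiles(means, n=4)
    return q3 - q1


rng = random.Random(2)  # fixed seed so the run is repeatable
sizes = (10, 400)
cauchy_iqr = {n: iqr_of_sample_means(standard_cauchy, n, 1000, rng) for n in sizes}
normal_iqr = {n: iqr_of_sample_means(standard_normal, n, 1000, rng) for n in sizes}

for n in sizes:
    print(f"n={n:>3}: IQR of means, Cauchy = {cauchy_iqr[n]:.3f}, "
          f"normal = {normal_iqr[n]:.3f}")
```

The normal means tighten roughly like 1/√n, while the Cauchy means stay about as spread out as a single observation (IQR near 2), so averaging more data does not help.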