What does the Central Limit Theorem actually say, and why does it matter?
The CLT states that the sampling distribution of the sample mean, once centered and scaled, converges to a normal distribution as the sample size grows, whatever the shape of the underlying population distribution, provided its variance is finite. It is the theoretical foundation for confidence intervals, hypothesis tests, and many machine-learning approximations, but it applies to the distribution of the mean, not to the raw data.
How to think about it
Nail the exact statement first — the CLT is about the sample mean, not the data — then give the practical convergence rate and the conditions under which it breaks down.
The precise statement
Let X₁, X₂, …, Xₙ be i.i.d. random variables with mean μ and finite variance σ² > 0. Define the sample mean as X_bar = (1/n) * sum(X_i). Then:
sqrt(n) * (X_bar - μ) / σ → N(0, 1) in distribution as n → ∞
In practical terms, X_bar is approximately N(μ, σ² / n) for large n. The standard deviation of this sampling distribution is σ / sqrt(n), the standard error of the mean.
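A quick simulation can make the statement concrete. The sketch below is only an illustration under assumed settings: it draws from an Exponential(1) population (mean 1, standard deviation 1), and the sample size, replicate count, seed, and the helper name simulate_sample_means are arbitrary choices, not part of the theorem.

```python
import math
import random
import statistics

def simulate_sample_means(n, reps, seed=0):
    """Return `reps` sample means, each over n i.i.d. Exponential(1) draws."""
    rng = random.Random(seed)
    return [statistics.fmean(rng.expovariate(1.0) for _ in range(n))
            for _ in range(reps)]

mu, sigma = 1.0, 1.0      # mean and standard deviation of Exponential(1)
n, reps = 100, 5000       # illustrative sizes, chosen arbitrarily

means = simulate_sample_means(n, reps)
standard_error = sigma / math.sqrt(n)        # theoretical SE of the mean
empirical_sd = statistics.stdev(means)       # observed spread of the means

# Standardize as in the theorem: sqrt(n) * (X_bar - mu) / sigma.
z = [(m - mu) / standard_error for m in means]
share_within_1_96 = sum(abs(v) <= 1.96 for v in z) / reps

print(f"theoretical standard error:   {standard_error:.4f}")
print(f"empirical SD of sample means: {empirical_sd:.4f}")
print(f"share of |z| <= 1.96:         {share_within_1_96:.3f} (normal theory: 0.950)")
```

Although each individual draw is strongly right-skewed, the spread of the means should sit close to sigma / sqrt(n), and roughly 95% of the standardized means should fall within ±1.96.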
How large does n need to be?
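A common rule of thumb is n ≥ 30, but the sample size actually required depends on skewness. The Berry-Esseen theorem makes this precise: when the third absolute moment is finite, the worst-case error of the normal approximation shrinks like 1/sqrt(n), with a constant proportional to E|X - μ|³ / σ³. For a nearly symmetric distribution, n = 10 may suffice. For highly skewed data (e.g., income, click-through rates), n may need to be in the hundreds before the approximation is reliable. Bootstrap or permutation methods are safer when the approximation is uncertain.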
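To see how skewness changes the required n, one option is to check how often the standardized sample mean lands beyond ±1.96; under a good normal approximation each tail should hold about 2.5%. The sketch below compares a symmetric population (Uniform(0, 1)) with a skewed one (lognormal with log-scale parameters 0 and 1). The distributions, sample sizes, seed, and the helper name tail_rates are assumptions made for illustration.

```python
import math
import random
import statistics

def tail_rates(draw, mu, sigma, n, reps, rng):
    """Share of standardized sample means below -1.96 and above +1.96."""
    se = sigma / math.sqrt(n)
    lower = upper = 0
    for _ in range(reps):
        z = (statistics.fmean(draw(rng) for _ in range(n)) - mu) / se
        lower += z < -1.96
        upper += z > 1.96
    return lower / reps, upper / reps

def draw_uniform(rng):
    # Symmetric case: Uniform(0, 1), mean 1/2, variance 1/12.
    return rng.random()

def draw_lognormal(rng):
    # Skewed case: lognormal with log-scale mean 0 and log-scale SD 1.
    return rng.lognormvariate(0.0, 1.0)

uniform_mu, uniform_sigma = 0.5, math.sqrt(1.0 / 12.0)
lognormal_mu = math.exp(0.5)                              # exp(0 + 1/2)
lognormal_sigma = math.sqrt((math.e - 1.0) * math.e)      # sqrt((e - 1) * e)

rng = random.Random(1)
reps = 2000
results = {}
for n in (10, 30, 300):
    results["uniform", n] = tail_rates(
        draw_uniform, uniform_mu, uniform_sigma, n, reps, rng)
    results["lognormal", n] = tail_rates(
        draw_lognormal, lognormal_mu, lognormal_sigma, n, reps, rng)

for (name, n), (lower, upper) in sorted(results.items()):
    print(f"{name:9s} n={n:3d}  lower tail={lower:.3f}  upper tail={upper:.3f}")
```

For the uniform population both tails should already be near 0.025 at n = 10. For the lognormal population the lower tail is nearly empty at small n and the upper tail is too heavy, and the two only move toward 0.025 as n reaches the hundreds.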
Why it matters in practice
- A/B testing: Even if metric distributions are skewed (e.g., revenue per user), mean differences between large groups are approximately normal — enabling t-tests and z-tests.
- Confidence intervals: The X_bar ± 1.96 * SE formula assumes normality of the mean, not the data.
- Batch gradient descent: With large mini-batches, gradient noise is approximately Gaussian.
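The A/B-testing and confidence-interval points above can be sketched together as a large-sample z-test on a difference in means. This is a minimal sketch, not a production test: the function name two_sample_z is hypothetical, the revenue data are simulated from lognormal distributions, and the group sizes and the 0.05 shift in the treatment group are made-up values for illustration.

```python
import math
import random
import statistics
from statistics import NormalDist

def two_sample_z(control, treatment, confidence=0.95):
    """Large-sample z-test and CI for mean(treatment) - mean(control)."""
    diff = statistics.fmean(treatment) - statistics.fmean(control)
    # Standard error of the difference, using each group's sample variance.
    se = math.sqrt(statistics.variance(control) / len(control)
                   + statistics.variance(treatment) / len(treatment))
    z = diff / se
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))       # two-sided
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2.0)  # about 1.96 for 95%
    return diff, (diff - z_crit * se, diff + z_crit * se), p_value

# Simulated, heavily skewed "revenue per user" for two large groups.
rng = random.Random(42)
control = [rng.lognormvariate(0.0, 1.0) for _ in range(5000)]
treatment = [rng.lognormvariate(0.05, 1.0) for _ in range(5000)]

diff, (ci_low, ci_high), p_value = two_sample_z(control, treatment)
print(f"difference in means: {diff:.4f}")
print(f"95% CI: ({ci_low:.4f}, {ci_high:.4f})")
print(f"two-sided p-value: {p_value:.4f}")
```

The normal reference distribution is justified here only by the group sizes. With small or extremely skewed samples, a bootstrap or permutation test would be the more cautious choice.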
Conditions that can break it
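- Infinite variance (Cauchy distribution, Pareto with α ≤ 2): the classical CLT does not apply. For the Cauchy distribution, the sample mean has the same distribution as a single observation, so averaging does not reduce the spread at all.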
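- Dependent observations (time series, clustered data): the i.i.d. assumption is violated, and with positive correlation σ / sqrt(n) understates the true standard error.
- Extremely heavy tails with small samples: the CLT still holds in the limit when the variance is finite, but convergence can be too slow for the normal approximation to be useful at the sample sizes available.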
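The infinite-variance case can be illustrated by comparing how the spread of the sample mean changes with n for a normal and a Cauchy population. The sketch below uses the interquartile range (IQR) as the spread measure because the Cauchy distribution has no finite variance. The sample sizes, replicate count, seed, and the helper name iqr_of_sample_means are illustrative assumptions.

```python
import math
import random
import statistics

def iqr_of_sample_means(draw, n, reps, rng):
    """Interquartile range of `reps` sample means, each over n draws."""
    means = [statistics.fmean(draw(rng) for _ in range(n)) for _ in range(reps)]
    q1, _, q3 = statistics.quantiles(means, n=4)
    return q3 - q1

def draw_normal(rng):
    return rng.gauss(0.0, 1.0)

def draw_cauchy(rng):
    # Standard Cauchy via the inverse CDF: tan(pi * (U - 1/2)).
    return math.tan(math.pi * (rng.random() - 0.5))

rng = random.Random(7)
reps = 1000
iqr = {}
for n in (10, 400):
    iqr["normal", n] = iqr_of_sample_means(draw_normal, n, reps, rng)
    iqr["cauchy", n] = iqr_of_sample_means(draw_cauchy, n, reps, rng)

# Theory: the normal IQR is about 1.349 / sqrt(n); the Cauchy IQR stays near 2.
for (name, n), value in sorted(iqr.items()):
    print(f"{name:6s} n={n:3d}  IQR of sample means = {value:.3f}")
```

The normal IQR should shrink by roughly sqrt(40) between n = 10 and n = 400, while the Cauchy IQR should stay near 2 at both sizes: collecting more data does not tighten the estimate.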