What is the correct interpretation of a 95% confidence interval?

A 95% confidence interval means that if you repeated the sampling procedure many times and built an interval each time, 95% of those intervals would contain the true parameter. It does not mean there is a 95% probability that this specific interval contains the parameter.

What is the difference between a biased estimator and an inconsistent estimator?

Bias measures the systematic error of an estimator at a fixed sample size — whether its expected value equals the true parameter. Consistency is an asymptotic property — whether the estimator converges in probability to the true parameter as sample size grows to infinity. An estimator can be biased yet consistent, or unbiased yet inconsistent.

What does the Central Limit Theorem actually say, and why does it matter?

The CLT states that, for independent draws from a population with finite variance, the sampling distribution of the sample mean approaches a normal distribution as sample size grows, whatever the shape of the underlying population distribution. It is the theoretical foundation for confidence intervals, hypothesis tests, and many machine-learning approximations, but it applies to the distribution of the mean, not to the raw data.

What is maximum likelihood estimation, and what is the intuition behind it?

Maximum likelihood estimation finds the parameter values that make the observed data most probable under the assumed model. Intuitively, you ask: given this data, which world would have been most likely to generate it?
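
To make that intuition concrete, here is a short worked derivation for one simple case. It assumes the data x_1, …, x_n are independent draws from a normal distribution with unknown mean μ and known standard deviation σ; that model is an illustrative assumption, not something the answer above requires.

```latex
\ell(\mu) = \sum_{i=1}^{n} \log\!\left[ \frac{1}{\sqrt{2\pi}\,\sigma}
            \exp\!\left( -\frac{(x_i - \mu)^2}{2\sigma^2} \right) \right]
          = -\frac{n}{2}\log\!\left(2\pi\sigma^2\right)
            - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2

\frac{d\ell}{d\mu} = \frac{1}{\sigma^2} \sum_{i=1}^{n} (x_i - \mu) = 0
\quad\Longrightarrow\quad
\hat{\mu}_{\mathrm{MLE}} = \frac{1}{n} \sum_{i=1}^{n} x_i = \bar{x}
```

Under this model, the "world most likely to have generated the data" is the one whose mean sits at the sample mean, so the familiar average is also the maximum likelihood estimate.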

Estimation & confidence intervals — Math for ML

The last lesson left us holding exactly one sample. The CLT cheerfully described a whole cloud of possible sample means — average 200 users many times and the averages form a tidy normal around the truth — but in real life you measure once. You time 200 users, the average comes out to 4.2 minutes, and the true average, over all users forever, you will never see. So the only honest question left is the one the CLT set up: your 4.2 is an estimate — how far off could it be?

Estimators are random

An estimator is any rule that turns a sample into a guess: the sample mean estimates the population mean, the sample median estimates the population median. The unsettling part is that the guess is itself a random thing. Run the same rule on a different 200 users and you would get a slightly different number. So the estimate has its own distribution — the sampling distribution the CLT just taught us to expect — and the spread of that distribution has a name, the standard error:

SE(x̄) = σ / √n

That √n is the single most consequential fact in applied statistics: to halve your uncertainty you need four times the data. It is the CLT wearing work clothes — the sampling distribution of the mean is approximately normal, and this is how wide it is.
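
You can check the √n law by brute force. The sketch below simulates many repeated experiments and compares the observed spread of the sample means with σ/√n. The population (normal, mean 4.2, σ = 1.5), the seed, and the helper name `empirical_se` are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)   # fixed seed so the run is repeatable
sigma = 1.5                       # assumed population standard deviation
n_repeats = 5_000                 # simulated experiments per sample size

def empirical_se(n):
    """Standard deviation of the sample mean across many simulated samples of size n."""
    samples = rng.normal(4.2, sigma, size=(n_repeats, n))
    return samples.mean(axis=1).std(ddof=1)

for n in (50, 200, 800):
    print(f"n={n:4d}  simulated SE={empirical_se(n):.4f}  "
          f"sigma/sqrt(n)={sigma / np.sqrt(n):.4f}")
```

Each step from 50 to 200 to 800 multiplies the data by four, and the simulated standard error should come out at roughly half the previous value each time.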

Bias and variance of an estimate

An estimator can be wrong in two distinct ways, and it is worth keeping them apart — they are the exact same two errors you will later meet in models, here applied to a single number.

Bias: does it systematically miss, sample after sample? The sample mean is unbiased; the variance estimate that divides by n (instead of n−1) is biased low.
Variance: how much does it bounce between samples? For the sample mean, that is SE² = σ²/n.
Consistent: does it home in on the truth as n → ∞? The sample mean does: its bias is zero and its variance shrinks in proportion to 1/n.

Low bias and low variance is the goal; the √n law is the price you pay to buy down the variance.
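
The bias of the divide-by-n variance estimate is easy to see in simulation. This is a minimal sketch, assuming a normal population with σ = 1.5 and a deliberately small sample size of 10; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
true_var = 1.5 ** 2            # assumed population variance (sigma = 1.5)
n = 10                         # a small sample makes the bias easy to see
samples = rng.normal(4.2, 1.5, size=(100_000, n))

# Average each estimator over many samples to approximate its expected value.
var_div_n = samples.var(axis=1, ddof=0).mean()           # divisor n
var_div_n_minus_1 = samples.var(axis=1, ddof=1).mean()   # divisor n - 1

print("true variance:                 ", true_var)
print("average estimate, divisor n:   ", round(var_div_n, 3))
print("average estimate, divisor n-1: ", round(var_div_n_minus_1, 3))
print("theory for divisor n:          ", true_var * (n - 1) / n)
```

The divisor-n estimate averages to (n−1)/n times the true variance, so it misses low. Because that factor tends to 1 as n grows, this estimator is biased yet still consistent.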

Confidence intervals — and the trap

Wrap the standard error around the estimate and you get a 95% confidence interval: roughly estimate ± z · SE, with z ≈ 1.96 (that number again — the 97.5th percentile of the normal the CLT handed us). Here is the catch the previous lesson dared you to answer, and almost everyone gets it wrong.

A 95% CI does not mean “there’s a 95% probability the true value is in this interval.” The true value is a fixed number — it is either in your interval or it is not; there is no probability left in it. The random thing is the interval itself, which jumps around as your sample does. So the honest statement is about the procedure, not this one run:

If you repeated the whole experiment many times, about 95% of the intervals you’d construct would contain the true value.

That is the whole resolution of the trap: the 95% is a property of the method, the long-run hit rate of your interval-building recipe — not a belief about the one interval in front of you.

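[Interactive figure: simulated 95% confidence intervals, one line per repeated sample, shown for n = 30]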

About 1 in 20 lines misses the true mean — exactly the 5% you signed up for. Raise n and every interval gets tighter (the √n at work), yet the miss rate holds stubbornly at 5%. Coverage is set by the 95% you chose, not by how much data you have; data buys you precision, not certainty.
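
The long-run hit rate can be simulated directly. The sketch below repeats the experiment many times for several sample sizes and counts how often estimate ± 1.96 · SE captures the truth. The "true" mean and σ are assumed values that we only know because we are simulating, and `coverage_and_width` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(7)
true_mean, sigma = 4.2, 1.5     # assumed truth, known only because we simulate
n_experiments = 10_000

def coverage_and_width(n):
    """Fraction of 95% intervals containing true_mean, and their average width."""
    samples = rng.normal(true_mean, sigma, size=(n_experiments, n))
    means = samples.mean(axis=1)
    se = samples.std(axis=1, ddof=1) / np.sqrt(n)   # SE estimated from each sample
    lower, upper = means - 1.96 * se, means + 1.96 * se
    hit = (lower <= true_mean) & (true_mean <= upper)
    return hit.mean(), (upper - lower).mean()

for n in (30, 120, 480):
    cov, width = coverage_and_width(n)
    print(f"n={n:4d}  coverage={cov:.3f}  average width={width:.3f}")
```

The width should shrink by about half at each step while the coverage stays near 95%. At small n the coverage tends to land slightly below 95%, because σ is itself estimated from the sample; swapping 1.96 for the matching t-distribution multiplier corrects that.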

The bootstrap: a CI with no formula

SE = σ/√n is lovely for the mean — but what about a median, a 90th percentile, a model’s F1 score? Most statistics have no tidy standard-error formula. The bootstrap sidesteps the algebra entirely: treat your sample as a stand-in for the population, resample it with replacement thousands of times, recompute the statistic on each fake sample, and read the interval straight off the 2.5th and 97.5th percentiles of the results.

import numpy as np
rng = np.random.default_rng(0)
data = rng.normal(4.2, 1.5, 200)        # 200 observed session times

# Classic CI for the mean (formula available)
se = data.std(ddof=1) / np.sqrt(len(data))
print("mean:", data.mean().round(3), " 95% CI:", (data.mean() - 1.96*se).round(3), "to", (data.mean() + 1.96*se).round(3))

# Bootstrap CI for the MEDIAN (no simple formula)
boot = [np.median(rng.choice(data, len(data), replace=True)) for _ in range(5000)]
print("median 95% bootstrap CI:", np.percentile(boot, [2.5, 97.5]).round(3))

mean: 4.223  95% CI: 4.023 to 4.423
median 95% bootstrap CI: [3.981 4.53 ]

The mean’s interval came from a formula; the median’s came from nothing but resampling, and yet both are legitimate (approximate) 95% intervals. The bootstrap is the Swiss-army CI: when no formula exists, the data will simulate its own sampling distribution for you.
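
The same recipe can be wrapped in a small reusable function and pointed at any statistic. This is a sketch of the percentile bootstrap under the assumptions above; `bootstrap_ci` and `p90` are hypothetical names invented for this example, not part of NumPy.

```python
import numpy as np

def bootstrap_ci(data, statistic, n_resamples=5000, level=0.95, seed=0):
    """Percentile bootstrap interval for statistic(data). Illustrative helper only."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    # Draw all resample indices at once: one row per fake sample.
    idx = rng.integers(0, len(data), size=(n_resamples, len(data)))
    stats = np.array([statistic(data[row]) for row in idx])
    alpha = 1.0 - level
    lower, upper = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lower, upper

rng = np.random.default_rng(0)
data = rng.normal(4.2, 1.5, 200)        # simulated session times, as in the example above

def p90(x):
    """90th percentile, a statistic with no tidy standard-error formula."""
    return np.percentile(x, 90)

print("90th percentile:", round(p90(data), 3))
print("95% bootstrap CI:", bootstrap_ci(data, p90))
```

Passing the statistic as a function is the design choice that matters here: the resampling loop never needs to know whether it is computing a median, a percentile, or a model score.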

Where this lives in ML

Reporting metrics with error bars instead of a single bare accuracy number.
A/B testing — the difference in conversion comes with a CI; if it straddles zero, you have no winner yet.
Model comparison — is model A really better, or just bouncing inside the noise?
“We need more data” — the √n law quantifies exactly how much more buys how much less uncertainty, with sharply diminishing returns.
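
For the A/B case, one common way to put a CI on the difference is the normal-approximation interval for two independent proportions. The sketch below assumes that approach; the function name and the conversion counts are made up for illustration.

```python
import math

def diff_in_proportions_ci(conv_a, n_a, conv_b, n_b, z=1.96):
    """Normal-approximation CI for p_b - p_a. Hypothetical helper for illustration."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Independent groups: the variances of the two sample proportions add.
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    diff = p_b - p_a
    return diff - z * se, diff + z * se

# Made-up counts: 480 of 10,000 users convert on A, 530 of 10,000 on B.
low, high = diff_in_proportions_ci(480, 10_000, 530, 10_000)
print(f"95% CI for the difference: [{low:.4f}, {high:.4f}]")
print("straddles zero, no winner yet" if low < 0 < high else "interval excludes zero")
```

With these made-up counts the interval straddles zero, so a half-point lift on 10,000 users per arm is not yet distinguishable from noise. The approximation is rough when counts are small or rates sit near 0 or 1.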

In one breath

An estimator turns a sample into a guess, and because a different sample gives a different guess, the estimate is itself random. Its spread is the standard error, SE = σ/√n, the CLT in work clothes (halve the uncertainty, pay 4× the data). Estimators can err two ways: bias (systematic miss) and variance (bounce between samples); the sample mean is unbiased and consistent, with a variance that shrinks like 1/n. A 95% confidence interval, estimate ± 1.96·SE, is the most-misread object in statistics: the truth is fixed, the interval is random, so 95% is the long-run fraction of such intervals that capture the truth, not the probability that this one does. When a statistic has no SE formula (a median, a model score), the bootstrap resamples the data with replacement thousands of times and reads the interval off the 2.5th/97.5th percentiles.

Practice

Quick check

Q1. You compute a 95% CI of [4.0, 4.4] for the mean. Which statement is correct?

Q2. To cut your standard error in half, you need…

Q3. When would you reach for a bootstrap confidence interval?

A question to carry forward

A confidence interval answers “how much do I trust this one number?” But look again at the list above — almost every real use was secretly a comparison. Did variant B beat variant A? Is model A genuinely better than B? And we slipped a rule of thumb past you without justifying it: “if the interval for the difference straddles zero, you have no winner.”

That rule is the seed of an entire framework. So here is the thread onward: how do you turn “is this difference real, or just the noise the CLT predicts?” into a decision with a controlled error rate? What is a null hypothesis, what exactly is a p-value measuring (it is not the probability your hypothesis is true — another trap), and why does “straddles zero” turn out to be the same statement as “not significant at the 5% level”?
