Estimation & confidence intervals
You never see the true parameter, only an estimate from a finite sample. Estimators, standard error, and confidence intervals are how you say how much to trust that number. This lesson also shows how the most-misunderstood interval in statistics actually works.
What you'll learn
- Estimators and their sampling distribution: your estimate is itself random
- Standard error as the spread of an estimate, and why it shrinks like 1/√n
- Bias vs variance of an estimator — and what 'consistent' means
- What a 95% confidence interval really claims (and the interpretation everyone gets wrong)
- The bootstrap — confidence intervals when you have no formula
Before you start
You measure the average session time from 200 users and get 4.2 minutes. But the true average — over all users, forever — you’ll never see. Your 4.2 is an estimate, and the honest next question is: how far off could it be?
Estimators are random
An estimator is any rule that turns a sample into a guess — the sample mean estimates the population mean. Run it on a different sample and you’d get a slightly different number. So the estimate has its own distribution, the sampling distribution, and its spread is the standard error:
SE(x̄) = σ / √n
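Here σ is the population standard deviation and n is the sample size. That √n is one of the central facts of statistics: to halve your uncertainty you
need four times the data. (The σ/√n formula holds for independent draws from a population with finite variance; the CLT adds that the sampling
distribution of the mean is approximately normal when n is large.)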
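To see the √n law in numbers, here is a minimal simulation sketch. It assumes normally distributed session times with a made-up mean of 4.2 and σ = 2.0 (illustration values, not real data), and the helper name `empirical_se` is hypothetical. It draws many samples of size n, takes the mean of each, and compares the spread of those means with σ / √n.

```python
import math
import random
import statistics

def empirical_se(n, sigma=2.0, mu=4.2, trials=2000, seed=0):
    """Standard deviation of the sample mean across many simulated samples of size n."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
        for _ in range(trials)
    ]
    return statistics.stdev(means)

sigma = 2.0  # assumed population standard deviation (illustration only)
for n in (50, 200, 800):
    simulated = empirical_se(n, sigma)
    formula = sigma / math.sqrt(n)
    print(f"n={n}: simulated SE={simulated:.3f}, sigma/sqrt(n)={formula:.3f}")
```

Each fourfold increase in n should roughly halve both columns.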
Bias and variance of an estimate
- Bias: does the estimator systematically miss? The sample mean is unbiased; the variance estimator that divides the sum of squared deviations by n instead of n−1 is biased (it underestimates on average).
- Variance: how much does it bounce around between samples? For an estimator this is SE².
- Consistent: does it converge to the truth as n → ∞? The sample mean does.
(Same bias/variance language as models, applied to estimates.)
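The bias of the divide-by-n variance estimator is easy to check by simulation. The sketch below assumes standard normal data (true variance 1.0), and the function name `average_variance_estimates` is made up for this example. It averages both versions of the estimator over many small samples.

```python
import random

def average_variance_estimates(n, sigma=1.0, trials=20000, seed=1):
    """Average the divide-by-n and divide-by-(n-1) variance estimates over many samples."""
    rng = random.Random(seed)
    total_by_n = 0.0
    total_by_n_minus_1 = 0.0
    for _ in range(trials):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        mean = sum(sample) / n
        squared_deviations = sum((x - mean) ** 2 for x in sample)
        total_by_n += squared_deviations / n
        total_by_n_minus_1 += squared_deviations / (n - 1)
    return total_by_n / trials, total_by_n_minus_1 / trials

for n in (5, 50):
    biased, unbiased = average_variance_estimates(n)
    print(f"n={n}: divide by n -> {biased:.3f}, divide by n-1 -> {unbiased:.3f} (true variance 1.0)")
```

On average the divide-by-n version comes out at (n−1)/n times the true variance, so the gap is large at n = 5 and nearly gone at n = 50. That shrinking bias is why an estimator can be biased and still consistent.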
Confidence intervals — and the trap
A 95% confidence interval is estimate ± z · SE (with z ≈ 1.96). Here’s
the catch almost everyone gets wrong. It does not mean “there’s a 95%
probability the true value is in this interval.” The true value is fixed;
it’s either in or out. What’s random is the interval. The honest
statement:
If you repeated the whole experiment many times, about 95% of the intervals you’d construct would contain the true value.
If you simulate many experiments and draw each one's interval, about 1 in 20 misses the true value on average, which is the 5% you signed up for. Crank n up and the
intervals get tighter, but the coverage stays 95%.
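The repeated-experiment reading of a confidence interval can be checked directly. This is a minimal sketch under stated assumptions: normally distributed data with a made-up true mean of 4.2 and σ = 2.0, and a hypothetical helper `simulate_intervals`. Each experiment builds estimate ± 1.96 · SE from its own sample, and the function counts how often the interval contains the true mean.

```python
import math
import random
import statistics

def simulate_intervals(n, true_mean=4.2, sigma=2.0, experiments=2000, z=1.96, seed=2):
    """Return (coverage, average width) of z-intervals over repeated experiments."""
    rng = random.Random(seed)
    hits = 0
    total_width = 0.0
    for _ in range(experiments):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        estimate = statistics.fmean(sample)
        # sigma is unknown in practice, so estimate it from the sample
        se = statistics.stdev(sample) / math.sqrt(n)
        low, high = estimate - z * se, estimate + z * se
        hits += low <= true_mean <= high  # True counts as 1
        total_width += high - low
    return hits / experiments, total_width / experiments

for n in (50, 200):
    cov, width = simulate_intervals(n)
    print(f"n={n}: coverage={cov:.3f}, average width={width:.3f}")
```

The width should roughly halve from n = 50 to n = 200 while coverage stays near 95%. Because σ is estimated from each sample, coverage at small n tends to land slightly below 95%; using a t critical value instead of z corrects for that.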
The bootstrap: a CI with no formula
When there’s no neat formula for SE (a median, a weird metric, a model
score), resample your data with replacement thousands of times, recompute
the statistic each time, and read the interval off the 2.5th and 97.5th
percentiles.
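Here is one way the percentile bootstrap described above might look for a median. The data are hypothetical (200 skewed session times drawn from an exponential distribution with mean 4.2), and `bootstrap_ci_95` is a name invented for this sketch, not a library function.

```python
import random
import statistics

def bootstrap_ci_95(data, statistic, resamples=2000, seed=3):
    """Percentile bootstrap: 2.5th and 97.5th percentiles of the resampled statistic."""
    rng = random.Random(seed)
    n = len(data)
    # Resample with replacement and recompute the statistic each time.
    replicates = [statistic(rng.choices(data, k=n)) for _ in range(resamples)]
    # quantiles(n=40) returns 39 cut points at 2.5%, 5%, ..., 97.5%.
    cuts = statistics.quantiles(replicates, n=40)
    return cuts[0], cuts[-1]

# Hypothetical skewed session times (minutes) for 200 users.
data_rng = random.Random(42)
session_times = [data_rng.expovariate(1 / 4.2) for _ in range(200)]

low, high = bootstrap_ci_95(session_times, statistics.median)
print(f"sample median = {statistics.median(session_times):.2f}")
print(f"95% bootstrap CI = ({low:.2f}, {high:.2f})")
```

Passing a different function as `statistic` (a trimmed mean, a model score) reuses the same procedure unchanged, which is the appeal of the method.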
Where this lives in ML
- Reporting metrics with error bars instead of a single accuracy number.
- A/B testing — the difference in conversion comes with a CI; if it straddles zero, you can’t claim a winner.
- Model comparison — is model A really better, or within noise?
- “We need more data”: the √n law tells you how much more buys how much less uncertainty (with sharply diminishing returns).
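For the A/B testing case, a rough sketch of the interval for a difference in conversion rates follows. It uses the normal-approximation (Wald) interval, which is one common choice rather than the only one, and the conversion counts are invented for illustration.

```python
import math

def diff_in_proportions_ci(conv_a, n_a, conv_b, n_b, z=1.96):
    """Approximate 95% CI for (rate of B) - (rate of A) using the normal approximation."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    diff = p_b - p_a
    return diff - z * se, diff + z * se

# Hypothetical experiment: 12.0% vs 13.8% conversion with 1,000 users per arm.
small = diff_in_proportions_ci(120, 1000, 138, 1000)
# Same rates with 10,000 users per arm.
large = diff_in_proportions_ci(1200, 10000, 1380, 10000)

for label, (low, high) in (("1,000 per arm", small), ("10,000 per arm", large)):
    verdict = "straddles zero" if low < 0 < high else "excludes zero"
    print(f"{label}: ({low:+.4f}, {high:+.4f}) {verdict}")
```

With these made-up numbers the same 1.8-point lift is inconclusive at 1,000 users per arm and clear at 10,000, and ten times the data narrows the interval by only about √10 ≈ 3.2 times.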
Quick check
Practice this in an interview
A 95% confidence interval means that if you repeated the sampling procedure many times and built an interval each time, 95% of those intervals would contain the true parameter. It does not mean there is a 95% probability that this specific interval contains the parameter.
Bias measures the systematic error of an estimator at a fixed sample size — whether its expected value equals the true parameter. Consistency is an asymptotic property — whether the estimator converges in probability to the true parameter as sample size grows to infinity. An estimator can be biased yet consistent, or unbiased yet inconsistent.
The CLT states that the sampling distribution of the (standardized) sample mean converges to a normal distribution as sample size grows, regardless of the shape of the underlying population distribution, provided the observations are independent and that distribution has finite variance. It is the theoretical foundation for confidence intervals, hypothesis tests, and many machine-learning approximations, but it applies to the distribution of the mean, not to the raw data.
Maximum likelihood estimation finds the parameter values that make the observed data most probable under the assumed model. Intuitively, you ask: given this data, which world would have been most likely to generate it?
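As a small illustration of that idea, assume a Bernoulli (yes/no) model and hypothetical counts of 138 conversions in 1,000 trials. A grid search over candidate values of p picks the one under which the observed data are most probable. For this model the maximizer is known to be the sample proportion, so the search should agree with successes / trials.

```python
import math

def bernoulli_log_likelihood(p, successes, trials):
    """Log-probability of the observed counts if each trial succeeds with probability p."""
    return successes * math.log(p) + (trials - successes) * math.log(1 - p)

successes, trials = 138, 1000  # hypothetical data

# Candidate parameter values 0.001, 0.002, ..., 0.999 (endpoints excluded to avoid log(0)).
grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=lambda p: bernoulli_log_likelihood(p, successes, trials))

print(f"grid-search MLE = {p_hat}, sample proportion = {successes / trials}")
```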