Statistics & Probability | Medium | Asked at Google, Uber, Airbnb, Meta

What is regression to the mean, and why does it fool analysts into seeing treatment effects that do not exist?

For: Data Scientist, Data Analyst, ML Engineer

The short answer

Regression to the mean is the statistical tendency for extreme measurements to be followed by measurements closer to the population mean, purely due to random noise — not because of any intervention. Analysts who intervene after observing an extreme value and then observe improvement often incorrectly attribute the recovery to their action.

How to think about it

Regression to the mean was first documented by Francis Galton in 1886 when studying the heights of parents and children. It is a pure statistical artefact, not a causal process, yet it masquerades as a treatment effect repeatedly in business and medicine.

Why it happens

Any measured value has a true underlying level plus random noise:

observed = true_value + noise

When you select an extreme observed value (say, a very high one), you are disproportionately selecting cases where the noise happened to be positive. On the next measurement the noise is drawn afresh, with an average of zero, so the observed value tends to move back toward the true level. No intervention required.
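A small simulation can make this selection effect concrete. The sketch below assumes, purely for illustration, that true levels and noise are both standard normal and independent; the names `N_CASES`, `first`, `second` and `extreme` are hypothetical and not part of the original explanation.

```python
import random
import statistics

# Assumption for illustration: true levels and noise are both N(0, 1).
N_CASES = 20_000
rng = random.Random(0)

true_value = [rng.gauss(0.0, 1.0) for _ in range(N_CASES)]
# observed = true_value + noise, with fresh noise on each measurement
first = [t + rng.gauss(0.0, 1.0) for t in true_value]
second = [t + rng.gauss(0.0, 1.0) for t in true_value]

# Select the top 5% of cases using the first measurement only.
cutoff = sorted(first)[int(0.95 * N_CASES)]
extreme = [i for i in range(N_CASES) if first[i] >= cutoff]

mean_true = statistics.fmean([true_value[i] for i in extreme])
mean_first = statistics.fmean([first[i] for i in extreme])
mean_second = statistics.fmean([second[i] for i in extreme])
mean_noise_first = mean_first - mean_true  # average noise among selected cases

print(f"selected cases:          {len(extreme)}")
print(f"mean first measurement:  {mean_first:.2f}")
print(f"mean noise in first:     {mean_noise_first:.2f}")
print(f"mean true level:         {mean_true:.2f}")
print(f"mean second measurement: {mean_second:.2f}")
```

Under these assumptions the selected cases should show clearly positive average noise on the first measurement, and their second measurement should land near their average true level rather than near the first reading, even though nothing was done to them.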

The strength of regression to the mean depends on the correlation between the two measurements. If r is that correlation and z₁ and z₂ are the standardised first and second measurements, then under a linear relationship between them:

E[z₂ | z₁] = r · z₁

When r = 1 (perfect reliability), there is no regression. When r = 0.5, an observation 2 SDs above the mean will on average be only 1 SD above on remeasurement.
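The formula is simple enough to tabulate. The sketch below uses two hypothetical helpers: `expected_followup_z` applies E[z₂ | z₁] = r · z₁ directly, and `retest_correlation` assumes the observed = true_value + noise model with independent, equal-variance noise on both measurements, in which case the correlation between the two measurements equals the share of observed variance that comes from the true level.

```python
def expected_followup_z(z1: float, r: float) -> float:
    """Expected standardised second measurement, E[z2 | z1] = r * z1."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must be a correlation between -1 and 1")
    return r * z1


def retest_correlation(true_sd: float, noise_sd: float) -> float:
    """Correlation between two noisy measurements of the same true level.

    Assumes independent noise with the same SD on both measurements.
    """
    true_var = true_sd ** 2
    return true_var / (true_var + noise_sd ** 2)


# How far back toward the mean does a reading 2 SDs above it regress?
for r in (1.0, 0.8, 0.5, 0.0):
    print(f"r = {r:.1f}: expected second reading = {expected_followup_z(2.0, r):.1f} SDs")

# Equal true and noise SDs give a correlation of 0.5.
print(retest_correlation(true_sd=1.0, noise_sd=1.0))
```

The noisier the measurement relative to the real variation between cases, the lower r becomes and the stronger the pull back toward the mean.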

Worked example — call centre performance

A manager identifies the bottom 10 agents by call-resolution rate in January and enrolls them in a training programme. In February their rates improve significantly. Did the training work?

Not necessarily. If agents' standardised monthly performance has a month-to-month correlation of r = 0.6, an agent scoring 2 SDs below average in January is expected to score:

E[February z | January z = -2] = 0.6 × (-2) = -1.2, i.e. 1.2 SDs below average

That expected 0.8 SD improvement is pure regression to the mean: on average, it would have happened without any training.
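The same scenario can be simulated with no training at all. The sketch below is a hedged illustration, not the manager's actual data: it assumes 200 agents per call centre, standard-normal January scores, and February scores constructed to correlate with January at r = 0.6. The constants `N_AGENTS`, `N_BOTTOM` and `N_REPLICATIONS` are invented for the example.

```python
import math
import random
import statistics

R = 0.6               # assumed month-to-month correlation
N_AGENTS = 200        # hypothetical call-centre size
N_BOTTOM = 10         # agents flagged for "training"
N_REPLICATIONS = 500  # repeat to average out sampling noise

rng = random.Random(42)
jan_means, feb_means = [], []

for _ in range(N_REPLICATIONS):
    jan = [rng.gauss(0.0, 1.0) for _ in range(N_AGENTS)]
    # February is built to correlate with January at R; no training is applied.
    feb = [R * z + math.sqrt(1 - R ** 2) * rng.gauss(0.0, 1.0) for z in jan]
    # Pick the bottom performers using January scores only.
    bottom = sorted(range(N_AGENTS), key=lambda i: jan[i])[:N_BOTTOM]
    jan_means.append(statistics.fmean([jan[i] for i in bottom]))
    feb_means.append(statistics.fmean([feb[i] for i in bottom]))

avg_jan = statistics.fmean(jan_means)
avg_feb = statistics.fmean(feb_means)

print(f"bottom-10 mean z in January:  {avg_jan:.2f}")
print(f"same agents' mean z in Feb:   {avg_feb:.2f}")
print(f"apparent improvement (SDs):   {avg_feb - avg_jan:.2f}")
print(f"ratio Feb / Jan (should be near R): {avg_feb / avg_jan:.2f}")
```

Under these assumptions the flagged agents should still be below average in February, but noticeably less so, with the February-to-January ratio close to 0.6.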

Diagnosing and controlling for it

Use a control group: select equally poor performers and randomly assign only half to treatment. Compare improvement rates.
Regress on baseline: include the January score as a covariate in the analysis model.
Average multiple baseline measurements: more measurements → more reliable baseline → less regression.
Look at the top performers too: if the effect is regression, top performers will decline by a similar magnitude.
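The first check above, a randomised control group, can be sketched in a short simulation. This is an illustrative setup under stated assumptions rather than a definitive analysis: `run_trial` is a hypothetical helper, scores are standard normal with an assumed correlation of 0.6, the bottom 5% are selected, and any training effect is a simple additive shift in SD units.

```python
import math
import random
import statistics


def run_trial(true_effect, n_agents=40_000, r=0.6, seed=1):
    """Return (naive before/after estimate, control-adjusted estimate)."""
    rng = random.Random(seed)
    jan = [rng.gauss(0.0, 1.0) for _ in range(n_agents)]

    # Select the bottom 5% on January scores, then randomise half to training.
    order = sorted(range(n_agents), key=lambda i: jan[i])
    selected = order[: n_agents // 20]
    rng.shuffle(selected)
    half = len(selected) // 2
    treated, control = selected[:half], selected[half:]

    def change(i, effect):
        # February score correlates with January at r, plus any training effect.
        feb = r * jan[i] + math.sqrt(1 - r ** 2) * rng.gauss(0.0, 1.0) + effect
        return feb - jan[i]

    treated_change = statistics.fmean([change(i, true_effect) for i in treated])
    control_change = statistics.fmean([change(i, 0.0) for i in control])
    # Naive: treated group's before/after change. Adjusted: minus control's change.
    return treated_change, treated_change - control_change


naive_null, adjusted_null = run_trial(true_effect=0.0)
naive_real, adjusted_real = run_trial(true_effect=0.3)

print(f"no real effect:  naive = {naive_null:.2f}, control-adjusted = {adjusted_null:.2f}")
print(f"0.3 SD effect:   naive = {naive_real:.2f}, control-adjusted = {adjusted_real:.2f}")
```

In this setup the naive before/after comparison should report a sizeable improvement even when the true effect is zero, while subtracting the control group's change should leave an estimate close to the effect that was actually simulated.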
