Simpson's Paradox
A trend that holds in every subgroup can reverse when the groups are combined. The cause is always a lurking variable — and missing it flips your conclusion.
What you'll learn
- Why a treatment can win in every subgroup yet lose in the aggregate
- How an unevenly distributed confounder (lurking variable) creates the reversal
- Why causal structure — not the bigger number — determines the right answer
Before you start
In 1986, a study compared two kidney-stone treatments. Every doctor who looked at the per-case breakdown preferred Treatment A. The hospital administrator who looked at the summary table preferred Treatment B. They were both reading the same data. One of them was wrong — and the mistake is not obvious until you understand what a lurking variable can do.
The numbers that started a debate
Patients were split by stone size — small or large — and treated with either Treatment A (open surgery) or Treatment B (percutaneous nephrolithotomy, a less invasive procedure). Here are the cure counts:
| Group | Treatment A | Treatment B |
|---|---|---|
| Small stones | 81 cured / 87 patients (93%) | 234 cured / 270 patients (87%) |
| Large stones | 192 cured / 263 patients (73%) | 55 cured / 80 patients (69%) |
| Combined | 273 cured / 350 patients (78%) | 289 cured / 350 patients (83%) |
Read the subgroups: A beats B for small stones (93% vs 87%) and for large stones (73% vs 69%). Now read the combined row: B beats A overall (83% vs 78%).
A wins in every subgroup. B wins in the aggregate. Both statements are arithmetically true. This is Simpson’s paradox.
Why the reversal happens
The reversal is not a calculation error. It comes from one fact buried in the patient counts: the two treatments did not get the same mix of cases.
- Treatment A received 263 large-stone patients out of its 350 total — 75% hard cases.
- Treatment B received 270 small-stone patients out of its 350 total — 77% easy cases.
Stone size is the confounder — a variable that influences both which treatment a patient received (doctors rightly gave the safer procedure to easier cases) and the outcome (large stones are harder to cure regardless of treatment).
When you pool the groups without accounting for stone size, Treatment B’s score gets inflated by the sheer number of easy cases it handled. The aggregate mixes apples and oranges, then shows you the average color.
Figure (grouped bar chart of cure rates by stone size and treatment): within each stone-size group, Treatment A’s bar is taller. In the combined view (faded bars), the order flips because Treatment B handled far more easy cases.
The mental model: weighted averages hide who did the hard work
Think of each treatment’s overall rate as a weighted average of its subgroup rates, where the weights are the fraction of patients in each subgroup.
Treatment A’s overall rate:
(81/87) * (87/350) + (192/263) * (263/350)
= 0.931 * 0.249 + 0.730 * 0.751
= 0.232 + 0.548
= 0.780 → 78%
Treatment B’s overall rate:
(234/270) * (270/350) + (55/80) * (80/350)
= 0.867 * 0.771 + 0.688 * 0.229
= 0.668 + 0.157
= 0.826 → 83%
A carries 75% of its weight on the hard subgroup (large stones, 73% cure rate). B carries 77% of its weight on the easy subgroup (small stones, 87% cure rate). The aggregate just reflects who got the easier assignment — not which treatment is more effective.
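The same decomposition can be written generally. The notation is introduced here for illustration: $r_{T,g}$ is the cure rate of treatment $T$ in subgroup $g$, $n_{T,g}$ is the number of patients in that cell, and $w_{T,g} = n_{T,g} / n_T$ is the fraction of treatment $T$'s patients who fall in subgroup $g$.

$$
r_T = \sum_g w_{T,g}\, r_{T,g}, \qquad \sum_g w_{T,g} = 1
$$

If both treatments shared the same weights, then $r_{A,g} > r_{B,g}$ for every $g$ would force $r_A > r_B$. The reversal is only possible because $w_{A,g} \neq w_{B,g}$.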
Verify it yourself
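A minimal Python sketch that recomputes the table from the raw counts. The dictionary layout and the helper name `rate` are illustrative choices, not part of the original study.

```python
# (cured, patients) per treatment and stone size, copied from the table above.
counts = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}


def rate(cured, patients):
    """Cure rate as a fraction between 0 and 1."""
    return cured / patients


# Cure rate within each stone-size subgroup.
subgroup = {
    t: {size: rate(*cp) for size, cp in groups.items()}
    for t, groups in counts.items()
}

# Total patients and pooled cure rate per treatment.
totals = {t: sum(p for _, p in groups.values()) for t, groups in counts.items()}
overall = {
    t: rate(sum(c for c, _ in groups.values()), totals[t])
    for t, groups in counts.items()
}

# Case mix: share of each treatment's patients in each stone-size group.
mix = {
    t: {size: p / totals[t] for size, (_, p) in groups.items()}
    for t, groups in counts.items()
}

for size in ("small", "large"):
    a, b = subgroup["A"][size], subgroup["B"][size]
    print(f"{size:>5} stones: A {a:.1%} vs B {b:.1%} -> {'A' if a > b else 'B'} wins")

winner = "A" if overall["A"] > overall["B"] else "B"
print(f"combined: A {overall['A']:.1%} vs B {overall['B']:.1%} -> {winner} wins")
print(f"A case mix: {mix['A']['large']:.0%} large stones")
print(f"B case mix: {mix['B']['small']:.0%} small stones")
```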
Run it. You will see:
- A wins both subgroups.
- B wins overall.
- A was assigned 75% hard cases; B was assigned 77% easy cases.
The right answer depends on causal structure
Simpson’s paradox forces a question the data alone cannot answer: which number should I act on?
The answer depends on why the confounder was distributed the way it was.
Case 1 — the confounder is a pre-treatment difference (as here). Doctors chose the treatment based on stone size, which was fixed before any treatment was given. Stone size is a true confounder: a common cause of both “which treatment” and “outcome,” not a step on the causal path between them. The subgroup rates are the right answer. You should prefer A because it performs better on a like-for-like comparison. The aggregate is misleading.
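One way to make the like-for-like comparison concrete is direct standardization: apply each treatment's subgroup rates to the same reference case mix. The sketch below uses the pooled stone-size distribution of all 700 patients as the reference. That choice is an assumption made for illustration, and the function name `standardized_rate` is hypothetical.

```python
# (cured, patients) per treatment and stone size, from the table above.
counts = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}

# Reference case mix (assumption): stone-size shares across all patients.
size_totals = {
    size: sum(counts[t][size][1] for t in counts) for size in ("small", "large")
}
n_all = sum(size_totals.values())
reference_mix = {size: n / n_all for size, n in size_totals.items()}


def standardized_rate(groups, mix):
    """Weighted average of subgroup cure rates using a shared case mix."""
    return sum(mix[size] * cured / patients
               for size, (cured, patients) in groups.items())


adjusted = {t: standardized_rate(groups, reference_mix)
            for t, groups in counts.items()}

for t, r in adjusted.items():
    print(f"Treatment {t}, standardized cure rate: {r:.1%}")
```

With both treatments facing the same mix (51% small stones, 49% large), A comes out ahead at roughly 83% against roughly 78% for B, which matches the subgroup comparison.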
Case 2 — the stratifying variable is caused by the treatment. Suppose the treatment itself changed how patients were categorized afterward (e.g., a drug changes a biomarker that was then used to re-stratify patients). The variable is now a mediator on the causal path from treatment to outcome, not a confounder. Then the subgroup rates are the misleading ones, because stratifying adjusts away part of the treatment’s effect. The aggregate is closer to correct.
There is no mechanical rule. You need a causal diagram, not a bigger dataset or a smarter average.
Where this shows up in ML and data work
- Model evaluation split by subgroup vs. aggregate accuracy — a model can beat the pooled majority-class baseline on overall accuracy while trailing each demographic subgroup's own majority-class baseline, if the subgroups have different base rates.
- A/B testing with unequal traffic splits across segments — if high-converting segments receive more of one variant, the aggregate conversion rate is a biased comparison.
- Feature importance in pooled vs. stratified models — a feature that appears predictive in the aggregate can be a proxy for the confounder, not a causal signal.
Whenever you see a “combined” number, ask: are the groups being combined exchangeable? If one group is systematically harder, the aggregate is not a fair average — it is a weighted sum that reflects the case mix, not the treatment effect.
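As a rough guard when reporting a combined number, one can check whether the aggregate ordering of two arms is the opposite of the ordering in every subgroup. The helper below, `simpson_reversal`, is a hypothetical name and a minimal sketch. It is demonstrated on the kidney-stone counts from earlier rather than on real A/B data.

```python
def simpson_reversal(counts, a="A", b="B"):
    """Return True if one arm leads in every subgroup but trails in the aggregate.

    counts maps arm -> {subgroup: (successes, trials)}.
    """
    groups = list(counts[a])

    def sign(x):
        # +1 if positive, -1 if negative, 0 if exactly tied.
        return (x > 0) - (x < 0)

    def pooled_rate(arm):
        successes = sum(counts[arm][g][0] for g in groups)
        trials = sum(counts[arm][g][1] for g in groups)
        return successes / trials

    # Direction of (a - b) within each subgroup.
    subgroup_signs = set()
    for g in groups:
        sa, na = counts[a][g]
        sb, nb = counts[b][g]
        subgroup_signs.add(sign(sa / na - sb / nb))

    # All subgroups must agree on one nonzero direction.
    if len(subgroup_signs) != 1:
        return False
    direction = subgroup_signs.pop()

    # Reversal: the pooled comparison points the opposite way.
    return direction != 0 and sign(pooled_rate(a) - pooled_rate(b)) == -direction


kidney = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}
print(simpson_reversal(kidney))  # True
```

The check only flags the pattern. Deciding whether the subgroup rates or the aggregate is the right number still requires the causal reasoning described above.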
Next
Correlation vs. causation — why two variables moving together does not mean one causes the other, and how confounders are the most common reason for spurious correlations.
Practice this in an interview
Simpson's paradox occurs when a trend that appears in several subgroups disappears or reverses when those subgroups are combined. It arises because a lurking variable (the grouping variable, which is associated with both treatment and outcome) is distributed unevenly across the treatments, so the aggregate reflects the case mix rather than the treatment effect.
Simpson's paradox occurs when a trend present in every subgroup reverses or disappears in the aggregate, because the subgroup sizes differ between treatment and control. In A/B tests it most often appears when the randomization is imbalanced across a strong confounding variable.
A confounding variable is associated with both the treatment and the outcome, creating a spurious apparent relationship between them. Controlling for confounders — through randomisation, stratification, regression adjustment, or matching — is essential to recover a valid causal estimate.
Correlation measures the strength of a linear relationship between two variables, but a shared cause, reverse causation, or coincidence can all produce correlation without any causal link. Treating correlation as causation leads to interventions that fail or cause harm.