datarekha

Simpson's Paradox

A trend that holds in every subgroup can reverse when the groups are combined. The cause is a lurking variable that is distributed unevenly across the groups, and missing it flips your conclusion.

8 min read Intermediate Math for ML Lesson 17 of 18

What you'll learn

  • Why a treatment can win in every subgroup yet lose in the aggregate
  • How an unevenly distributed confounder (lurking variable) creates the reversal
  • Why causal structure — not the bigger number — determines the right answer

Before you start

In 1986, a study compared two kidney-stone treatments. Every doctor who looked at the per-case breakdown preferred Treatment A. The hospital administrator who looked at the summary table preferred Treatment B. They were both reading the same data. One of them was wrong — and the mistake is not obvious until you understand what a lurking variable can do.

The numbers that started a debate

Patients were split by stone size — small or large — and treated with either Treatment A (open surgery) or Treatment B (percutaneous nephrolithotomy, a less invasive procedure). Here are the cure counts:

Group        | Treatment A                    | Treatment B
Small stones | 81 cured / 87 patients (93%)   | 234 cured / 270 patients (87%)
Large stones | 192 cured / 263 patients (73%) | 55 cured / 80 patients (69%)
Combined     | 273 cured / 350 patients (78%) | 289 cured / 350 patients (83%)

Read the subgroups: A beats B for small stones (93% vs 87%) and for large stones (73% vs 69%). Now read the combined row: B beats A overall (83% vs 78%).

A wins in every subgroup. B wins in the aggregate. Both statements are arithmetically true. This is Simpson’s paradox.

Why the reversal happens

The reversal is not a calculation error. It comes from one fact buried in the patient counts: the two treatments did not get the same mix of cases.

  • Treatment A received 263 large-stone patients out of its 350 total — 75% hard cases.
  • Treatment B received 270 small-stone patients out of its 350 total — 77% easy cases.

Stone size is the confounder — a variable that influences both which treatment a patient received (doctors rightly gave the safer procedure to easier cases) and the outcome (large stones are harder to cure regardless of treatment).

When you pool the groups without accounting for stone size, Treatment B’s score gets inflated by the sheer number of easy cases it handled. The aggregate mixes apples and oranges, then shows you the average color.

Cure rate by group — the reversal made visible

[Bar chart: cure rate (0 to 100%) for Treatment A and Treatment B in three groups. Small stones: A 93%, B 87% (A wins). Large stones: A 73%, B 69% (A wins). Combined: A 78%, B 83% (B wins, the reversal).]

Within each stone-size group, Treatment A’s bar is taller. In the combined view (faded bars), the order flips because Treatment B handled far more easy cases.

The mental model: weighted averages hide who did the hard work

Think of each treatment’s overall rate as a weighted average of its subgroup rates, where the weights are the fraction of patients in each subgroup.

Treatment A’s overall rate:

(81/87) * (87/350)  +  (192/263) * (263/350)
=  0.931 * 0.249    +  0.730 * 0.751
=  0.232            +  0.548
=  0.780   →  78%

Treatment B’s overall rate:

(234/270) * (270/350)  +  (55/80) * (80/350)
=  0.867 * 0.771       +  0.688 * 0.229
=  0.669               +  0.157
=  0.826   →  83%

A carries 75% of its weight on the hard subgroup (large stones, 73% cure rate). B carries 77% of its weight on the easy subgroup (small stones, 87% cure rate). The aggregate just reflects who got the easier assignment — not which treatment is more effective.
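The same decomposition can be written in general form. The notation here is an assumption introduced for illustration, not taken from the original study: T is a treatment, g a subgroup, c the number cured, and n the number of patients.

```latex
r_T \;=\; \frac{\sum_g c_{T,g}}{\sum_g n_{T,g}}
    \;=\; \sum_g
      \underbrace{\frac{c_{T,g}}{n_{T,g}}}_{r_{T,g}}
      \cdot
      \underbrace{\frac{n_{T,g}}{n_T}}_{w_{T,g}},
\qquad n_T = \sum_g n_{T,g},
\qquad \sum_g w_{T,g} = 1 .
```

If both treatments had the same weights (w_{A,g} = w_{B,g} for every g), then A winning every subgroup would force A to win overall. The reversal is only possible because the weights differ between treatments.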

Verify it yourself

Recompute the rates from the raw counts in the table. You will see:

  • A wins both subgroups.
  • B wins overall.
  • A was assigned 75% hard cases; B was assigned 77% easy cases.
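One way to run this check is the short Python sketch below. It uses only the counts from the table. The variable and function names (`counts`, `rate`, `overall`) are illustrative choices, not part of any library.

```python
# Cure counts from the table: (cured, patients) per treatment and stone size.
counts = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}


def rate(cured, patients):
    """Cure rate as a fraction between 0 and 1."""
    return cured / patients


def overall(treatment):
    """Pooled (cured, patients) for one treatment across both stone sizes."""
    cured = sum(c for c, _ in counts[treatment].values())
    patients = sum(n for _, n in counts[treatment].values())
    return cured, patients


# Subgroup comparison: which treatment has the higher cure rate per stone size?
for size in ("small", "large"):
    a = rate(*counts["A"][size])
    b = rate(*counts["B"][size])
    winner = "A" if a > b else "B"
    print(f"{size:>5} stones: A = {a:.1%}, B = {b:.1%}, winner = {winner}")

# Aggregate comparison: pool the counts first, then compute the rate.
a_all = rate(*overall("A"))
b_all = rate(*overall("B"))
print(f"combined: A = {a_all:.1%}, B = {b_all:.1%}, "
      f"winner = {'A' if a_all > b_all else 'B'}")

# Case mix: share of hard cases for A and share of easy cases for B.
a_large_share = counts["A"]["large"][1] / overall("A")[1]
b_small_share = counts["B"]["small"][1] / overall("B")[1]
print(f"A: {a_large_share:.0%} large-stone patients")
print(f"B: {b_small_share:.0%} small-stone patients")
```

The pooled rates are computed from summed counts rather than by averaging the two subgroup percentages, which is exactly how the "Combined" row of the table is built.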

The right answer depends on causal structure

Simpson’s paradox forces a question the data alone cannot answer: which number should I act on?

The answer depends on why the confounder was distributed the way it was.

Case 1: the confounder is a pre-treatment difference (as here). Doctors assigned treatments based on stone size, which was fixed before any treatment was given. Stone size is a true confounder: a common cause of both the treatment choice and the outcome, not a step on the causal path from treatment to outcome. The subgroup rates are the right answer. You should prefer A because it performs better on a like-for-like comparison. The aggregate is misleading.

Case 2: the grouping variable is caused by the treatment. Suppose the treatment itself changed how patients were categorized afterwards (e.g., a drug changes a biomarker that was then used to re-stratify patients). The variable is then a mediator on the causal path rather than a confounder, and the subgroup rates are the misleading ones, because conditioning on it adjusts away part of the treatment's effect. The aggregate is closer to correct.

There is no mechanical rule. You need a causal diagram, not a bigger dataset or a smarter average.
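For Case 1, once the causal diagram says stone size should be adjusted for, one common way to get a single like-for-like number is direct standardization: apply each treatment's subgroup rates to the same case mix. The sketch below assumes stone size is the only confounder and uses the pooled mix of all 700 patients as the reference population. Both are assumptions made for illustration, and the names `mix` and `standardized_rate` are hypothetical.

```python
# Cure counts from the table: (cured, patients) per treatment and stone size.
counts = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}
sizes = ("small", "large")

# Reference case mix: share of each stone size among all patients,
# regardless of which treatment they received (an assumed choice).
total_patients = sum(n for by_size in counts.values() for _, n in by_size.values())
mix = {
    size: sum(counts[t][size][1] for t in counts) / total_patients
    for size in sizes
}


def standardized_rate(treatment):
    """Cure rate the treatment would show if it had faced the reference mix."""
    return sum(
        (counts[treatment][size][0] / counts[treatment][size][1]) * mix[size]
        for size in sizes
    )


for t in ("A", "B"):
    print(f"Treatment {t}: standardized cure rate = {standardized_rate(t):.1%}")
```

With both treatments weighted by the same mix (51% small, 49% large), A comes out at about 83.3% and B at about 77.9%, so the ordering matches the subgroup comparison. This adjustment is not appropriate in Case 2, where the grouping variable is a consequence of the treatment.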

Where this shows up in ML and data work

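  • Model evaluation split by subgroup vs. aggregate accuracy: a model can show higher overall accuracy than a baseline while being worse in every demographic subgroup, but only when the two were evaluated on different subgroup mixes (for example, different test sets or time periods). On a single shared test set the weights are identical, so the reversal cannot occur.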
  • A/B testing with unequal traffic splits across segments: if high-converting segments receive more of one variant, the aggregate conversion rate is a biased comparison.
  • Feature importance in pooled vs. stratified models — a feature that appears predictive in the aggregate can be a proxy for the confounder, not a causal signal.

Whenever you see a “combined” number, ask: are the groups being combined exchangeable? If one group is systematically harder, the aggregate is not a fair average — it is a weighted sum that reflects the case mix, not the treatment effect.
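A quick screening check for this situation can be sketched as a small function that compares segment-level winners with the pooled winner. The function name `has_simpson_reversal` and the input layout are hypothetical, and the balanced A/B numbers in the second example are made up purely to show the negative case.

```python
def rate(successes, trials):
    return successes / trials


def has_simpson_reversal(segments):
    """Return True when one variant wins every segment but loses when pooled.

    `segments` maps segment name -> {variant name: (successes, trials)}
    and must contain exactly two variants. Ties count as "no reversal".
    """
    first = next(iter(segments.values()))
    v1, v2 = sorted(first)

    # Who wins inside each segment?
    segment_winners = set()
    for by_variant in segments.values():
        r1, r2 = rate(*by_variant[v1]), rate(*by_variant[v2])
        if r1 == r2:
            return False
        segment_winners.add(v1 if r1 > r2 else v2)
    if len(segment_winners) != 1:
        return False  # segments disagree, so there is no uniform trend to reverse

    # Who wins after pooling the raw counts?
    pooled = {}
    for v in (v1, v2):
        successes = sum(by_variant[v][0] for by_variant in segments.values())
        trials = sum(by_variant[v][1] for by_variant in segments.values())
        pooled[v] = rate(successes, trials)
    if pooled[v1] == pooled[v2]:
        return False
    pooled_winner = v1 if pooled[v1] > pooled[v2] else v2
    return pooled_winner not in segment_winners


# Kidney-stone counts from this lesson: reversal present.
kidney = {
    "small": {"A": (81, 87), "B": (234, 270)},
    "large": {"A": (192, 263), "B": (55, 80)},
}
print(has_simpson_reversal(kidney))  # True

# Made-up A/B test with equal traffic per segment: no reversal.
balanced = {
    "mobile": {"control": (10, 100), "variant": (12, 100)},
    "desktop": {"control": (30, 100), "variant": (33, 100)},
}
print(has_simpson_reversal(balanced))  # False
```

A True result only flags that the segment view and the pooled view disagree. Deciding which view to act on still requires the causal reasoning described above.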

Next

Correlation vs. causation — why two variables moving together does not mean one causes the other, and how confounders are the most common reason for spurious correlations.

Quick check

Q1. In the kidney-stone study, Treatment A cures 93% of small-stone cases and 73% of large-stone cases, yet its overall rate is only 78%. What is the main reason the overall rate is pulled down so far?
Q2. A school reports that girls outperform boys in every individual department (Science, Arts, Commerce) but boys have a higher overall grade average. What single fact would most likely explain this pattern?
Q3. A data scientist notices that, aggregated across all users, Feature X has a strong positive correlation with churn. But when users are split by account age (new vs. veteran), Feature X shows no correlation with churn in either group. What is the most likely cause?
