What is Simpson's paradox? Walk through a concrete example.

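Simpson's paradox occurs when a trend that appears in several subgroups disappears or reverses when those subgroups are combined. It arises because a lurking variable is correlated with both treatment and outcome, so the groups being compared contain different mixes of subgroups and the aggregate is distorted. In a 1986 kidney-stone study, Treatment A had the higher cure rate for both small stones (93% vs 87%) and large stones (73% vs 69%), yet Treatment B had the higher combined rate (83% vs 78%) because it was given mostly small-stone cases.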

How can Simpson's paradox affect A/B test results, and how do you detect and resolve it?

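Simpson's paradox occurs when a trend present in every subgroup reverses or disappears in the aggregate, because the subgroup proportions differ between treatment and control. In A/B tests it most often appears when the randomization is imbalanced across a strong confounding variable. To detect it, compare segment-level results with the aggregate and check whether the segment mix differs between variants. To resolve it, compare like with like within segments, or reweight both variants to a common segment mix before pooling.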

What is a confounding variable, and how do you control for it?

A confounding variable is associated with both the treatment and the outcome, creating a spurious apparent relationship between them. Controlling for confounders — through randomisation, stratification, regression adjustment, or matching — is essential to recover a valid causal estimate.

What is the difference between correlation and causation, and why does the distinction matter?

Correlation measures the strength of a linear relationship between two variables, but a shared cause, reverse causation, or coincidence can all produce correlation without any causal link. Treating correlation as causation leads to interventions that fail or cause harm.
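The shared-cause case can be sketched with a small simulation. Everything here is hypothetical: the variables `x`, `y`, and `z`, the noise levels, and the seed are invented for illustration. `z` drives both `x` and `y`, and `x` never influences `y`, yet the two end up correlated (about 0.5 in theory for this setup).

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 10_000

# Hypothetical shared cause z drives both x and y; x never touches y.
z = [rng.gauss(0, 1) for _ in range(n)]
x = [zi + rng.gauss(0, 1) for zi in z]
y = [zi + rng.gauss(0, 1) for zi in z]

raw_corr = pearson(x, y)

# Remove the shared cause (known here by construction) and correlate what is left.
x_resid = [xi - zi for xi, zi in zip(x, z)]
y_resid = [yi - zi for yi, zi in zip(y, z)]
resid_corr = pearson(x_resid, y_resid)

print("corr(x, y)            =", round(raw_corr, 2))
print("corr after removing z =", round(resid_corr, 2))
```

Once the shared cause is subtracted out, the remaining correlation should sit near zero. In real data the shared cause is rarely known this cleanly, which is why the simulation is only a sketch of the idea.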

Simpson's Paradox — Math for ML

The last lesson left the whole chapter under suspicion. We spent it turning relationships into trustworthy numbers — covariance, correlation, mutual information, KL — and then warned that a hidden variable can make any of them lie. This lesson is that warning made vivid, with real data and real stakes.

In 1986, a study compared two kidney-stone treatments. Every doctor who looked at the per-case breakdown preferred Treatment A. The hospital administrator who looked at the summary table preferred Treatment B. They were reading the same data. One of them was wrong — and the mistake is invisible until you see what a lurking variable can do.

The numbers that started a debate

Patients were split by stone size — small or large — and treated with either Treatment A (open surgery) or Treatment B (percutaneous nephrolithotomy, a less invasive procedure). Here are the cure counts:

Group	Treatment A	Treatment B
Small stones	81 cured / 87 patients (93%)	234 cured / 270 patients (87%)
Large stones	192 cured / 263 patients (73%)	55 cured / 80 patients (69%)
Combined	273 cured / 350 patients (78%)	289 cured / 350 patients (83%)

Read the subgroups: A beats B for small stones (93% vs 87%) and for large stones (73% vs 69%). Now read the combined row: B beats A overall (83% vs 78%).

A wins in every subgroup. B wins in the aggregate. Both statements are arithmetically true. This is Simpson’s paradox.

Why the reversal happens

It is not a calculation error. It comes from one fact buried in the patient counts: the two treatments did not get the same mix of cases.

Treatment A received 263 large-stone patients of its 350 total — 75% hard cases.
Treatment B received 270 small-stone patients of its 350 total — 77% easy cases.

Stone size is the confounder: a variable that influences both which treatment a patient received (doctors rightly gave the safer procedure to easier cases) and the outcome (large stones are harder to cure regardless of treatment). Pool the groups without accounting for it, and Treatment B’s score gets inflated by the sheer number of easy cases it handled. The aggregate mixes apples and oranges, then reports the average color.

Cure rate by group — the reversal made visible

[Figure: grouped bar chart of cure rate (0% to 100%) for Treatment A and Treatment B. Small stones: 93% vs 87%, A wins. Large stones: 73% vs 69%, A wins. Combined: 78% vs 83%, B wins (the reversal).]

Within each stone-size group, Treatment A’s bar is taller. In the combined view (faded bars), the order flips because Treatment B handled far more easy cases.

The mental model: weighted averages hide who did the hard work

Think of each treatment’s overall rate as a weighted average of its subgroup rates, where the weights are the fraction of patients in each subgroup.

Treatment A’s overall rate:

(81/87) * (87/350)  +  (192/263) * (263/350)
=  0.931 * 0.249    +  0.730 * 0.751
=  0.232            +  0.548
=  0.780   →  78%

Treatment B’s overall rate:

(234/270) * (270/350)  +  (55/80) * (80/350)
=  0.867 * 0.771       +  0.688 * 0.229
=  0.668               +  0.157
=  0.826   →  83%

A carries 75% of its weight on the hard subgroup (large stones, 73% cure rate). B carries 77% of its weight on the easy subgroup (small stones, 87% cure rate). The aggregate just reflects who got the easier assignment — not which treatment is more effective.
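One way to see what the weights are doing is to give both treatments the same weights. The sketch below reuses the counts from the table and reweights each treatment's subgroup rates to a common case mix. Choosing the pooled mix of all 700 patients as the reference is an assumption of this sketch, and the names `common_mix` and `standardized_rate` are illustrative.

```python
# Counts from the kidney-stone table: (cured, total) per treatment and stone size.
counts = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}

# Common case mix: the share of each stone size among all 700 patients.
# Using the pooled mix as the reference is an assumption of this sketch.
size_totals = {
    size: sum(counts[t][size][1] for t in counts) for size in ("small", "large")
}
n_all = sum(size_totals.values())
common_mix = {size: total / n_all for size, total in size_totals.items()}

def standardized_rate(treatment):
    """Weighted average of subgroup cure rates using the common case mix."""
    return sum(
        (cured / total) * common_mix[size]
        for size, (cured, total) in counts[treatment].items()
    )

a_std = standardized_rate("A")
b_std = standardized_rate("B")
print("Common mix:", {s: round(w, 3) for s, w in common_mix.items()})
print("A standardized:", round(a_std * 100, 1), "%")
print("B standardized:", round(b_std * 100, 1), "%")
```

With identical weights (51% small, 49% large), A comes out ahead at roughly 83% against roughly 78% for B, which matches the subgroup comparison. A different reference mix would change the exact numbers but not the ordering, since A has the higher rate in both subgroups.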

Verify it yourself

# Kidney-stone treatment — Simpson's Paradox
# Treatment A
a_small_cured, a_small_total = 81, 87
a_large_cured, a_large_total = 192, 263

# Treatment B
b_small_cured, b_small_total = 234, 270
b_large_cured, b_large_total = 55, 80

# Subgroup rates
a_small_rate = a_small_cured / a_small_total
a_large_rate = a_large_cured / a_large_total
b_small_rate = b_small_cured / b_small_total
b_large_rate = b_large_cured / b_large_total

# Overall (pooled) rates
a_total_cured = a_small_cured + a_large_cured
a_total_n     = a_small_total + a_large_total
b_total_cured = b_small_cured + b_large_cured
b_total_n     = b_small_total + b_large_total

a_overall = a_total_cured / a_total_n
b_overall = b_total_cured / b_total_n

print("--- Subgroup cure rates ---")
print("Small stones:  A = " + str(round(a_small_rate * 100, 1)) + "%   B = " + str(round(b_small_rate * 100, 1)) + "%")
print("Large stones:  A = " + str(round(a_large_rate * 100, 1)) + "%   B = " + str(round(b_large_rate * 100, 1)) + "%")
print("")
print("--- Who wins each subgroup? ---")
print("Small: " + ("A" if a_small_rate > b_small_rate else "B"))
print("Large: " + ("A" if a_large_rate > b_large_rate else "B"))
print("")
print("--- Overall (pooled) cure rates ---")
print("A overall: " + str(a_total_cured) + "/" + str(a_total_n) + " = " + str(round(a_overall * 100, 1)) + "%")
print("B overall: " + str(b_total_cured) + "/" + str(b_total_n) + " = " + str(round(b_overall * 100, 1)) + "%")
print("Overall winner: " + ("A" if a_overall > b_overall else "B"))
print("")
print("--- Why? Case-mix (lurking variable = stone size) ---")
a_pct_large = a_large_total / a_total_n * 100
b_pct_small = b_small_total / b_total_n * 100
print("A got " + str(round(a_pct_large, 0)) + "% large-stone (hard) cases")
print("B got " + str(round(b_pct_small, 0)) + "% small-stone (easy) cases")

--- Subgroup cure rates ---
Small stones:  A = 93.1%   B = 86.7%
Large stones:  A = 73.0%   B = 68.8%

--- Who wins each subgroup? ---
Small: A
Large: A

--- Overall (pooled) cure rates ---
A overall: 273/350 = 78.0%
B overall: 289/350 = 82.6%
Overall winner: B

--- Why? Case-mix (lurking variable = stone size) ---
A got 75.0% large-stone (hard) cases
B got 77.0% small-stone (easy) cases

A wins both subgroups; B wins overall; A was handed 75% hard cases while B got 77% easy ones. The arithmetic is flawless and the conclusion still flips.

The right answer depends on causal structure

Simpson’s paradox forces a question the data alone cannot answer: which number should I act on? And that depends on why the confounder was distributed the way it was.

Case 1: the splitting variable is a pre-treatment difference (as here). Doctors chose treatments partly by stone size, which was fixed before any treatment began. Stone size is a common cause of both “which treatment” and “outcome”; it does not sit on the causal path from treatment to outcome, so it is a true confounder. The subgroup rates are the right answer: prefer A, which wins the like-for-like comparison. The aggregate is the misleading one.

Case 2: the splitting variable is caused by the treatment. Suppose the treatment itself changed how patients were categorized afterward (a drug shifts a biomarker later used to re-stratify them). That variable is then a mediator on the causal path, not a confounder. Splitting by it adjusts away part of the treatment’s own effect, and the aggregate is closer to correct.
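The second case can be made concrete with a toy model. All numbers below are invented for illustration: treatment is assumed to be randomized and to work only by shifting a biomarker `M`, and the cure rate depends on `M` alone.

```python
# Hypothetical numbers, invented for illustration only.
# Treatment is randomized; it works solely by shifting a biomarker M.
p_m_given_t = {"treated": 0.8, "control": 0.2}   # P(M = high | arm)
p_cure_given_m = {"high": 0.9, "low": 0.3}       # P(cure | M), same in both arms

def aggregate_cure_rate(arm):
    """Overall cure rate for an arm, averaging over the biomarker it induces."""
    p_high = p_m_given_t[arm]
    return p_high * p_cure_given_m["high"] + (1 - p_high) * p_cure_given_m["low"]

def stratum_cure_rate(arm, m_level):
    """Cure rate within one biomarker stratum; the arm no longer matters here."""
    return p_cure_given_m[m_level]

total_effect = aggregate_cure_rate("treated") - aggregate_cure_rate("control")
stratified_effects = {
    m: stratum_cure_rate("treated", m) - stratum_cure_rate("control", m)
    for m in ("high", "low")
}
print("Aggregate effect:", round(total_effect, 2))
print("Effect within each biomarker stratum:", stratified_effects)
```

In this toy model the aggregate shows a 36-point improvement from treatment, while the comparison within each biomarker stratum shows none. Stratifying on a variable the treatment itself moves has erased the effect it was supposed to measure.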

There is no mechanical rule. You need a causal diagram — not a bigger dataset and not a smarter average. The bigger number is not automatically the truer one.

Where this shows up in ML and data work

Model evaluation, subgroup vs. aggregate: a model can beat a baseline on overall accuracy while losing to it in every demographic subgroup, when the two were scored on evaluation sets with different subgroup mixes and the subgroups differ in difficulty.
A/B testing with unequal segment splits: if high-converting segments receive more of one variant, the aggregate conversion comparison is biased (exactly the confounding the A/B-testing lesson warned about).
Feature importance, pooled vs. stratified: a feature that looks predictive in aggregate can be a proxy for the confounder, not a causal signal.

Whenever you meet a “combined” number, ask: are the groups being combined exchangeable? If one group is systematically harder, the aggregate is not a fair average — it is a weighted sum that reflects the case mix.
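That question can be turned into a routine check. The helper below is a minimal sketch, not a standard library function: `simpson_check` and its input layout are hypothetical, and it assumes every segment has at least one trial per variant. It compares two variants within each segment and in the pooled data, then flags a reversal.

```python
def simpson_check(segments):
    """Compare two variants per segment and in aggregate.

    segments maps a segment name to {"A": (successes, trials), "B": (successes, trials)}.
    Returns (per-segment winners, pooled winner, reversal flag).
    """
    def rate(pair):
        successes, trials = pair
        return successes / trials

    def winner(a, b):
        if rate(a) == rate(b):
            return "tie"
        return "A" if rate(a) > rate(b) else "B"

    per_segment = {name: winner(s["A"], s["B"]) for name, s in segments.items()}
    pooled = {
        v: (
            sum(s[v][0] for s in segments.values()),
            sum(s[v][1] for s in segments.values()),
        )
        for v in ("A", "B")
    }
    pooled_winner = winner(pooled["A"], pooled["B"])
    segment_winners = set(per_segment.values())
    # Reversal: every segment agrees on one variant, and the pooled data picks the other.
    reversal = (
        len(segment_winners) == 1
        and "tie" not in segment_winners
        and pooled_winner not in segment_winners
        and pooled_winner != "tie"
    )
    return per_segment, pooled_winner, reversal

# Reusing the kidney-stone counts as stand-in "segments".
kidney = {
    "small": {"A": (81, 87), "B": (234, 270)},
    "large": {"A": (192, 263), "B": (55, 80)},
}
per_segment, pooled_winner, reversal = simpson_check(kidney)
print(per_segment, pooled_winner, reversal)
```

On the kidney-stone counts the check reports A as the winner in both segments, B as the pooled winner, and a reversal. A flag like this only says the two views disagree; deciding which view to trust still requires the causal reasoning described earlier.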

In one breath

Simpson’s paradox is a trend that holds in every subgroup yet reverses when the groups are pooled — A beats B for small stones (93% vs 87%) and large stones (73% vs 69%), yet B wins the combined rate (83% vs 78%). No arithmetic is wrong; the cause is a confounder (here, stone size) unevenly split across groups: each overall rate is a weighted average of subgroup rates, and B’s average is inflated because it handled 77% easy cases while A handled 75% hard ones. Which number to trust depends on causal structure — if the confounder is pre-treatment, the subgroup rates are right (prefer A); if the treatment causes the split, the aggregate is. The fix is a causal diagram, not a bigger dataset — and it bites in model evaluation, A/B tests, and feature analysis alike.

Practice

Quick check

Q1. In the kidney-stone study, Treatment A cures 93% of small-stone cases and 73% of large-stone cases, yet its overall rate is only 78%. What is the main reason the overall rate is pulled down so far?

Q2. A school reports that girls outperform boys in every individual department (Science, Arts, Commerce) but boys have a higher overall grade average. What single fact would most likely explain this pattern?

Q3. A data scientist notices that, aggregated across all users, Feature X has a strong positive correlation with churn. But when users are split by account age (new vs. veteran), Feature X shows no correlation with churn in either group. What is the most likely cause?

A question to carry forward

Simpson’s paradox closes the probability-and-statistics arc on a humbling note: the arithmetic was perfect, every division correct to the last digit, and the conclusion still betrayed us — because of something we failed to account for, a variable hiding in the structure of the data. Hold onto that shape — correct computation, wrong answer — because the final stretch of this course is about a second, entirely different way it happens.

Every formula in every lesson so far assumed the machine would simply do the math. It will not, not exactly. A computer cannot store most real numbers; it keeps a finite handful of digits and rounds the rest away, and those crumbs of rounding can avalanche — a subtraction that cancels to noise, a sum that loses its small terms, a softmax that overflows to infinity. So here is the thread into the home stretch: what is numerical stability, why does honest mathematics produce dishonest answers in floating-point, and which everyday operations in ML are quietly rewritten — the log-sum-exp trick, the “+1e-9” you keep seeing — to keep the arithmetic from lying to you?
