How can Simpson's paradox affect A/B test results, and how do you detect and resolve it?
Simpson's paradox occurs when a trend present in every subgroup reverses or disappears in the aggregate, because the subgroup sizes differ between treatment and control. In A/B tests it most often appears when the two arms end up with different mixes of a variable that strongly affects the metric, such as device type.
How to think about it
A concrete example
Suppose you are testing a new search ranking algorithm. Overall, the treatment group has a lower click-through rate (CTR) than control. But when you segment by device type:
| Segment | Control CTR | Treatment CTR |
|---|---|---|
| Mobile | 12 % | 14 % |
| Desktop | 28 % | 31 % |
| Overall | 22 % | 17 % |
How? The treatment was launched with a traffic ramp that accidentally sent more mobile users (who have lower baseline CTR) to treatment. Mobile is 80 % of treatment traffic but only 40 % of control traffic. The aggregate treatment average is dragged down by the composition, not by the algorithm being worse.
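The arithmetic behind the reversal can be checked directly. The sketch below is only an illustration: it takes the segment CTRs from the table and the 80/20 and 40/60 traffic mixes stated above, and assumes those shares refer to the impressions that form the CTR denominator. With these inputs the aggregates work out to about 21.6 % for control and 17.4 % for treatment.

```python
# Segment-level CTRs from the table and the traffic mix described in the text.
ctr = {
    "control":   {"mobile": 0.12, "desktop": 0.28},
    "treatment": {"mobile": 0.14, "desktop": 0.31},
}
# Assumption: shares are fractions of each arm's impressions.
mix = {
    "control":   {"mobile": 0.40, "desktop": 0.60},
    "treatment": {"mobile": 0.80, "desktop": 0.20},
}

def aggregate_ctr(rates, shares):
    """Aggregate CTR is the share-weighted mean of the segment CTRs."""
    return sum(rates[seg] * shares[seg] for seg in rates)

overall = {arm: aggregate_ctr(ctr[arm], mix[arm]) for arm in ctr}
for arm, value in overall.items():
    print(f"{arm}: {value:.1%}")
```

Treatment wins in both segments, yet its aggregate is lower because 80 % of its weight sits on the low-CTR mobile segment.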
Why it happens in A/B tests
Pure random user-level assignment should, in expectation, balance all covariates. In practice, randomization can be imperfect: bucket collisions in a hashing scheme, a time-based launch that inadvertently coincides with a shift in the user population, or a platform or country that comes online mid-experiment. The result is that a confounding variable (device type, country, user tenure) is unequally distributed between arms.
Detection
Run a Sample Ratio Mismatch (SRM) check first. If the observed treatment-to-control ratio deviates significantly from the intended split (e.g., you targeted 50/50 but observe 55/45), randomization is broken and no analysis should be trusted until the root cause is found. Once the overall SRM check passes, segment the primary metric by all major covariates and look for segments whose direction of effect contradicts the aggregate. Compare the arm split within each segment as well, since the overall split can look clean while the mix inside it is skewed.
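A minimal sketch of both checks follows, using only the standard library. The counts are hypothetical, chosen to reproduce the example rates, and each assigned unit is treated as one impression for simplicity. The SRM check is a chi-square goodness-of-fit test with one degree of freedom; applying it per segment is an illustrative addition, and the direction check compares signs only, without testing whether segment effects are statistically significant.

```python
import math

# Hypothetical (units, clicks) per segment and arm, matching the example rates.
data = {
    "mobile":  {"control": (4000, 480),  "treatment": (8000, 1120)},
    "desktop": {"control": (6000, 1680), "treatment": (2000, 620)},
}

def srm_p_value(n_control, n_treatment, treatment_share=0.5):
    """Chi-square goodness-of-fit p-value for the observed arm split."""
    total = n_control + n_treatment
    exp_t = total * treatment_share
    exp_c = total - exp_t
    chi2 = (n_treatment - exp_t) ** 2 / exp_t + (n_control - exp_c) ** 2 / exp_c
    # With 1 degree of freedom, the chi-square survival function is erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))

def ctr(units, clicks):
    return clicks / units

def arm_totals(arm):
    """Sum units and clicks for one arm across all segments."""
    units = sum(seg[arm][0] for seg in data.values())
    clicks = sum(seg[arm][1] for seg in data.values())
    return units, clicks

n_c, clicks_c = arm_totals("control")
n_t, clicks_t = arm_totals("treatment")

# Overall split versus the split inside each segment.
overall_p = srm_p_value(n_c, n_t)
segment_p = {
    name: srm_p_value(seg["control"][0], seg["treatment"][0])
    for name, seg in data.items()
}

# Segments whose lift has the opposite sign to the aggregate lift.
aggregate_lift = ctr(n_t, clicks_t) - ctr(n_c, clicks_c)
reversed_segments = [
    name for name, seg in data.items()
    if (ctr(*seg["treatment"]) - ctr(*seg["control"])) * aggregate_lift < 0
]

print("overall SRM p-value:", overall_p)
print("per-segment SRM p-values:", segment_p)
print("segments contradicting the aggregate:", reversed_segments)
```

In this constructed data both arms have 10,000 units, so the overall check passes, while the per-segment splits are far from 50/50 and both segments contradict the aggregate direction.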
Resolution
- If SRM is detected: do not analyze. Find and fix the assignment bug; re-run the experiment.
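- If SRM is clean but the paradox is present: report the subgroup-level estimates. Weight them by the target population distribution rather than the observed sample composition (post-stratification). Provided assignment was effectively random within each segment, the weighted estimate is the one to report as the effect for the target population; the raw aggregate is not.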
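Post-stratification for the example can be sketched as below. The segment CTRs come from the table; the 60/40 target mix is a hypothetical value for illustration and would in practice come from the population the change will ship to. The sketch returns a point estimate only and leaves out the variance calculation.

```python
# Segment CTRs from the example table.
ctr = {
    "control":   {"mobile": 0.12, "desktop": 0.28},
    "treatment": {"mobile": 0.14, "desktop": 0.31},
}

# Hypothetical target population mix (an assumption, not from the example).
target_mix = {"mobile": 0.60, "desktop": 0.40}

def post_stratified_lift(rates, weights):
    """Weight each segment's lift by its share of the target population."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(
        w * (rates["treatment"][seg] - rates["control"][seg])
        for seg, w in weights.items()
    )

lift = post_stratified_lift(ctr, target_mix)
print(f"post-stratified lift: {lift * 100:+.1f} percentage points")
```

Because both arms are weighted by the same mix, the estimate is positive, in line with the segment-level results, whichever reasonable target mix is chosen.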
On large platforms (Meta, Google), automated SRM detection and stratified reporting are built into the experiment infrastructure precisely because composition imbalances are easy to miss manually.