What is the multiple comparisons problem, and how does Bonferroni correction address it?
Running many hypothesis tests simultaneously inflates the probability of at least one false positive, known as the family-wise error rate (FWER), far above the nominal alpha. Bonferroni correction addresses this by dividing alpha by the number of tests, which guarantees the FWER stays at or below alpha, at the cost of reduced power per test.
How to think about it
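The multiple comparisons problem is one of the main drivers of false discoveries in data science and scientific research. Whenever several tests on the same dataset feed into one conclusion, the error rate of the whole family of tests needs to be controlled, not just the error rate of each test on its own.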
The inflation arithmetic
Suppose each test is run at alpha = 0.05 and all null hypotheses are true. The probability of at least one false positive across m independent tests is:
FWER = 1 - (1 - 0.05)^m
| Tests (m) | FWER |
|---|---|
| 1 | 0.050 |
| 5 | 0.226 |
| 10 | 0.401 |
| 20 | 0.642 |
| 100 | 0.994 |
Running 20 independent tests at alpha = 0.05 gives nearly a 2-in-3 chance of at least one spurious rejection.
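The table values follow directly from the formula. A minimal sketch that recomputes them, assuming independent tests with every null hypothesis true (the function name `fwer_independent` is illustrative, not from any library):

```python
def fwer_independent(m: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive across m independent
    tests when every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** m


# Recompute the rows of the table above.
for m in (1, 5, 10, 20, 100):
    print(f"m = {m:>3}: FWER = {fwer_independent(m):.3f}")
```

With correlated tests the exact value differs, but the FWER still grows with m.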
Bonferroni correction
The simplest fix: test each comparison at alpha_adjusted = alpha / m.
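For alpha = 0.05 and m = 20, test each at 0.0025. This guarantees FWER <= alpha regardless of the correlation structure between tests, because the probability of at least one false rejection is at most the sum of the per-test false rejection probabilities: m * (alpha / m) = alpha. That bound is loose when tests are positively correlated, which is why the method is conservative in that case.
Reject H_i if p_i < alpha / m, or equivalently compare the adjusted p-values min(1, m * p_i) to alpha.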
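The rejection rule can be sketched in a few lines of plain Python. The function name `bonferroni` and the example p-values are hypothetical, chosen only for illustration:

```python
def bonferroni(p_values, alpha=0.05):
    """Return (reject, adjusted) lists for a family of p-values.

    reject[i] is True when p_i < alpha / m (the strict inequality used
    in the text); adjusted[i] is m * p_i, capped at 1.
    """
    m = len(p_values)
    threshold = alpha / m
    reject = [p < threshold for p in p_values]
    adjusted = [min(1.0, m * p) for p in p_values]
    return reject, adjusted


# Hypothetical p-values from four tests; the per-test threshold is 0.05 / 4 = 0.0125.
reject, adjusted = bonferroni([0.001, 0.012, 0.03, 0.2])
print(reject)    # [True, True, False, False]
print(adjusted)
```

Reporting adjusted p-values rather than a moved threshold lets readers compare each result against the familiar alpha directly.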
Limitations and alternatives
Bonferroni is conservative: when tests are positively correlated (e.g., related metrics), the effective number of independent comparisons is smaller than m, so dividing alpha by m over-corrects and loses power.
More powerful alternatives that still control FWER:
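- Holm-Bonferroni (step-down): Sort the p-values in ascending order and compare the k-th smallest to alpha / (m - k + 1), stopping at the first one that fails; that hypothesis and all later ones are retained. The first threshold equals Bonferroni's alpha / m and each later one is looser, so Holm rejects everything Bonferroni rejects and sometimes more, with the same FWER guarantee.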
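A minimal sketch of the Holm step-down procedure, assuming the standard rule of comparing the k-th smallest p-value to alpha / (m - k + 1) and stopping at the first failure. The function name `holm` and the example p-values are hypothetical:

```python
def holm(p_values, alpha=0.05):
    """Holm-Bonferroni step-down: returns a list of reject decisions
    in the same order as the input p-values."""
    m = len(p_values)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):  # rank 0 is the smallest p-value
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # first failure: retain this and all larger p-values
    return reject


# Hypothetical p-values; thresholds are 0.0125, 0.0167, 0.025, 0.05.
print(holm([0.001, 0.013, 0.02, 0.2]))  # [True, True, True, False]
```

On these example values, plain Bonferroni (threshold 0.0125 for every test) would reject only the first hypothesis, while Holm rejects three.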
Methods that control False Discovery Rate (FDR) rather than FWER — more lenient, higher power:
- Benjamini-Hochberg (BH): Controls the expected proportion of false positives among all rejections. Standard in genomics and large-scale A/B testing platforms.
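A minimal sketch of the BH step-up procedure, assuming the standard rule: find the largest rank k whose sorted p-value satisfies p_(k) <= (k / m) * q, then reject the k smallest p-values. The function name `benjamini_hochberg`, the target FDR level `q`, and the example p-values are hypothetical. The FDR guarantee for this procedure is usually stated for independent or positively dependent tests.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Benjamini-Hochberg step-up: returns a list of reject decisions
    in the same order as the input p-values."""
    m = len(p_values)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest 1-based rank k with p_(k) <= (k / m) * q; 0 means no rejections.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * q:
            k_max = rank
    reject = [False] * m
    for i in order[:k_max]:
        reject[i] = True
    return reject


# Hypothetical p-values; thresholds are 0.01, 0.02, 0.03, 0.04, 0.05.
print(benjamini_hochberg([0.001, 0.013, 0.02, 0.03, 0.2]))
# [True, True, True, True, False]
```

Unlike Holm, the scan does not stop at the first failure: a small p-value that misses its own threshold is still rejected if a larger one further down the list meets its threshold.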
When does it matter?
- A/B tests with many simultaneous metric comparisons (revenue, DAU, clicks, retention — each is a separate test).
- Post-hoc pairwise comparisons after ANOVA.
- Feature importance screening across hundreds of features.
- Any analysis where you report “the most significant finding” chosen after looking at all results.