What is p-hacking and how does multiple testing inflate false-positive rates?
P-hacking is the practice of making analytic choices (selecting metrics, segments, or time windows) after seeing the data, guided by which choices produce p < 0.05. Multiple testing creates the same problem without any intent: if you test many hypotheses at alpha = 0.05 and none of the effects are real, you should still expect about one false positive per 20 tests.
How to think about it
The arithmetic of multiple testing
If you run k independent tests each at alpha = 0.05, the probability that at least one produces a false positive under the global null is:
FWER = 1 - (1 - 0.05)^k
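For k = 5, FWER ≈ 23 %; for k = 20, FWER ≈ 64 %. On a dashboard with 20 metrics and no real effects, there is roughly a 64 % chance that at least one metric turns green by chance, and the expected number of false positives is 20 × 0.05 = 1. When no effect is real, declaring victory on whichever metric turns green means celebrating noise more often than not.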
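The arithmetic above can be checked with a few lines of Python. This is a minimal sketch; the function names `fwer` and `expected_false_positives` are illustrative, and the FWER formula assumes independent tests with every null hypothesis true.

```python
def fwer(k: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive across k independent
    tests when every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** k


def expected_false_positives(k: int, alpha: float = 0.05) -> float:
    """Expected number of false positives under the global null.
    Unlike the FWER formula, this does not require independence."""
    return k * alpha


for k in (1, 5, 20):
    print(
        f"k={k:>2}  FWER={fwer(k):.1%}  "
        f"expected false positives={expected_false_positives(k):.2f}"
    )
```

For k = 5 and k = 20 this gives 22.6 % and 64.2 %, matching the rounded figures above.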
Common p-hacking patterns to recognize
- Slicing by user segments (mobile vs. desktop, new vs. returning) until a sub-group “works,” then presenting only that slice.
- Changing the metric definition mid-experiment (e.g., switching from 7-day retention to 14-day after seeing 7-day miss).
- Extending the test window after it misses significance, then stopping as soon as the p-value crosses the threshold.
- Including or excluding a date range due to an “anomaly” that was not pre-specified.
- Testing multiple variants and reporting only the best-performing one against control.
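To see why the window-extension pattern in the list above is harmful, it can help to simulate it. The sketch below runs A/A tests (both arms drawn from the same distribution, so any "win" is a false positive) and compares an analyst who checks after every batch and stops at the first significant result against one who looks only once at the pre-planned end. It assumes normally distributed observations with known unit variance and a two-sided z-test at 1.96; the batch size, number of looks, and function names are arbitrary choices for illustration.

```python
import math
import random


def z_stat(sum_a: float, sum_b: float, n: int) -> float:
    """Two-sample z statistic for equal-size arms with known unit variance."""
    return (sum_a - sum_b) / math.sqrt(2.0 * n)


def simulate_aa_test(rng: random.Random, looks: int, batch: int,
                     z_crit: float = 1.96):
    """Run one A/A test (no true effect). Returns two booleans:
    (significant at any look, significant at the final look)."""
    sum_a = sum_b = 0.0
    n = 0
    hit_any_look = False
    z = 0.0
    for _ in range(looks):
        # Collect one more batch per arm; both arms share one distribution.
        sum_a += sum(rng.gauss(0.0, 1.0) for _ in range(batch))
        sum_b += sum(rng.gauss(0.0, 1.0) for _ in range(batch))
        n += batch
        z = z_stat(sum_a, sum_b, n)
        if abs(z) > z_crit:
            hit_any_look = True  # a peeking analyst would stop and ship here
    return hit_any_look, abs(z) > z_crit


def false_positive_rates(n_sims: int = 1000, looks: int = 10,
                         batch: int = 100, seed: int = 0):
    """Estimate false-positive rates for peeking vs. a single final look."""
    rng = random.Random(seed)
    peeking = fixed = 0
    for _ in range(n_sims):
        any_look, final_look = simulate_aa_test(rng, looks, batch)
        peeking += any_look
        fixed += final_look
    return peeking / n_sims, fixed / n_sims


peek_rate, fixed_rate = false_positive_rates()
print(f"stop at first significant look: {peek_rate:.1%}")
print(f"single pre-planned final look:  {fixed_rate:.1%}")
```

The single-look rate should land near the nominal 5 %, while the peeking rate should come out well above it, even though no individual test was run incorrectly.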
Corrections and mitigations
| Approach | What it controls | Trade-off |
|---|---|---|
| Bonferroni | Family-wise error rate (FWER) | Very conservative; low power |
| Holm-Bonferroni | FWER | Less conservative than Bonferroni |
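| Benjamini-Hochberg | False discovery rate (FDR) | Tolerates a set expected share of false positives among the results declared significant |
| Pre-registration | Analytic flexibility: fixes metrics, segments, and tests before data is seen | Requires discipline before launch; multiple pre-registered tests still need a correction |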
For a standard product experiment with 3–5 metrics, Bonferroni is practical: divide alpha by the number of tests (e.g., 0.05 / 5 = 0.01 per test). For exploratory analyses across many segments, Benjamini-Hochberg controlling FDR at 10 % is a reasonable choice.
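The three corrections in the table can be sketched in plain Python to show how they differ on the same inputs. This is an illustrative implementation, not a replacement for a vetted statistics library; the p-values are hypothetical, and the Benjamini-Hochberg guarantee assumes independent or positively dependent tests.

```python
def bonferroni(pvals, alpha=0.05):
    """Reject H_i when p_i <= alpha / m. Controls FWER."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]


def holm(pvals, alpha=0.05):
    """Holm step-down: compare the j-th smallest p-value (j = 0, 1, ...)
    to alpha / (m - j) and stop at the first failure. Controls FWER."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if pvals[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # every larger p-value is also retained
    return reject


def benjamini_hochberg(pvals, q=0.10):
    """BH step-up: find the largest 1-based rank r with p_(r) <= r * q / m
    and reject the r smallest p-values. Controls FDR at level q."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * q / m:
            cutoff = rank
    reject = [False] * m
    for i in order[:cutoff]:
        reject[i] = True
    return reject


# Hypothetical p-values for five metrics, for illustration only.
pvals = [0.004, 0.011, 0.030, 0.041, 0.200]
print("Bonferroni: ", bonferroni(pvals))
print("Holm:       ", holm(pvals))
print("BH (q=0.10):", benjamini_hochberg(pvals))
```

On these inputs Bonferroni rejects one hypothesis, Holm rejects two, and Benjamini-Hochberg at q = 0.10 rejects four, which illustrates the ordering from most to least conservative.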
Pre-registration is the strongest fix
Commit in writing — in the experiment system — to the primary metric, the secondary metrics, the segments you will analyze, and the statistical test, all before launch. Post-hoc analysis is labeled “exploratory” and cannot be used as the basis for a ship decision without a confirmatory test.
A secondary safeguard is to require replication: any result that was not pre-registered as primary should be validated in a follow-up experiment before shipping.
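One way to make these rules enforceable is to store the pre-registration as a structured record and label every later analysis against it. The sketch below is hypothetical throughout: the `PreRegistration` class, its field names, the label strings, and the convention that only the primary metric on the full population counts as confirmatory are all assumptions for illustration, not features of any particular experiment system.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class PreRegistration:
    """Hypothetical pre-registration record; field names are illustrative."""
    experiment_id: str
    registered_on: date
    primary_metric: str
    secondary_metrics: frozenset
    segments: frozenset
    statistical_test: str


def classify(plan: PreRegistration, metric: str, segment: str = "all") -> str:
    """Label an analysis by whether it was committed to before launch."""
    known_metric = (metric == plan.primary_metric
                    or metric in plan.secondary_metrics)
    known_segment = segment == "all" or segment in plan.segments
    if metric == plan.primary_metric and segment == "all":
        return "confirmatory"
    if known_metric and known_segment:
        # Pre-registered but not primary: validate in a follow-up first.
        return "pre-registered, needs replication"
    return "exploratory"


def can_ship_on(plan: PreRegistration, metric: str, segment: str = "all") -> bool:
    """Only the pre-registered primary analysis supports a ship decision."""
    return classify(plan, metric, segment) == "confirmatory"


plan = PreRegistration(
    experiment_id="exp-001",          # hypothetical identifier
    registered_on=date(2024, 1, 15),  # hypothetical date
    primary_metric="retention_7d",
    secondary_metrics=frozenset({"sessions_per_user"}),
    segments=frozenset({"mobile", "desktop"}),
    statistical_test="two-sample z-test",
)

print(classify(plan, "retention_7d"))                 # primary, full population
print(classify(plan, "sessions_per_user", "mobile"))  # registered secondary
print(classify(plan, "retention_14d"))                # metric switched post hoc
```

Making the record immutable (`frozen=True`) mirrors the intent of committing before launch: the plan cannot be quietly edited after results arrive.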