Statistics & Probability · Medium · Asked at Meta, Google, Amazon, Booking

What is p-hacking and how does multiple testing inflate false-positive rates?

For Data Scientist, Data Analyst, ML Engineer

The short answer

P-hacking is the practice of making analytic choices (selecting metrics, segments, or time windows) after seeing the data, guided by which choices produce p < 0.05. Multiple testing creates the same problem even without intent: when every null hypothesis is true, testing at alpha = 0.05 yields on average one false positive per 20 tests.

How to think about it

The arithmetic of multiple testing

If you run k independent tests each at alpha = 0.05, the probability that at least one produces a false positive under the global null is:

FWER = 1 - (1 - 0.05)^k

For k = 5: FWER ≈ 23 %. For k = 20: FWER ≈ 64 %. Running a dashboard with 20 metrics and declaring victory on whichever one turns green means that, even if the change does nothing, you should expect about one false positive on average (20 × 0.05 = 1) and have a 64 % chance of seeing at least one.
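The arithmetic above can be checked with a few lines of Python. This is a minimal sketch that assumes independent tests and a true null hypothesis for every test; the function names are illustrative, not taken from any library.

```python
def fwer(k: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive across k independent tests,
    assuming every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** k


def expected_false_positives(k: int, alpha: float = 0.05) -> float:
    """Average number of false positives across k tests under the same assumption."""
    return k * alpha


for k in (1, 5, 10, 20):
    print(f"k={k:>2}  FWER={fwer(k):.1%}  "
          f"expected false positives={expected_false_positives(k):.2f}")
```

The two quantities answer different questions: FWER is the chance of at least one false alarm, while the expected count is how many false alarms you see on average.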

Common p-hacking patterns to recognize

Slicing by user segments (mobile vs. desktop, new vs. returning) until a sub-group “works,” then presenting only that slice.
Changing the metric definition mid-experiment (e.g., switching from 7-day retention to 14-day retention after seeing the 7-day metric miss).
Extending the test window after it misses significance, then stopping as soon as the p-value crosses the threshold.
Including or excluding a date range due to an “anomaly” that was not pre-specified.
Testing multiple variants and reporting only the best-performing one against control.
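One of these patterns, repeatedly checking a test and stopping when it crosses significance, is easy to simulate. The sketch below runs A/A tests (both arms drawn from the same distribution, so every "win" is a false positive) and compares checking at 10 interim looks with checking once at the end. The sample sizes, number of looks, unit-variance normal data, and z-test are all assumptions chosen for illustration.

```python
import math
import random


def two_sided_p(z: float) -> float:
    """Two-sided p-value for a standard-normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def simulate(n_sims=1000, looks=10, per_look=50, alpha=0.05, seed=0):
    """Return (false-positive rate with peeking, false-positive rate at fixed horizon)."""
    rng = random.Random(seed)
    peeking_hits = 0
    fixed_hits = 0
    for _ in range(n_sims):
        sum_a = sum_b = 0.0
        n = 0
        crossed = False
        p = 1.0
        for _ in range(looks):
            # Collect another batch of users in each arm (no true effect).
            for _ in range(per_look):
                sum_a += rng.gauss(0.0, 1.0)
                sum_b += rng.gauss(0.0, 1.0)
            n += per_look
            # z-statistic for a difference in means with known unit variance.
            z = (sum_a / n - sum_b / n) / math.sqrt(2.0 / n)
            p = two_sided_p(z)
            if p < alpha:
                crossed = True  # a peeker would have stopped and declared a win
        peeking_hits += crossed
        fixed_hits += p < alpha  # only the final look counts
    return peeking_hits / n_sims, fixed_hits / n_sims


peeking_rate, fixed_rate = simulate()
print(f"false-positive rate, stop at first significant look: {peeking_rate:.1%}")
print(f"false-positive rate, single look at the end:         {fixed_rate:.1%}")
```

With these settings the peeking rate typically comes out well above the nominal 5 %, while the single final look stays close to it.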

Corrections and mitigations

Approach	What it controls	Trade-off
Bonferroni	Family-wise error rate (FWER)	Very conservative; low power
Holm-Bonferroni	FWER	Less conservative than Bonferroni
Benjamini-Hochberg	False discovery rate (FDR)	Tolerates a chosen expected share of false positives among the discoveries
Pre-registration	Which tests are run and reported, fixed by design	Requires discipline before launch

For a standard product experiment with 3–5 metrics, Bonferroni is practical: divide alpha by the number of tests (e.g., 0.05 / 5 = 0.01 per test). For exploratory analyses across many segments, Benjamini-Hochberg controlling FDR at 10 % is a reasonable choice.
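To make the three corrections concrete, here is a minimal pure-Python sketch of each. The p-values in the example are made up for illustration, and the function names are not from any library; in practice a statistics package would normally be used instead.

```python
def bonferroni(pvals, alpha=0.05):
    """Reject where p <= alpha / m. Controls FWER."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]


def holm(pvals, alpha=0.05):
    """Holm-Bonferroni step-down procedure. Controls FWER."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for rank, i in enumerate(order):  # rank 0 is the smallest p-value
        if pvals[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # stop at the first p-value that fails its threshold
    return reject


def benjamini_hochberg(pvals, q=0.10):
    """Benjamini-Hochberg step-up procedure. Controls FDR at level q."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * q:
            cutoff = rank  # keep the largest rank that passes
    reject = [False] * m
    for i in order[:cutoff]:
        reject[i] = True
    return reject


# Hypothetical p-values for five metrics in one experiment.
pvals = [0.001, 0.011, 0.02, 0.04, 0.30]
print("Bonferroni:", bonferroni(pvals))                 # threshold 0.05 / 5 = 0.01
print("Holm:      ", holm(pvals))
print("BH (q=0.10):", benjamini_hochberg(pvals, q=0.10))
```

On these example values Bonferroni rejects one hypothesis, Holm two, and Benjamini-Hochberg at 10 % FDR four, which matches the ordering from most to least conservative in the table.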

Pre-registration is the strongest fix

Commit in writing — in the experiment system — to the primary metric, the secondary metrics, the segments you will analyze, and the statistical test, all before launch. Post-hoc analysis is labeled “exploratory” and cannot be used as the basis for a ship decision without a confirmatory test.

A secondary safeguard is to require replication: any result that was not pre-registered as primary should be validated in a follow-up experiment before shipping.
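One way to picture the rule is as an immutable record written before launch, against which every later result is labeled. The sketch below is purely illustrative: the class, its fields, and the metric and segment names are hypothetical and do not describe any particular experiment system.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)  # frozen: the plan cannot be edited after it is created
class PreRegistration:
    primary_metric: str
    secondary_metrics: Tuple[str, ...]
    segments: Tuple[str, ...]
    statistical_test: str

    def label(self, metric: str, segment: str) -> str:
        """Classify a result by whether it was committed to before launch."""
        if segment not in self.segments:
            return "exploratory"
        if metric == self.primary_metric:
            return "primary"
        if metric in self.secondary_metrics:
            return "secondary"
        return "exploratory"

    def needs_replication(self, metric: str, segment: str) -> bool:
        """Anything not pre-registered as primary needs a follow-up experiment."""
        return self.label(metric, segment) != "primary"


# Hypothetical plan, written before the experiment starts.
plan = PreRegistration(
    primary_metric="7_day_retention",
    secondary_metrics=("sessions_per_user",),
    segments=("all_users", "new_users"),
    statistical_test="two-sided z-test",
)

print(plan.label("7_day_retention", "all_users"))    # pre-registered primary
print(plan.label("14_day_retention", "all_users"))   # metric switched after launch
print(plan.label("7_day_retention", "mobile"))       # segment found by slicing
```

Here a metric swapped in after launch or a segment discovered by slicing is automatically labeled exploratory, so it cannot quietly become the basis for a ship decision.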
