What is the peeking problem in A/B testing, and how do you handle it?
Peeking means checking statistical significance repeatedly during a test and stopping as soon as p drops below 0.05. Because the p-value fluctuates as data accumulates, this inflates the false-positive rate well above the nominal alpha: to around 20 % for daily checks over a two-week test, and higher still with more frequent looks.
How to think about it
Why peeking inflates errors
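Under the null hypothesis, the p-value at any single look is uniformly distributed, so it falls below 0.05 only 5 % of the time. But if you check repeatedly and stop the moment p < 0.05, you are in effect taking the minimum of several draws. If the looks were independent, the minimum of just 5 draws would fall below 0.05 with probability 1 − 0.95^5 ≈ 23 %. Real looks share data and are positively correlated, so the inflation is smaller (about 14 % for 5 equally spaced looks), but it is still far above the nominal 5 % and keeps growing with every extra look. You have not made your test more powerful; you have made it far more prone to false alarms.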
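The inflation is easy to check with a small A/A simulation. The sketch below is illustrative only: the function name, the 14 looks, the 200 users per arm per look, and the unit-variance normal metric are all assumptions chosen for the example, not values from any particular platform.

```python
import numpy as np

def peeking_false_positive_rate(n_looks=14, n_per_look=200,
                                n_sims=20000, z_crit=1.96, seed=0):
    """Simulate A/A tests (no true effect) and return
    (false-positive rate when stopping at the first significant look,
     false-positive rate when testing only at the final look)."""
    rng = np.random.default_rng(seed)
    # Difference between the two arms' sums for each look. Each arm adds
    # n_per_look unit-variance observations per look, so the difference
    # of sums has variance 2 * n_per_look.
    increments = rng.normal(0.0, np.sqrt(2 * n_per_look),
                            size=(n_sims, n_looks))
    cum_diff = np.cumsum(increments, axis=1)
    n_cum = n_per_look * np.arange(1, n_looks + 1)
    # z-statistic at every look, computed on all data seen so far.
    z = cum_diff / np.sqrt(2 * n_cum)
    reject = np.abs(z) > z_crit
    peeking = reject.any(axis=1).mean()   # stop at the first "significant" look
    fixed = reject[:, -1].mean()          # single analysis at the horizon
    return peeking, fixed

if __name__ == "__main__":
    peeking, fixed = peeking_false_positive_rate()
    print(f"peeking at every look: {peeking:.3f}")
    print(f"single final analysis: {fixed:.3f}")
```

The fixed-horizon rate should land near the nominal 5 %, while the peeking rate should come out several times larger, even though there is no true effect in any simulated test.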
Solutions
- Pre-commit to a single analysis at the planned horizon. Simplest and most powerful. Do not look at results until the experiment is done.
- Sequential testing / always-valid p-values. Frameworks such as mSPRT (mixture Sequential Probability Ratio Test), or the sequential methods used at Spotify and Booking.com, provide error guarantees that hold at any stopping time. The trade-off is somewhat lower power at a given sample size than a fixed-horizon test.
- Alpha spending functions (e.g., O’Brien-Fleming). Borrowed from clinical trials. You budget the total alpha across planned interim looks, using very conservative thresholds early (e.g., roughly p < 0.005 at 50 % of the data) and relaxing them toward the end, so the final threshold stays close to the nominal 0.05.
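- Bayesian testing with a credible interval. Decisions are based on the posterior probability that the lift exceeds the MDE (minimum detectable effect), not on a p-value. Posterior inference does not depend on the stopping rule, so there is no formal peeking problem in the Bayesian sense. However, stopping as soon as the posterior crosses a threshold still inflates the frequentist false-positive rate, and conclusions can be sensitive to the choice of prior.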
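To give a feel for how an alpha budget is spread over interim looks, here is a minimal sketch of the Lan-DeMets O'Brien-Fleming-type spending function, which gives the cumulative two-sided alpha spent at information fraction t as 2 − 2Φ(z_{α/2} / √t). The function name and the four equally spaced looks are assumptions for illustration. Note that this returns cumulative alpha spent, which equals the nominal p-value threshold only at the first look; exact thresholds for later looks require numerical integration over the joint distribution of the test statistics, which dedicated group-sequential software handles.

```python
from statistics import NormalDist

def obf_alpha_spent(t, alpha=0.05):
    """Cumulative two-sided alpha spent at information fraction t (0 < t <= 1)
    under the Lan-DeMets O'Brien-Fleming-type spending function."""
    if not 0 < t <= 1:
        raise ValueError("information fraction t must be in (0, 1]")
    nd = NormalDist()
    z = nd.inv_cdf(1 - alpha / 2)          # e.g. about 1.96 for alpha = 0.05
    return 2 * (1 - nd.cdf(z / t ** 0.5))  # tiny early, reaches alpha at t = 1

if __name__ == "__main__":
    # Hypothetical schedule: four equally spaced looks.
    for t in (0.25, 0.5, 0.75, 1.0):
        print(f"t = {t:.2f}: cumulative alpha spent = {obf_alpha_spent(t):.5f}")
```

Almost none of the budget is spent early: at half the data the cumulative spend is still well below 0.01, and the full alpha is reached only at the final look. That is why this family costs so little power relative to a single fixed-horizon analysis.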
In practice, use a sequential testing framework if your organization requires continuous monitoring, and educate stakeholders that, in a fixed-horizon test, a significant result on day 3 of a 14-day test is not a valid stopping criterion.
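To make the sequential option concrete, the following is a simplified sketch of an mSPRT-style always-valid p-value for the mean of paired differences between treatment and control. It assumes normally distributed differences with a known variance `var_diff` and a N(0, `tau2`) mixing distribution over the true effect. The function name and these parameters are hypothetical, and this is not the implementation of any specific vendor or company.

```python
import numpy as np

def msprt_always_valid_p(diffs, var_diff, tau2=1.0):
    """Running always-valid p-values for H0: mean(diffs) == 0.

    diffs    : per-pair differences (treatment minus control), in arrival order
    var_diff : assumed known variance of a single difference
    tau2     : variance of the normal mixing distribution over the effect
    """
    diffs = np.asarray(diffs, dtype=float)
    n = np.arange(1, diffs.size + 1)
    mean = np.cumsum(diffs) / n
    # Log of the mixture likelihood ratio after n observations.
    log_lr = (0.5 * np.log(var_diff / (var_diff + n * tau2))
              + (n ** 2 * tau2 * mean ** 2)
              / (2 * var_diff * (var_diff + n * tau2)))
    # p_n = min(p_{n-1}, 1 / LR_n), starting from p_0 = 1.
    p = np.minimum(1.0, np.exp(-log_lr))
    return np.minimum.accumulate(p)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    p = msprt_always_valid_p(rng.normal(0.2, 1.0, size=2000), var_diff=1.0)
    crossed = np.flatnonzero(p < 0.05)
    print("first n with p < 0.05:", crossed[0] + 1 if crossed.size else None)
```

Because the running p-value never increases and the chance that it ever drops below alpha under the null is at most alpha, you can check it after every observation and stop the first time it crosses the threshold. The choice of `tau2` affects power but not validity; in practice the variance is estimated rather than known, which makes the guarantee approximate.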