datarekha
Statistics & Probability · Hard · Asked at Meta, Google, Amazon, Booking

What is the peeking problem in A/B testing, and how do you handle it?

The short answer

Peeking means checking statistical significance repeatedly during a test and stopping as soon as p drops below 0.05. Because the p-value fluctuates as data accumulate, this inflates the false-positive rate well above the nominal alpha: to roughly 20–25 % for daily checks over a two-week test.

How to think about it

Why peeking inflates errors

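Under the null hypothesis, the p-value is uniformly distributed at any single pre-planned look. But if you check repeatedly and stop the moment p < 0.05, you are effectively taking the minimum of many p-values. If the looks were independent, the minimum of even 5 draws would have roughly a 23 % chance (1 − 0.95⁵) of falling below 0.05. In a real test the looks share accumulating data and are positively correlated, so the inflation is smaller (about 14 % for five equally spaced looks) but still far above 5 %. You have not made your test more powerful; you have made it far more prone to false alarms.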

[Figure: p-value (y-axis, 0 to 1) against time in days (x-axis), with a horizontal line at α = 0.05 and three marked peeks.]
p-value random walk under H₀. Each crossing of α = 0.05 is a potential false-positive stop.
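The inflation is easy to check by simulation. The sketch below is a minimal A/A experiment under simplifying assumptions that are not part of the original answer: each look adds one equally sized batch of data, the standardized batch effect is N(0, 1) under H₀, and the test is a two-sided z-test. The function name `peeking_false_positive_rate` is hypothetical.

```python
import random
from statistics import NormalDist


def peeking_false_positive_rate(n_looks=14, alpha=0.05, n_sims=20_000, seed=0):
    """Estimate how often an A/A test is declared 'significant' when peeking.

    Assumption: every look adds one equally sized batch, and under H0 the
    standardized batch effect is N(0, 1). After k batches the z-statistic
    is therefore S_k / sqrt(k), where S_k is the running sum.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    false_stops = 0
    for _ in range(n_sims):
        running_sum = 0.0
        for k in range(1, n_looks + 1):
            running_sum += rng.gauss(0.0, 1.0)
            # Stop at the first look where the naive test is significant.
            if abs(running_sum / k ** 0.5) > z_crit:
                false_stops += 1
                break
    return false_stops / n_sims


if __name__ == "__main__":
    for looks in (1, 5, 14):
        rate = peeking_false_positive_rate(n_looks=looks)
        print(f"{looks:>2} looks: false-positive rate ~ {rate:.3f}")
```

With a single look the estimate should sit near the nominal 5 %, and it should climb steadily as the number of looks grows, even though no real effect exists in any simulated experiment.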

Solutions

  1. Pre-commit to a single analysis at the planned horizon. Simplest and most powerful. Do not look at results until the experiment is done.

  2. Sequential testing / always-valid p-values. Frameworks such as mSPRT (mixture Sequential Probability Ratio Test), or the sequential methods used at Spotify and Booking.com, provide error guarantees that hold at any stopping time. The trade-off is slightly lower power at a given sample size than a fixed-horizon test.

  3. Alpha spending functions (O’Brien-Fleming). Borrowed from clinical trials. You budget the total alpha across planned interim looks, using very conservative thresholds early (e.g., roughly p < 0.005 at 50 % of data) and relaxing them toward the end, so the final threshold stays close to the nominal 0.05.

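  4. Bayesian testing with a credible interval. Decisions are based on the posterior probability that the lift exceeds the MDE, not on a p-value. The posterior does not depend on the stopping rule, so there is no formal multiple-comparison problem in the Bayesian sense. Stopping as soon as the posterior crosses a threshold can still inflate the frequentist false-positive rate, though, and conclusions remain sensitive to the choice of prior.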
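To make solution 3 concrete, here is a small sketch of how an alpha budget can be spread over interim looks. It assumes the Lan-DeMets O'Brien-Fleming-type spending function, α(t) = 2 − 2Φ(z_{α/2} / √t), where t is the fraction of the planned sample collected so far; the original answer does not say which spending function it has in mind, and the function name `obf_alpha_spent` is hypothetical.

```python
from statistics import NormalDist


def obf_alpha_spent(info_fraction, alpha=0.05):
    """Cumulative two-sided alpha spent at a given information fraction.

    Assumption: Lan-DeMets O'Brien-Fleming-type spending function,
    alpha(t) = 2 - 2 * Phi(z_{alpha/2} / sqrt(t)), for 0 < t <= 1.
    """
    if not 0 < info_fraction <= 1:
        raise ValueError("info_fraction must be in (0, 1]")
    norm = NormalDist()
    z = norm.inv_cdf(1 - alpha / 2)
    return 2 * (1 - norm.cdf(z / info_fraction ** 0.5))


if __name__ == "__main__":
    for t in (0.25, 0.5, 0.75, 1.0):
        print(f"t = {t:.2f}: cumulative alpha spent = {obf_alpha_spent(t):.5f}")
```

Almost none of the budget is spent early: at half the data the cumulative spend is roughly 0.006, and the full 0.05 is reached only at the planned horizon. The spend at the first look is also that look's nominal p-value threshold; thresholds for later looks require numerical integration over the joint distribution of the test statistics, which this sketch omits.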

In practice, the pragmatic answer is: use a sequential testing framework if your organization requires continuous monitoring, and educate stakeholders that a significant result on day 3 of a 14-day test is not a valid stopping criterion.
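For intuition on what a sequential framework computes, here is a minimal sketch of an mSPRT-style always-valid p-value. It rests on assumptions that go beyond the original answer: the data are a stream of treatment-minus-control differences that are normal with known variance `sigma2`, the null is a mean difference of zero, and the mixing distribution over the true effect is N(0, `tau2`). The function name `msprt_p_values` and both parameters are hypothetical; production systems differ in many details.

```python
import math


def msprt_p_values(diffs, sigma2, tau2=1.0):
    """Always-valid p-values for H0: mean(diffs) == 0.

    Assumptions: diffs are i.i.d. normal with known variance sigma2, and
    the alternative is mixed over effect ~ N(0, tau2). After n observations
    with running sum S_n, the mixture likelihood ratio is
        sqrt(sigma2 / (sigma2 + n*tau2))
        * exp(tau2 * S_n**2 / (2 * sigma2 * (sigma2 + n*tau2))).
    The p-value is the running minimum of 1 / likelihood ratio, capped at 1.
    """
    p_value = 1.0
    running_sum = 0.0
    p_values = []
    for n, x in enumerate(diffs, start=1):
        running_sum += x
        denom = sigma2 + n * tau2
        log_lr = (0.5 * math.log(sigma2 / denom)
                  + tau2 * running_sum ** 2 / (2 * sigma2 * denom))
        p_value = min(p_value, math.exp(-log_lr))
        p_values.append(p_value)
    return p_values


if __name__ == "__main__":
    import random

    rng = random.Random(1)
    null_stream = [rng.gauss(0.0, 1.0) for _ in range(500)]
    print("final p-value under H0:", msprt_p_values(null_stream, sigma2=1.0)[-1])
```

Because the p-value is a running minimum, it never goes back up, so stopping the first time it drops below 0.05 keeps the false-positive rate at or below 5 % no matter how often you look. The price is the lower power noted in solution 2.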
