datarekha
Statistics & Probability | Medium | Asked at Meta, Google, Amazon, Booking

How long should you run an A/B test?

The short answer

Run until you reach the sample size from your pre-test power calculation, and make sure the run covers at least one full weekly cycle so that day-of-week effects average out. Stopping early because results look good and extending because they do not both inflate error rates.

How to think about it

Duration is a power question first

The minimum runtime is the time it takes to accumulate the sample size from your power calculation. If you need 60,000 users in total and 10,000 new unique users enter the experiment per day, the floor is 6 days.
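The arithmetic can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: it assumes a two-sided two-proportion z-test with the usual normal approximation, and the function names, the 10% baseline, and the 1-point lift are hypothetical values chosen for the example.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline, mde_abs, alpha=0.05, power=0.80):
    """Approximate users per arm for a two-sided two-proportion z-test.

    baseline: control conversion rate (assumed, e.g. 0.10)
    mde_abs:  minimum detectable effect as an absolute lift (e.g. 0.01)
    """
    p1 = baseline
    p2 = baseline + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_power = NormalDist().inv_cdf(power)          # quantile for target power
    variance = p1 * (1 - p1) + p2 * (1 - p2)       # sum of per-arm variances
    return math.ceil((z_alpha + z_power) ** 2 * variance / mde_abs ** 2)


def minimum_days(total_users_needed, new_users_per_day):
    """Days needed to accumulate the sample, rounded up to whole days."""
    return math.ceil(total_users_needed / new_users_per_day)


if __name__ == "__main__":
    per_arm = sample_size_per_arm(baseline=0.10, mde_abs=0.01)
    print("users per arm:", per_arm)
    # The example from the text: 60,000 users at 10,000 new users per day.
    print("floor in days:", minimum_days(60_000, 10_000))
```

The daily figure should count users who newly enter the experiment, since returning users do not add to the sample.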

Always span at least one full week

User behavior is strongly seasonal within the week: Monday e-commerce intent differs from Saturday browsing behavior. Randomization keeps the two groups comparable, but a test that runs only Friday through Sunday measures the lift on a weekend-heavy mix of users, and that lift may not hold for weekday traffic. Spanning at least 7 days gives every day of the week equal weight in the estimate. Two weeks is even better for products with strong weekly rhythms.
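One way to combine the sample-size floor with the weekly-cycle rule is to round the planned duration up to a whole number of weeks. The sketch below assumes that convention; the helper name `planned_days` and its `min_weeks` parameter are hypothetical.

```python
import math


def planned_days(days_for_sample_size, min_weeks=1):
    """Round the runtime up to whole weeks, with a minimum number of weeks."""
    weeks_needed = math.ceil(days_for_sample_size / 7)
    return 7 * max(min_weeks, weeks_needed)


if __name__ == "__main__":
    print(planned_days(6))               # a 6-day floor still runs a full week
    print(planned_days(9))               # 9 days of traffic becomes two weeks
    print(planned_days(3, min_weeks=2))  # strong weekly rhythm: force two weeks
```

Rounding to whole weeks means each weekday appears the same number of times in both groups' data.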

Do not extend to chase significance

If the test ends at the planned horizon with p = 0.08, the correct answer is “no statistically significant effect at the pre-specified significance level.” Extending the runtime to squeeze out significance is a form of peeking (see the peeking problem question) and inflates the false-positive rate. A non-significant result at the planned horizon is a valid result.
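A small simulation makes the inflation concrete. The sketch below is a simplified model under stated assumptions: there is no true effect (an A/A test), the test statistic is a z-score with known variance, and the experimenter doubles the sample exactly once whenever the planned analysis is not significant. The function name and parameters are hypothetical.

```python
import math
import random


def false_positive_rate(extend_if_not_significant, n_sims=20_000,
                        z_crit=1.96, seed=0):
    """Share of A/A tests declared significant under a given stopping policy."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        # z-statistic at the planned horizon; standard normal under the null.
        z_planned = rng.gauss(0, 1)
        significant = abs(z_planned) > z_crit
        if not significant and extend_if_not_significant:
            # z-statistic of an independent, equal-sized extra batch.
            z_extra = rng.gauss(0, 1)
            # Pooling two equal-sized batches gives (z1 + z2) / sqrt(2).
            z_pooled = (z_planned + z_extra) / math.sqrt(2)
            significant = abs(z_pooled) > z_crit
        rejections += significant
    return rejections / n_sims


if __name__ == "__main__":
    print("fixed horizon:     ", false_positive_rate(False))
    print("extend when p>0.05:", false_positive_rate(True))
```

With the fixed horizon the rate stays near the nominal 5%, while the extend-once policy pushes it noticeably higher, even though each individual analysis uses the same threshold.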

Do not stop early to save time

Stopping at day 4 because the treatment is already at p = 0.02 inflates the false-positive rate, and the point estimate you report is likely to be exaggerated (winner’s curse): conditional on stopping early, the observed effect tends to be larger than the true population effect.
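The winner's curse can also be shown by simulation. This is a rough sketch under assumptions: the early estimate of the lift is normally distributed around a small true lift with a large early-stage standard error, and the test is stopped only when the treatment looks significantly better. All numbers and the function name are hypothetical.

```python
import random
from statistics import mean


def early_stop_estimates(true_lift, std_error, n_sims=20_000,
                         z_crit=1.96, seed=0):
    """Return the observed lifts from the runs that would have stopped early."""
    rng = random.Random(seed)
    stopped = []
    for _ in range(n_sims):
        # Noisy early estimate of the lift, centred on the true value.
        estimate = rng.gauss(true_lift, std_error)
        # Stop early only if the treatment looks significantly better.
        if estimate / std_error > z_crit:
            stopped.append(estimate)
    return stopped


if __name__ == "__main__":
    true_lift = 0.01  # assumed true absolute lift
    winners = early_stop_estimates(true_lift, std_error=0.01)
    print("true lift:                    ", true_lift)
    print("mean lift among early winners:", round(mean(winners), 4))
    print("share of runs stopped early:  ", len(winners) / 20_000)
```

Every run that clears the threshold must report a lift of at least `z_crit * std_error`, so when the true lift is smaller than that, the early winners overstate it by construction.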

Practical upper bound

Very long tests (longer than 4–6 weeks) introduce risks of their own: novelty effects decay partway through, the product environment changes, and seasonality (e.g., holiday traffic) contaminates the sample. If you cannot reach a sufficient sample size within 4–6 weeks, reconsider whether the MDE (minimum detectable effect) is achievable, whether more traffic can be allocated, or whether a variance-reduction technique (e.g., CUPED) or a different experimental design is appropriate.
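To check whether an MDE is achievable, the power formula can be inverted: given the traffic available within the maximum runtime, what is the smallest lift the test can detect? The sketch below assumes a two-sided two-proportion z-test, a 50/50 split, and the baseline variance in both arms; the traffic figures and function name are hypothetical.

```python
import math
from statistics import NormalDist


def detectable_lift(baseline, users_per_arm, alpha=0.05, power=0.80):
    """Approximate smallest absolute lift detectable with the given sample."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    # Standard error of the difference, using baseline variance in both arms.
    std_error = math.sqrt(2 * baseline * (1 - baseline) / users_per_arm)
    return (z_alpha + z_power) * std_error


if __name__ == "__main__":
    new_users_per_day = 2_000  # assumed traffic entering the experiment
    max_days = 42              # six-week upper bound from the text
    users_per_arm = new_users_per_day * max_days // 2
    lift = detectable_lift(baseline=0.10, users_per_arm=users_per_arm)
    print(f"smallest detectable absolute lift in {max_days} days: {lift:.4f}")
```

If the result is larger than any lift the change could plausibly produce, the test is underpowered at that runtime and the plan needs more traffic, a less noisy metric, or a different design.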

Log the planned end date in the experiment configuration system before launch, and do not revisit it unless there is a legitimate operational reason (e.g., a major bug that invalidated a portion of the data).
