A/B testing is a sample-size problem wearing a statistics costume

Run a growth experiment for two weeks, collect the data, open your stats tool, see p = 0.08, declare “no effect”, and move on. This is how most A/B tests die — not from fraud, not from bad code, not from a broken null-hypothesis ritual. They die because the test was never capable of seeing what it was looking for. The statistics were fine. The sample was too small. The experiment was doomed on day one.

This failure mode is so common it has a name: an underpowered test (one where the sample is too small to reliably detect a real effect of the size you care about). It plausibly accounts for a large share of “no result” experiments in product, marketing, and e-commerce, and it has a straightforward fix that most teams skip: computing the required sample size before the test starts.

The lift that was always there

Imagine your checkout conversion rate — the fraction of visitors who complete a purchase — sits at 10.0%. You redesign the checkout flow. Internally, someone says the new design “should be worth at least 3 percentage points.” That is a big claim: going from 10.0% to 13.0% is a 30% relative lift, the kind of thing that moves quarterly revenue materially.

You run the test for five days on a limited traffic slice: 50 visitors to the control, 50 to the treatment. Control converts 5 out of 50 (10%). Treatment converts 7 out of 50 (14%). Looks promising. You run the two-proportion z-test (the standard test for comparing two rates) and get z ≈ 0.62, p ≈ 0.54. Not even close to significant. The experiment is called off.

Three months later you run a much larger test: 1,000 visitors per arm over four weeks. Control converts at 10.0%, treatment at 13.0%, consistent with the same underlying lift. This time z ≈ 2.10, p ≈ 0.035. Statistically significant at the conventional 5% threshold. You ship the redesign.

The lift did not change between those two experiments. The sample size did. The first test was not measuring the effect; it was measuring its own noise floor.
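The calculation behind those two results can be sketched in a few lines. This is a minimal version of the pooled two-proportion z-test using only the Python standard library; the function name `two_proportion_z_test` is an illustrative choice, not something from a particular stats tool, and the 1,000-per-arm counts (100 and 130 conversions) are simply the stated rates turned into whole numbers.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_c, n_c, conv_t, n_t):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p_c, p_t = conv_c / n_c, conv_t / n_t
    # Under the null of no difference, both arms share one pooled rate.
    pooled = (conv_c + conv_t) / (n_c + n_t)
    se = sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_t))
    z = (p_t - p_c) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# The five-day test: 5/50 vs 7/50.
small_z, small_p = two_proportion_z_test(5, 50, 7, 50)
# The later test: 10.0% vs 13.0% on 1,000 visitors per arm.
large_z, large_p = two_proportion_z_test(100, 1000, 130, 1000)

print(f"n=50 per arm:    z = {small_z:.2f}, p = {small_p:.3f}")
print(f"n=1,000 per arm: z = {large_z:.2f}, p = {large_p:.3f}")
```

Run as written, the small test should come out around z = 0.6 with a p-value above 0.5, and the large one around z = 2.1 with a p-value near 0.035. The normal approximation is rough at 5 and 7 conversions, so treat the small-sample figure as indicative rather than exact.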

Why small samples can’t see small effects

Statistical significance works by asking: if there were actually no difference between control and treatment, how often would random chance produce a gap this large? A p-value (the probability of observing a result at least this extreme under the null hypothesis of no difference) below 5% is the conventional threshold for rejecting “this is just noise.”

The problem is that random variation in a proportion shrinks as sample size grows — specifically, the standard error of a proportion p measured on n observations is sqrt(p*(1-p)/n). At n = 50 and p = 0.10, that standard error is about 0.042. Two independent arms at that noise level mean your estimate of the gap between them has a standard deviation of roughly 0.060. A true 3-point lift is only half a standard deviation away from zero — you simply cannot resolve it above the noise.

At n = 1,000 per arm, the standard error on each proportion is about 0.0095, and the standard deviation of the gap falls to about 0.013. Now the 3-point lift sits more than two standard deviations above zero. The signal is the same; the noise floor dropped.
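The noise-floor arithmetic above is easy to check directly. The sketch below assumes both arms have the same size and, for simplicity, the same 10% rate when computing the noise on the gap; `standard_error` and `gap_noise` are illustrative helper names.

```python
from math import sqrt


def standard_error(p, n):
    """Standard error of a proportion p measured on n observations."""
    return sqrt(p * (1 - p) / n)


def gap_noise(p, n):
    """Standard deviation of the difference between two independent arms,
    assuming both arms have n observations and the same rate p."""
    return sqrt(2) * standard_error(p, n)


baseline, lift = 0.10, 0.03
for n in (50, 1000):
    se = standard_error(baseline, n)
    noise = gap_noise(baseline, n)
    print(f"n={n}: SE per arm = {se:.4f}, SD of gap = {noise:.4f}, "
          f"lift / noise = {lift / noise:.2f}")
```

The last column is the signal-to-noise ratio: about 0.5 at n = 50 and about 2.2 at n = 1,000, which is the whole difference between the two experiments.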

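Figure: the same 3-point lift (10% → 13%) at two sample sizes. Error bars are 95% confidence intervals. At n = 50 the two intervals overlap almost entirely. At n = 1,000 they are far narrower and overlap only at the edges, and the interval for the difference itself excludes zero.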

Three numbers every experiment needs

Before any test runs, three quantities must be fixed. Most teams fix none of them.

Minimum Detectable Effect (MDE) — the smallest lift you actually care about shipping. This is a business decision masquerading as a statistics decision. If your checkout rate is already 10%, a 0.1-point improvement is probably noise in revenue reporting — it won’t move a quarterly OKR. A 3-point improvement would. Set the MDE to reflect the smallest effect worth acting on, not the biggest you hope for.

Statistical power — the probability that, if the true effect is at least the MDE, your test will actually detect it (return p below your threshold). The conventional target is 80%. That means one in five real effects of that size will still be missed by the test — a sobering reminder that “no result” is not the same as “no effect”.

Significance level (alpha) — the false-positive rate you tolerate. Conventional 5% means that if the null (no difference) is true, you’ll incorrectly declare significance once in twenty tests. In an organization running dozens of tests per quarter, that is multiple false positives per year — all of which will get shipped.

Once you have MDE, power, and alpha, the required sample size per arm follows from a formula. For two proportions p_c and p_t, a common normal-approximation version is n ≈ (z_alpha + z_power)^2 * (p_c*(1-p_c) + p_t*(1-p_t)) / (p_t - p_c)^2, where z_alpha and z_power are the standard normal quantiles for your chosen two-sided alpha and power (1.96 and 0.84 for 5% and 80%). Any sample-size calculator embeds this or a close variant. The point is not to memorize the formula; the point is to accept that the number is determined before you start, not discovered after you peek.

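For the checkout example (baseline 10%, MDE of 3 points, power 80%, two-sided alpha 5%) the required sample is roughly 1,770 visitors per arm. The five-day test with 50 per arm had under 3% of that, which works out to less than 10% power to detect a 3-point lift. Even the later 1,000-per-arm test falls short, with power a little above 55%: it reached significance, but a rerun could easily have missed.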
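For readers who want to see the calculation rather than trust a calculator, here is a hedged sketch. It uses one common normal-approximation variant (pooled variance under the null, separate variances under the alternative) and assumes equal arm sizes; calculators differ slightly in which variant they use, so expect answers in the same neighborhood rather than identical. The names `required_n_per_arm` and `approx_power` are illustrative.

```python
from math import ceil, sqrt
from statistics import NormalDist

_norm = NormalDist()


def _sd_scales(p_c, p_t):
    """Per-observation SD scales of the gap under the null and the alternative."""
    p_bar = (p_c + p_t) / 2
    null_sd = sqrt(2 * p_bar * (1 - p_bar))           # both arms share one rate
    alt_sd = sqrt(p_c * (1 - p_c) + p_t * (1 - p_t))  # arms differ by the true lift
    return null_sd, alt_sd


def required_n_per_arm(p_c, p_t, alpha=0.05, power=0.80):
    """Approximate visitors per arm for a two-sided two-proportion z-test."""
    z_alpha = _norm.inv_cdf(1 - alpha / 2)
    z_power = _norm.inv_cdf(power)
    null_sd, alt_sd = _sd_scales(p_c, p_t)
    return ceil(((z_alpha * null_sd + z_power * alt_sd) / (p_t - p_c)) ** 2)


def approx_power(p_c, p_t, n, alpha=0.05):
    """Approximate power with n visitors per arm (ignores the small chance
    of a significant result in the wrong direction)."""
    z_alpha = _norm.inv_cdf(1 - alpha / 2)
    null_sd, alt_sd = _sd_scales(p_c, p_t)
    return _norm.cdf((abs(p_t - p_c) * sqrt(n) - z_alpha * null_sd) / alt_sd)


n_needed = required_n_per_arm(0.10, 0.13)
print(f"required per arm: {n_needed}")
for n in (50, 1000, n_needed):
    print(f"power at n={n}: {approx_power(0.10, 0.13, n):.0%}")
```

For the checkout example this lands a little under 1,800 per arm, with power in the single digits at 50 per arm and a bit over one half at 1,000 per arm.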

The peeking problem is actually a sample-size problem

Here is a failure mode that feels different but stems from the same root. The test is running. Day three arrives. Someone opens the dashboard. Treatment is up 4 points! They tell a VP. The VP asks when this can ship. Pressure builds to call the test early.

Checking statistical significance repeatedly as data accumulates inflates your false-positive rate dramatically. If you check every day over a 14-day test and stop whenever p drops below 5%, the probability of a spurious significant result when there is no true difference rises to roughly 20–25%, four to five times the nominal rate, and it keeps climbing with more frequent looks. You have corrupted the test’s operating characteristics by treating a fixed-sample test as a sequential one.
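The inflation is easy to see in simulation. The sketch below runs A/A tests (both arms share the same true rate, so every significant result is a false positive) and compares two policies: stopping at the first daily peek that crosses p < 0.05, and looking once at the end. The traffic level of 100 visitors per arm per day, the 10% rate, and the number of simulated experiments are all assumptions chosen to keep the example fast.

```python
import random
from math import sqrt
from statistics import NormalDist

Z_CRIT = NormalDist().inv_cdf(0.975)  # two-sided 5% threshold


def is_significant(conv_c, conv_t, n):
    """Pooled two-proportion z-test with n visitors in each arm."""
    pooled = (conv_c + conv_t) / (2 * n)
    se = sqrt(pooled * (1 - pooled) * 2 / n)
    if se == 0:  # no conversions (or all conversions) so far: nothing to test
        return False
    return abs(conv_t - conv_c) / n / se > Z_CRIT


def simulate(n_experiments=1000, days=14, daily_n=100, rate=0.10, seed=0):
    """Simulate A/A tests and return the share flagged significant
    (a) at any daily peek and (b) only at the final look."""
    rng = random.Random(seed)
    peeking_hits = final_hits = 0
    for _ in range(n_experiments):
        conv_c = conv_t = 0
        flagged_any_day = False
        for day in range(1, days + 1):
            conv_c += sum(rng.random() < rate for _ in range(daily_n))
            conv_t += sum(rng.random() < rate for _ in range(daily_n))
            significant = is_significant(conv_c, conv_t, day * daily_n)
            flagged_any_day = flagged_any_day or significant
        peeking_hits += flagged_any_day
        final_hits += significant  # result of the last look only
    return peeking_hits / n_experiments, final_hits / n_experiments


peeking_rate, fixed_rate = simulate()
print(f"stop at first significant peek: {peeking_rate:.1%}")
print(f"single look at the end:         {fixed_rate:.1%}")
```

In runs of this kind the peeking rate typically lands several times above the nominal 5%, while the single-look rate stays near it. The exact figures depend on the seed and the assumed traffic.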

The fix is not willpower. It is fixing the sample size in advance and committing not to look at significance until it is reached. If the business cannot tolerate waiting, use a sequential testing method (such as always-valid p-values or Bayesian sequential designs) that explicitly accounts for repeated looks — but this is a separate decision with its own tradeoffs, not a free lunch.

The impulse to peek comes from the same anxiety that leads teams to run small tests in the first place: impatience. Both behaviors damage the experiment’s reliability in the same direction.

Significant is not the same as important

Here is the flip side. With enough traffic, a test that runs for months will detect effects you should not care about. Take a 0.02 percentage-point improvement in a 10% checkout rate at a company doing ten million visits per month, split evenly between arms: after about six months the test reaches z ≈ 2.6, p ≈ 0.01. The statistics are impeccable. The business relevance is questionable.

This is the distinction between statistical significance (the sample size was large enough to resolve the difference from zero) and practical significance (the effect is large enough to matter for the business). They are orthogonal. An effect can be:

Statistical result	Business result	Interpretation
Significant	Large lift	Ship it immediately
Significant	Tiny lift	Decide if implementation cost justifies it
Not significant	—	Test was underpowered, or effect is small
Not significant	—	True null; design genuinely didn’t help

The bottom two rows are indistinguishable without knowing sample size and MDE. That is the core problem. When a result is not significant, you cannot tell “this variant is neutral” from “this test was too small to know” unless you computed the required sample size beforehand.

Always report the effect size and confidence interval alongside the p-value. A confidence interval for the lift that spans +0.2 to +5.8 points is a different statement than one that spans 2.5 to 3.5 points, even though both exclude zero and both count as significant.
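A confidence interval for the lift takes only a few more lines than the p-value. The sketch below computes a Wald interval for the absolute difference in rates, using the unpooled standard error; other interval methods exist and behave better at very small counts, so this is one reasonable default rather than the only choice. The function name `lift_confidence_interval` is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def lift_confidence_interval(conv_c, n_c, conv_t, n_t, level=0.95):
    """Wald confidence interval for the absolute lift p_t - p_c.
    Returns (lift, lower, upper) as proportions."""
    p_c, p_t = conv_c / n_c, conv_t / n_t
    se = sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)  # unpooled SE
    z = NormalDist().inv_cdf(0.5 + level / 2)
    lift = p_t - p_c
    return lift, lift - z * se, lift + z * se


for label, args in [("n=50 per arm", (5, 50, 7, 50)),
                    ("n=1,000 per arm", (100, 1000, 130, 1000))]:
    lift, low, high = lift_confidence_interval(*args)
    print(f"{label}: lift = {lift * 100:+.1f} points, "
          f"95% CI = [{low * 100:+.1f}, {high * 100:+.1f}]")
```

For the checkout example, the 1,000-per-arm test gives an interval of roughly +0.2 to +5.8 points: significant, but compatible with anything from a negligible lift to a very large one. The 50-per-arm test gives an interval that runs from well below zero to well above it, which says far more about that experiment than its p-value does.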

How this actually plays out in industry

Product teams at high-traffic consumer companies — the ones running hundreds of experiments per year — have largely solved this by industrializing the sample-size calculation. The experiment creation UI forces you to specify an MDE, computes the required runtime based on current traffic, and locks significance checks until the runtime is complete. Peeking is a UI affordance that has been deliberately removed.

At lower-traffic companies, this discipline is harder because the required runtime for a properly powered test might be six weeks or three months. The temptation is to accept a weaker test — lower power, larger MDE — to get answers faster. That is a legitimate tradeoff, but it must be explicit. If your test only has 40% power at a 3-point MDE, a negative result carries almost no information. You should say so in your experiment write-up, not paper over it.
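One way to make that tradeoff explicit is to invert the sample-size calculation: fix the traffic you can afford and ask what lift the test could detect. The sketch below uses a simplified approximation that applies the baseline variance to both arms, so it slightly understates the MDE for large lifts. The weekly traffic figure is hypothetical and the name `approx_mde` is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def approx_mde(baseline, n_per_arm, alpha=0.05, power=0.80):
    """Approximate absolute lift detectable with the given power,
    using the baseline variance for both arms (a simplification)."""
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - alpha / 2)
    z_power = norm.inv_cdf(power)
    return (z_alpha + z_power) * sqrt(2 * baseline * (1 - baseline) / n_per_arm)


weekly_visitors_per_arm = 250  # hypothetical traffic level, for illustration only
for weeks in (2, 4, 8, 12):
    n = weeks * weekly_visitors_per_arm
    mde_points = approx_mde(0.10, n) * 100
    print(f"{weeks:>2} weeks (n={n} per arm): MDE is about {mde_points:.1f} points")
```

Putting a table like this in the experiment proposal turns "we can only run for four weeks" into a concrete statement about which effects the test can and cannot see.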

The other common mistake is running too many simultaneous experiments on overlapping user populations. Each test that shares traffic with another introduces interaction effects that neither test accounts for. The statistics on each individual test look fine; the conclusions may be correlated in ways that don’t show up until a variant that “won” in testing performs worse in production. Mutual exclusion layers in experimentation platforms exist precisely to prevent this.

All three decisions (MDE, power, and alpha) happen before data collection starts. Reversing the order, running first and then asking whether the test was big enough, is the root cause of most underpowered experiments.

What good experiment hygiene looks like

A clean experiment write-up, even an informal one shared in Slack, should answer five questions:

1. What is the baseline metric and its current value?
2. What is the MDE (the minimum lift we would act on)?
3. What sample size was required, and was the test fully powered when stopped?
4. What was the observed effect size and its 95% confidence interval?
5. Is the effect practically significant given implementation cost?

Question 3 is the one that almost never gets answered. It gets replaced by a p-value without context, and the reader is left to guess whether the test had enough power to learn anything useful from a null result. That ambiguity gets propagated into decisions — designs get killed that deserved a bigger test, and small lifts get celebrated because statistical significance was treated as the end of the analysis rather than one input to it.

The statistics are not the hard part. The hard part is making the business commit to a minimum effect size that actually reflects the economics of the decision — before anyone has seen the data, when anchoring bias is strongest and the impulse to “just run a quick test” is hardest to resist.

The shape of what you’re actually estimating

When you run an A/B test, you are not measuring whether the treatment “works.” You are estimating a number — the true difference in conversion rates — with some measurement uncertainty. The p-value tells you whether that number is distinguishable from zero given your noise floor. The confidence interval tells you the plausible range of the true number. The power calculation tells you whether your noise floor was low enough to see the effect you cared about.

These three things tell a coherent story about the quality of an experiment. A significant result from a powerful test is strong evidence. A non-significant result from an underpowered test is not evidence of anything — it is silence. Treating silence as evidence against the hypothesis is an error that kills good product decisions every day.

The lesson is not “get more traffic.” Sometimes traffic is genuinely limited and the minimum detectable effect just has to be larger. The lesson is: know your noise floor before you start. The error bars exist whether you calculate them or not. Understanding them in advance is the difference between an experiment and a guess.