A/B testing in practice: sample size, p-values, and the traps

Every product team has a graveyard of A/B tests that “showed no effect” — and a suspiciously large share of those results are meaningless noise dressed up as evidence. The machinery of a controlled experiment is deceptively simple: randomize users into groups, change one thing for one group, measure the difference, and ask whether that difference is bigger than chance alone could produce. When it works, it is the closest thing social science has to a controlled laboratory. When it is done sloppily, it produces confident-sounding numbers that point in completely wrong directions.

This post walks through the statistics you need to understand, the sample-size math you must do before you start, and the six traps that corrupt results even when the engineering is perfect. The business analytics fundamentals are the foundation; this is where they get stress-tested.

The experiment frame: one change, two groups, one question

A randomized controlled experiment assigns each incoming unit — a user, a session, an order — to exactly one condition before any outcome is observed. The control group sees the current experience. The treatment group sees the change. Because assignment is random, every other factor that might influence the outcome (time of day, device type, geography) is, in expectation, balanced across groups.

That balance is the whole point. Without randomization, you are doing observational analysis, and every difference between groups is a confound suspect. With randomization, the only systematic difference between groups is the thing you changed. The question then becomes purely statistical: is the observed gap in outcomes larger than what random assignment alone could produce?
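
One common way to get stable random assignment is to hash the unit ID together with an experiment name. The sketch below assumes string user IDs and a hypothetical experiment name; `assign_variant` is an illustrative helper, not any particular platform's API.

```python
import hashlib

def assign_variant(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a unit to 'control' or 'treatment'."""
    # Salting with the experiment name keeps assignments independent across experiments.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    # Read the first 8 hex digits as an integer in [0, 2**32), then scale to [0, 1).
    bucket = int(digest[:8], 16) / 2**32
    return "treatment" if bucket < treatment_share else "control"

counts = {"control": 0, "treatment": 0}
for i in range(10_000):
    counts[assign_variant(f"user-{i}", "checkout-button-test")] += 1
print(counts)
```

Because the assignment depends only on the ID and the experiment name, a returning user always lands in the same arm, and the split should come out close to 50/50.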

What a p-value actually means — and what it does not

This is the most mis-stated concept in applied statistics, and the misconception costs teams real decisions.

The null hypothesis is “the treatment has zero effect.” The p-value is the probability of seeing a difference at least as extreme as the one you observed if that null were true. When p < 0.05, you are saying the data would be unlikely under the null, so you reject it, accepting a 5% false-positive rate (alpha = 0.05). You are not saying anything directly about the probability that the effect is real, or large, or worth shipping. For those questions you need effect-size estimates and confidence intervals, which we get to shortly.

The distinction matters enormously in practice. Teams that read p-values as “probability of being right” ship trivial effects confidently and dismiss non-significant results that may actually matter. See averages that lie for another domain where summary statistics mislead in exactly this way.
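
To make the definition concrete, here is a minimal sketch of a two-sided two-proportion z-test using only the standard library. The helper name `two_proportion_p_value` and the conversion counts are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_p_value(conv_c: int, n_c: int, conv_t: int, n_t: int) -> float:
    """Two-sided p-value for H0: both arms share the same conversion rate."""
    p_c, p_t = conv_c / n_c, conv_t / n_t
    # Under the null both arms share one rate, so the standard error uses the pooled rate.
    p_pool = (conv_c + conv_t) / (n_c + n_t)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
    z = (p_t - p_c) / se
    # Probability of a z-score at least this extreme in either direction.
    return 2 * (1 - NormalDist().cdf(abs(z)))

p = two_proportion_p_value(conv_c=400, n_c=4000, conv_t=460, n_t=4000)
print(f"p-value: {p:.4f}")
```

The number it prints answers one question only: how surprising is this gap if the treatment did nothing? It says nothing about how large the effect is.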

Statistical power, MDE, and why sample size is computed before the test

Every A/B test is characterized by four numbers that are mathematically linked. For a given baseline rate, fix any three and the fourth is determined:

alpha — the false-positive rate (Type I error). Conventionally 0.05.
1 - beta — statistical power, the probability of detecting a real effect when one exists. Conventionally 0.80 or 0.90.
MDE — the minimum detectable effect: the smallest true difference you care about finding.
n — sample size per group.

The required sample size for a two-proportion test is approximately:

n ≈ 2 × (z_alpha/2 + z_beta)² × p̄(1 − p̄) / MDE²

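where p̄ is the pooled conversion rate (the average of the baseline rate and the rate you expect under treatment), MDE is an absolute difference in rates, and z_alpha/2, z_beta are the standard-normal quantiles for your chosen error rates. For alpha = 0.05 (two-tailed) and 1 - beta = 0.80, the coefficient (z_alpha/2 + z_beta)² is approximately 7.85.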

The critical insight is the 1 / MDE² term. Sample size scales with the square of the inverse of the effect you are trying to detect. If you want to detect a 1 pp lift instead of a 2 pp lift, you need four times as many observations — not twice as many. A realistic worked example:

import math

from scipy.stats import norm

baseline = 0.10   # 10% baseline conversion rate
mde      = 0.02   # want to detect a 2 pp absolute lift
alpha    = 0.05   # two-sided false-positive rate
power    = 0.80   # 1 - beta

z_a = norm.ppf(1 - alpha / 2)   # ~1.96
z_b = norm.ppf(power)           # ~0.84

# Pooled rate: midpoint of the baseline and the expected treatment rate.
p_bar = baseline + mde / 2
n = 2 * (z_a + z_b)**2 * p_bar * (1 - p_bar) / mde**2
print(f"Required n per arm: {math.ceil(n):,}")
# Required n per arm: 3,843

At 500 daily visitors split 50/50, that test needs about 7,700 users in total, so at least 16 days to reach sample. If you halve the MDE to 1 pp, you need roughly 15,000 per arm, or about 60 days. Power analysis done in advance tells you whether the test is feasible before you spend a month collecting data. Skipping it is the primary reason experiments are called “inconclusive” when the truth is “we were never going to see it.”
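
The feasibility check can be sketched as a small table of MDE against required runtime. This version uses the standard library's `NormalDist` instead of scipy; the helper names and the 500 visitors/day figure are assumptions carried over from the example above.

```python
import math
from statistics import NormalDist

def required_n_per_arm(baseline: float, mde: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate per-arm sample size for a two-proportion test (absolute MDE)."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    p_bar = baseline + mde / 2  # midpoint of baseline and expected treatment rate
    n = 2 * (z_a + z_b) ** 2 * p_bar * (1 - p_bar) / mde ** 2
    return math.ceil(n)

def days_to_run(n_per_arm: int, daily_visitors: int, arms: int = 2) -> int:
    """Whole days needed to fill every arm at the given total daily traffic."""
    return math.ceil(arms * n_per_arm / daily_visitors)

for mde in (0.03, 0.02, 0.01, 0.005):
    n = required_n_per_arm(baseline=0.10, mde=mde)
    print(f"MDE {mde:.3f}: {n:>7,} per arm, {days_to_run(n, 500):>4} days at 500 visitors/day")
```

Halving the MDE roughly quadruples the sample. The ratio comes out slightly under 4x here because p̄ shifts along with the MDE.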

Read the dedicated deep-dive on A/B testing and sample size for more on how to set realistic MDEs from business context rather than statistical convenience.

Diagram 1: control vs variant distributions with alpha and power regions

The null distribution (grey) and the alternative distribution (blue) overlap. The shaded red right tail is the false-positive rate α. The shaded blue region to the right of the critical value is power (1−β). A smaller MDE moves the distributions closer together, shrinking power — which is why you need more data to detect smaller effects.

Confidence intervals: the thing that matters more than “significant”

A confidence interval gives you two things a p-value cannot: the estimated magnitude of the effect and the uncertainty around it. A 95% CI of [+0.1%, +2.9%] tells you the lift could be negligible; a 95% CI of [+1.8%, +8.2%] tells you it is probably meaningful. Both can be “statistically significant.” Only the second is a clear case for shipping.

The correct interpretation of a 95% CI is: if you ran this experiment many times under identical conditions and computed a CI each time, 95% of those intervals would contain the true effect. It is not “there is a 95% probability the true effect is in this interval” (that is the Bayesian credible-interval framing, which requires a prior). The practical takeaway is the same: a narrow CI that excludes zero is your goal, not a small p-value.
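
As a rough illustration, a Wald interval for the absolute lift can be computed in a few lines. This is a sketch under the normal approximation, with made-up counts; `diff_ci` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist

def diff_ci(conv_c: int, n_c: int, conv_t: int, n_t: int, confidence: float = 0.95):
    """Wald interval for the absolute lift (treatment rate minus control rate)."""
    p_c, p_t = conv_c / n_c, conv_t / n_t
    # Unpooled standard error: each arm contributes its own variance.
    se = sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    lift = p_t - p_c
    return lift - z * se, lift + z * se

low, high = diff_ci(400, 4000, 460, 4000)
print(f"95% CI for lift: [{low * 100:+.2f} pp, {high * 100:+.2f} pp]")
```

With these counts the interval excludes zero, but its lower end sits close to it, which is exactly the “significant but possibly negligible” case described above.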

The six traps that corrupt real experiments

Trap 1: peeking and optional stopping

This is the most common and the most dangerous mistake. You launch a test, check the dashboard every morning, and stop when p < 0.05. This procedure does not hold a 5% false-positive rate. It inflates it dramatically.

The reason is subtle but important. A p-value computed at a fixed sample size has the correct false-positive rate by construction. Computed at a data-dependent stopping point, it does not — you are selecting the most extreme random fluctuation from many looks, which biases the result upward. The diagram below shows this directly.

Diagram 2: cumulative p-value wandering under peeking

Three simulated null experiments (no real effect). Two of the three p-value trajectories wander below α = 0.05 at some peek, producing false positives. A pre-registered fixed sample size prevents you from stopping at those moments.

The fix is to pre-register the final sample size and never act on intermediate results. If you need early stopping for practical reasons, use sequential testing methods (e.g., the sequential probability ratio test or its mixture variant, mSPRT) or always-valid p-values that maintain error-rate guarantees across repeated looks. Bayesian approaches with pre-specified stopping rules are another option, though they require choosing a prior. The key word is “pre-specified”: the fix is not a different formula, it is a commitment made before the data arrives.
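
You can see the inflation for yourself with a small A/A simulation. The sketch below assumes ten evenly spaced looks and a 10% conversion rate in both arms; all parameters are illustrative, and the exact rates depend on the random seed.

```python
import random
from math import sqrt
from statistics import NormalDist

def p_value(conv_a: int, conv_b: int, n: int) -> float:
    """Two-sided two-proportion z-test with equal arm sizes n."""
    p_pool = (conv_a + conv_b) / (2 * n)
    if p_pool in (0.0, 1.0):
        return 1.0
    se = sqrt(p_pool * (1 - p_pool) * 2 / n)
    z = (conv_b - conv_a) / n / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def simulate(n_experiments=500, looks=10, users_per_look=200, rate=0.10, alpha=0.05, seed=7):
    rng = random.Random(seed)
    peeking_hits = fixed_hits = 0
    for _experiment in range(n_experiments):
        conv_a = conv_b = n = 0
        crossed = False
        for _look in range(looks):
            # Both arms draw from the same rate: an A/A test with no true effect.
            conv_a += sum(rng.random() < rate for _ in range(users_per_look))
            conv_b += sum(rng.random() < rate for _ in range(users_per_look))
            n += users_per_look
            p = p_value(conv_a, conv_b, n)
            crossed = crossed or p < alpha  # "stop when significant" at any look
        peeking_hits += crossed
        fixed_hits += p < alpha  # only the final, pre-registered look counts
    return peeking_hits / n_experiments, fixed_hits / n_experiments

peek_rate, fixed_rate = simulate()
print(f"False-positive rate with peeking: {peek_rate:.1%}")
print(f"False-positive rate at fixed n:   {fixed_rate:.1%}")
```

The fixed-sample rate should hover near the nominal 5%, while the peeking rate lands well above it even though no experiment has a real effect.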

Trap 2: running until significant

A close cousin of peeking. You declare in advance you will run “until the test is significant.” This is optional stopping with no bound, and the false-positive rate converges to 1.0 as the number of peeks grows without limit. Even a perfectly null experiment will eventually produce a p < 0.05 reading if you wait long enough — and then you stop.

Trap 3: multiple comparisons

You measure conversion, revenue per user, bounce rate, session length, and three secondary engagement metrics across two variants plus two sub-variants. Seven metrics across the two main variants is already 14 simultaneous tests, before counting the sub-variants. At alpha = 0.05 each, the probability of at least one false positive across 14 independent tests is 1 - (0.95)^14 ≈ 51%. You are more likely than not to find something “significant” by chance.

The standard corrections are Bonferroni (divide alpha by the number of tests — conservative but simple) or Benjamini-Hochberg (controls the false discovery rate — better for exploratory analysis). The better fix is to pre-register exactly one primary metric and treat everything else as exploratory only.
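
Both corrections fit in a few lines of plain Python. The p-values below are invented for illustration, and the function names are hypothetical; treat this as a sketch of the two procedures rather than a replacement for a tested statistics library.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H0 for each p-value at or below alpha / k."""
    k = len(p_values)
    return [p <= alpha / k for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Reject the hypotheses selected by the BH step-up procedure."""
    k = len(p_values)
    order = sorted(range(k), key=lambda i: p_values[i])
    # Find the largest rank r whose sorted p-value satisfies p_(r) <= (r / k) * alpha.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / k * alpha:
            cutoff_rank = rank
    # Everything ranked at or below the cutoff is rejected.
    rejected = [False] * k
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected

p_values = [0.001, 0.012, 0.021, 0.04, 0.30, 0.62, 0.81]
fwer_14 = 1 - 0.95 ** 14  # chance of at least one false positive in 14 independent tests
print(f"Uncorrected family-wise error rate, 14 tests: {fwer_14:.1%}")
print("Bonferroni:        ", bonferroni(p_values))
print("Benjamini-Hochberg:", benjamini_hochberg(p_values))
```

On this input Bonferroni keeps only the smallest p-value, while Benjamini-Hochberg keeps the three smallest, which shows why the latter suits exploratory work where some false discoveries are tolerable.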

Trap 4: Simpson’s paradox when pooling segments

A variant wins on overall conversion but loses in both mobile and desktop sub-segments. This is not a contradiction — it is Simpson’s paradox, and it happens when traffic allocation is unbalanced across segments that have different baseline rates. If your variant saw proportionally more desktop users (who convert at 15%) than mobile users (who convert at 5%), the overall pooled rate can look higher even if the variant underperforms within every segment.

Check your results stratified by major dimensions — device, new vs returning, geography — before drawing conclusions. Cohort analysis alongside experiment analysis is covered in cohort retention analysis.
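
A tiny numeric example makes the reversal easier to believe. The counts below are hypothetical, chosen so the variant's traffic skews heavily toward desktop.

```python
# Hypothetical (conversions, users) per arm and segment.
data = {
    "control": {"mobile": (40, 800), "desktop": (30, 200)},
    "variant": {"mobile": (9, 200), "desktop": (112, 800)},
}

def rate(conversions: int, users: int) -> float:
    return conversions / users

def pooled(arm: str) -> float:
    """Overall conversion rate for an arm, ignoring segments."""
    conversions = sum(c for c, _ in data[arm].values())
    users = sum(u for _, u in data[arm].values())
    return conversions / users

for segment in ("mobile", "desktop"):
    c = rate(*data["control"][segment])
    v = rate(*data["variant"][segment])
    print(f"{segment:>8}: control {c:.1%}, variant {v:.1%}")

print(f"  pooled: control {pooled('control'):.1%}, variant {pooled('variant'):.1%}")
```

The variant loses in both segments yet wins on the pooled rate, purely because of the segment mix. A mix this lopsided between arms is itself a sign that assignment was not balanced.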

Trap 5: novelty and primacy effects

New features attract curious users who engage more than they will in steady state (novelty effect). Existing users sometimes resist changes and under-engage initially (primacy effect). Novelty inflates the observed lift during the first few days and primacy deflates it. Running for at least one to two full business cycles (typically two weeks minimum for most consumer products) helps the effect stabilize.

Trap 6: sample ratio mismatch (SRM)

If you assign 50% of traffic to each arm but observe 48% control and 52% treatment over thousands of users, something is wrong upstream: a bot filter, a caching layer, a JavaScript error, or an allocation bug. This is a sample ratio mismatch. It means the two groups are no longer comparable, and any result, positive or negative, should be discarded until the root cause is fixed.

SRM is a data-integrity check that costs nothing to run. Compare expected vs observed assignment counts with a simple chi-square test before you look at any outcome metrics. Check A/B testing in business contexts for tooling examples.
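
For a two-arm test, the chi-square check needs nothing beyond the standard library. The sketch below assumes a 50/50 intended split and flags SRM at p < 0.001, a strict threshold that is a common convention rather than a rule from this post; `srm_p_value` is a hypothetical helper name.

```python
from math import erfc, sqrt

def srm_p_value(n_control: int, n_treatment: int, expected_treatment_share: float = 0.5) -> float:
    """Chi-square goodness-of-fit p-value for a two-arm split (1 degree of freedom)."""
    total = n_control + n_treatment
    exp_t = total * expected_treatment_share
    exp_c = total - exp_t
    chi2 = (n_control - exp_c) ** 2 / exp_c + (n_treatment - exp_t) ** 2 / exp_t
    # With 1 degree of freedom, P(chi-square > x) equals erfc(sqrt(x / 2)).
    return erfc(sqrt(chi2 / 2))

for n_c, n_t in [(48, 52), (4_800, 5_200)]:
    p = srm_p_value(n_c, n_t)
    flag = "investigate" if p < 0.001 else "ok"
    print(f"{n_c:>5} vs {n_t:>5}: p = {p:.5f} -> {flag}")
```

The same 48/52 ratio is unremarkable with 100 users and a clear red flag with 10,000, which is why the check should be a test on counts rather than a glance at percentages.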

Practical checklist before you launch

Pre-registration discipline eliminates most of the traps above in one step:

Define the primary metric (one) and the guardrail metrics (a short list of things you must not break — revenue, error rate, latency).
Compute required sample size using your baseline rate, desired MDE, alpha = 0.05, and 1 - beta = 0.80 (or 0.90 for high-stakes decisions).
Commit to a fixed end date based on traffic forecasts. Add a calendar reminder; do not check results before it.
Run for full business cycles — at minimum two weeks to wash out weekday/weekend variation.
Check for SRM from day one, and again before you read any outcome metrics.
Report the CI and effect size, not just the p-value. Ask: is the lower bound of the CI commercially meaningful?
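
One lightweight way to make the checklist binding is to write the plan down as an immutable record before launch. The sketch below is only an illustration: the class, field names, metric names, and dates are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class ExperimentPlan:
    name: str
    primary_metric: str
    guardrail_metrics: tuple
    baseline_rate: float
    mde: float
    n_per_arm: int
    start: date
    end: date
    alpha: float = 0.05
    power: float = 0.80

    def validate(self) -> list:
        """Return a list of checklist violations (empty means the plan passes)."""
        problems = []
        if (self.end - self.start) < timedelta(days=14):
            problems.append("runs for less than two weeks")
        if not self.guardrail_metrics:
            problems.append("no guardrail metrics defined")
        if self.mde <= 0 or self.n_per_arm <= 0:
            problems.append("MDE and sample size must be positive")
        return problems

plan = ExperimentPlan(
    name="checkout-button-test",
    primary_metric="conversion",
    guardrail_metrics=("revenue_per_user", "error_rate", "p95_latency"),
    baseline_rate=0.10, mde=0.02, n_per_arm=3843,
    start=date(2024, 3, 4), end=date(2024, 3, 20),
)
print(plan.validate() or "plan passes the checklist")
```

Freezing the record is a small nudge, not an enforcement mechanism, but it makes any mid-test change to the metric or the end date an explicit, visible act.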

For more on interpreting these results in a business context, the glossary has concise definitions of power, MDE, and Type I vs Type II error, and the interview prep guide covers the experiment design questions you will encounter in data roles.

The decision frame that actually matters

Statistical significance is a gate, not a goal. A p < 0.05 result with a 95% CI of [+0.02%, +0.18%] on conversion is statistically significant, and probably not worth the engineering cost to ship and maintain. A p = 0.07 result with a 95% CI of [−0.1%, +3.9%] contains a plausible large upside and is probably worth a longer, better-powered replication.

The questions to ask after every experiment:

Is the effect size commercially meaningful even at the low end of the CI?
Are the guardrail metrics clean?
Did the SRM check pass?
Does the effect hold within major segments, or is it driven entirely by one slice?

Answering all four honestly is harder than checking a p-value. It is also what separates teams that learn from experiments from teams that generate confident-sounding noise.

Frequently asked questions

What is the difference between statistical significance and practical significance?

Statistical significance (p < alpha) tells you the observed effect is unlikely under the null hypothesis. Practical significance asks whether the effect is large enough to matter for the business. A test with enough traffic can return p = 0.0001 for a 0.01 pp lift: statistically significant, not practically meaningful. Always pair the p-value with the confidence interval and a business judgment about the minimum effect worth acting on.

How long should an A/B test run?

Long enough to reach the pre-computed sample size, and at minimum long enough to cover one to two full business cycles (typically two weeks). Stopping earlier to capture a “hot” result is the peeking trap. Stopping later to find significance is optional stopping. The duration should be set before the test starts, based entirely on traffic forecasts and the required sample size.

Why does multiple testing inflate false positives, and how do I correct for it?

Each test independently has an alpha probability of a false positive. Running k tests compounds the chance that at least one is spurious: 1 - (1 - alpha)^k. For 10 tests at alpha = 0.05 that is a 40% chance of at least one false positive. Pre-register a single primary metric to avoid the problem entirely. If you must test multiple metrics, apply a Bonferroni correction (use alpha / k) or the Benjamini-Hochberg procedure for exploratory analysis.

What is a sample ratio mismatch and why does it invalidate results?

A sample ratio mismatch occurs when the observed split between control and treatment differs from the intended allocation. If you intend 50/50 but observe 48/52, some systematic difference in who ends up in each group has been introduced — bot filtering, error pages that redirect only one group, or a logging bug. Because the groups are no longer exchangeable, any measured outcome difference is confounded. Run a chi-square test on assignment counts at the start of the experiment; discard results if SRM is detected until the source is identified and fixed.