A/B Testing — Did That Change Actually Work?
The new checkout button converted 13% of visitors versus the old one's 10%. Did the new button really win, or did we just get lucky with 1,000 visitors each? A/B testing gives a rigorous answer.
What you'll learn
- What an A/B test is and why randomization makes the comparison fair
- Why random noise means a gap could be luck — and how statistics separates signal from noise
- What a p-value actually is (and the two things it is NOT)
- Why sample size determines whether a real difference is detectable
- Why statistical significance is not the same as business significance
Before you start
Your design team shipped a new checkout button. Over two weeks you collected data: 1,000 visitors saw the old button and 100 of them completed a purchase (10%). Another 1,000 visitors saw the new button and 130 of them completed a purchase (13%). The new button looks better, but is it actually better, or did randomness hand you a flattering result by chance?
That question is what A/B testing was built to answer.
What an A/B test is
An A/B test is a controlled experiment: you randomly split your users into two groups, show each group exactly one version of whatever you are testing, and then compare a metric across the groups.
- Group A — the control — sees the current version (the old button). “Control” means the baseline you are measuring against.
- Group B — the treatment — sees the new version (the new button). “Treatment” is the change you are evaluating.
- The metric is the number you care about. Here it is conversion rate — the share of visitors who complete a purchase. Group A converted 100 out of 1,000 visitors, so its conversion rate is 10%. Group B converted 130 out of 1,000, so its conversion rate is 13%.
The critical ingredient is randomization: assigning visitors to groups at random. Because neither the visitors nor the team chose who saw which button, the two groups are, on average, alike on every other dimension (age, device, location, time of day), up to chance variation. The only systematic difference between them is which button they saw. That is what makes the comparison fair, and it is the property that distinguishes an A/B test from simply comparing Monday traffic to Tuesday traffic.
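To make the balancing effect of randomization concrete, here is a minimal Python sketch. The visitor records, the device mix, and the helper name mobile_share are all hypothetical; the only point is that a coin flip that ignores everything about the visitor tends to produce groups with a similar makeup.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical visitors: each has a device type that could influence conversion.
visitors = [{"id": i, "device": rng.choice(["mobile", "mobile", "desktop"])}
            for i in range(2000)]

# Randomization: a coin flip that is independent of anything about the visitor.
groups = {"A": [], "B": []}
for v in visitors:
    groups[rng.choice(["A", "B"])].append(v)

def mobile_share(group):
    """Fraction of a group's visitors who are on mobile."""
    return sum(v["device"] == "mobile" for v in group) / len(group)

share_a = mobile_share(groups["A"])
share_b = mobile_share(groups["B"])
print(f"A: {len(groups['A'])} visitors, {share_a:.1%} mobile")
print(f"B: {len(groups['B'])} visitors, {share_b:.1%} mobile")
```

A plain coin flip does not guarantee exactly 1,000 visitors per group, and the two mobile shares will differ slightly. That leftover difference is the chance variation the rest of the lesson deals with.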
The problem of noise
Even if the two buttons were completely identical, you would almost never observe exactly 10% in both groups. Random sampling produces variation. On any given day, Group A might land at 9.4% and Group B at 10.7% purely by chance — no difference in the buttons, just the luck of which visitors happened to show up.
So when you observe 10% versus 13%, you face a fundamental ambiguity: is the 3-point gap a real effect of the new button, or is it within the normal range of random variation you would expect even with identical buttons?
Statistics separates signal from noise. The logic runs like this: if A and B were truly identical, how often would random chance alone produce a gap as large as 3 percentage points (or larger) on samples of 1,000 each? If that happens very rarely — say, less than 5% of the time — then the gap is unlikely to be pure luck, and we credit the new button.
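The "how often would luck alone do this?" question can be answered by brute force before any formula is introduced. The sketch below simulates many experiments in which both buttons are truly identical and counts how often a gap of 3 points or more appears. It assumes a shared true rate of 11.5% (the pooled rate, 230 conversions out of 2,000 visitors); that choice, the seed, and the number of trials are assumptions made for illustration.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

N = 1000             # visitors per group, as in the example
SHARED_RATE = 0.115  # assumption: both buttons truly convert at the pooled rate
TRIALS = 2000        # number of simulated "identical button" experiments

def simulate_conversions(n, rate):
    """Count how many of n simulated visitors convert at the given true rate."""
    return sum(rng.random() < rate for _ in range(n))

extreme = 0
for _ in range(TRIALS):
    conv_a = simulate_conversions(N, SHARED_RATE)
    conv_b = simulate_conversions(N, SHARED_RATE)
    # A 3-point gap on 1,000 visitors is 30 conversions, in either direction.
    if abs(conv_b - conv_a) >= 30:
        extreme += 1

fraction = extreme / TRIALS
print(f"gap of 3+ points by luck alone: {fraction:.1%} of simulated tests")
```

The fraction should come out at a few percent. The p-value in the next section is the same quantity, computed with a formula instead of a simulation.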
Statistical significance and the p-value
The tool that formalizes this reasoning is the p-value — the probability of seeing a gap this big or bigger, purely by chance, if there were truly no difference between A and B.
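A small p-value means: “luck alone would rarely produce a gap this large.” Conventionally, if the p-value is below 0.05 (5%), we call the result statistically significant, meaning the evidence against the “it’s all luck” explanation is strong enough that we reject it and credit the treatment. Two things the p-value is not: it is not the probability that the result was due to luck (it is computed by assuming luck is the only explanation), and it is not a measure of how large or how valuable the effect is.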
The playground below computes everything from the raw numbers. Run it and read the output.
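A minimal sketch of what such a computation could look like is given here. It assumes a pooled two-proportion z-test with a two-sided p-value, an assumption that is consistent with the z-score and p-value listed in the output. The assignments nA, cA = 1000, 100 and nB, cB = 1000, 130 are the ones this lesson refers to; the function name z_test_two_proportions is hypothetical.

```python
from math import erfc, sqrt

def z_test_two_proportions(n_a, c_a, n_b, c_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p_a, p_b = c_a / n_a, c_b / n_b
    pooled = (c_a + c_b) / (n_a + n_b)  # shared rate under "no difference"
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))  # standard error of the gap
    z = (p_b - p_a) / se
    p_value = erfc(abs(z) / sqrt(2))  # two-sided tail area of the normal curve
    return z, p_value

nA, cA = 1000, 100  # control: visitors, conversions
nB, cB = 1000, 130  # treatment: visitors, conversions

pA, pB = cA / nA, cB / nB
z, p = z_test_two_proportions(nA, cA, nB, cB)

print(f"A (old) conversion : {pA:.1%}")
print(f"B (new) conversion : {pB:.1%}")
print(f"absolute lift : {(pB - pA) * 100:+.0f} points")
print(f"relative lift : {(pB - pA) / pA:+.0%}")
print(f"z-score : {z:.2f}")
print(f"p-value : {p:.4f}")
print(f"significant at 0.05: {p < 0.05}")
```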
The output:
A (old) conversion : 10.0%
B (new) conversion : 13.0%
absolute lift : +3 points
relative lift : +30%
z-score : 2.10
p-value : 0.0355
significant at 0.05: True
Breaking it down:
- Absolute lift is the raw difference in conversion rates: 13% minus 10% equals +3 percentage points.
- Relative lift expresses that gain as a share of the baseline: 3 divided by 10 equals +30%. This is the number marketing tends to quote (“30% uplift!”) — but the absolute lift is what determines how many additional customers you actually get.
- Z-score (2.10) is the gap measured in units of the standard error, the typical size of the random variation you would expect in samples of this size. A z-score beyond about ±1.96 corresponds to p below 0.05 for a two-sided test.
- P-value (0.0355) is below 0.05, so the result is statistically significant. If the buttons were identical, a gap of 3 points or more in either direction would occur by luck alone only about 3.6% of the time. That is rare enough to credit the new button.
B converts 13% versus A’s 10% — a +3-point (+30% relative) lift. With a p-value of 0.0355, the result clears the 0.05 threshold. The new button likely does convert better.
Why sample size matters
Notice that the test had 1,000 visitors per group, and even then the result cleared the threshold only narrowly. The denominator of the z-score is the standard error, the typical size of random fluctuation. Larger samples shrink the standard error, so the same observed gap produces a larger z-score and a smaller p-value.
Try editing the playground: change nA, cA = 1000, 100 to nA, cA = 50, 5 and nB, cB = 1000, 130 to nB, cB = 50, 7 (roughly the same rates, 10% vs 14%, on 50 visitors each). The p-value jumps well above 0.05, to about 0.54: a lift in the same direction is no longer detectable because the sample is too small for the noise to settle.
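The same comparison can be sketched in a few lines of Python. This assumes a pooled two-proportion z-test with a two-sided p-value; the function name two_sided_p_value is hypothetical, and the 200-visitor row is an invented intermediate case added only to show the trend.

```python
from math import erfc, sqrt

def two_sided_p_value(n_a, c_a, n_b, c_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (c_a + c_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (c_b / n_b - c_a / n_a) / se
    return erfc(abs(z) / sqrt(2))

# (visitors A, conversions A, visitors B, conversions B)
scenarios = [
    (50, 5, 50, 7),          # the small-sample edit suggested above
    (200, 20, 200, 26),      # hypothetical intermediate size, 10% vs 13%
    (1000, 100, 1000, 130),  # the original test
]

p_values = {}
for n_a, c_a, n_b, c_b in scenarios:
    p_values[n_a] = two_sided_p_value(n_a, c_a, n_b, c_b)
    print(f"n = {n_a:>4} per group: p-value = {p_values[n_a]:.3f}")
```

The rates barely change from row to row, yet only the largest sample pushes the p-value under 0.05.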
This is why you must decide on a target sample size before starting the test — not after you peek at early results and like what you see.
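One common way to pick that target is a power calculation. The sketch below uses the standard normal-approximation formula for comparing two proportions. It assumes a two-sided 5% significance level and 80% power, which are conventional defaults rather than values stated in this lesson, and the function name visitors_per_group is hypothetical.

```python
from math import ceil
from statistics import NormalDist

def visitors_per_group(p_base, p_target, alpha=0.05, power=0.80):
    """Approximate visitors needed per group to detect p_base vs p_target."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # cutoff for two-sided significance
    z_power = NormalDist().inv_cdf(power)          # cutoff for the desired power
    variance = p_base * (1 - p_base) + p_target * (1 - p_target)
    return ceil((z_alpha + z_power) ** 2 * variance / (p_target - p_base) ** 2)

n_needed = visitors_per_group(0.10, 0.13)
n_small_lift = visitors_per_group(0.10, 0.11)
print(f"10% vs 13%: about {n_needed} visitors per group")
print(f"10% vs 11%: about {n_small_lift} visitors per group")
```

Under these assumptions a true 10% to 13% lift calls for roughly 1,800 visitors per group, and a smaller lift calls for far more. The smaller the effect you want to detect, the larger the sample you need to commit to up front.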
What the z-score measures
The z-score (here, 2.10) counts how many standard errors separate the two conversion rates. It is purely a normalized way to place the observed gap on the normal distribution so we can look up (or compute) the p-value. A z-score of 2.10 means the gap is 2.10 times the size of the typical random fluctuation we expect at this sample size — large enough to be convincing.
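As a worked check, the 2.10 can be reproduced by hand. The derivation below assumes the pooled form of the standard error (both groups share one conversion rate under the "no difference" hypothesis); the symbols are introduced here for illustration.

```latex
\begin{aligned}
\hat{p} &= \frac{100 + 130}{1000 + 1000} = 0.115 \\
\mathrm{SE} &= \sqrt{\hat{p}\,(1 - \hat{p})\left(\frac{1}{1000} + \frac{1}{1000}\right)}
             = \sqrt{0.115 \times 0.885 \times 0.002} \approx 0.01427 \\
z &= \frac{\hat{p}_B - \hat{p}_A}{\mathrm{SE}} = \frac{0.13 - 0.10}{0.01427} \approx 2.10
\end{aligned}
```

Here \hat{p} is the pooled conversion rate, and \hat{p}_A and \hat{p}_B are the observed rates in the control and treatment groups.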
You do not need to memorize z-score cutoffs. The only number you need to compare to a threshold is the p-value itself.
Next
Storytelling with data — turning a significant result into a decision the room actually acts on.
Practice this in an interview
A rigorous A/B test requires a pre-registered hypothesis, a single primary metric, sample size calculated before launch, random unit-level assignment, and a fixed runtime. Skipping any of these steps opens the door to false positives and post-hoc rationalization.
Run until you have reached the pre-calculated sample size — which should include at least one full weekly cycle to average out day-of-week effects. Stopping early because results look good, or extending because they do not, both inflate error rates.
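The cost of stopping early can be seen in a small simulation. The sketch below runs A/A tests, where both arms share the same true rate and every "significant" result is therefore a false positive. It compares checking once at the planned sample size against checking every 100 visitors and stopping at the first significant look. The 10% rate, the look schedule, the seed, and the helper name is_significant are all assumptions chosen for illustration.

```python
import random
from math import erfc, sqrt

rng = random.Random(1)  # fixed seed so the sketch is repeatable

def is_significant(n, c_a, c_b, alpha=0.05):
    """Pooled two-proportion z-test with n visitors in each group."""
    pooled = (c_a + c_b) / (2 * n)
    se = sqrt(pooled * (1 - pooled) * 2 / n)
    if se == 0:
        return False  # no variation yet, nothing to declare
    z = (c_b - c_a) / n / se
    return erfc(abs(z) / sqrt(2)) < alpha

RATE = 0.10       # both arms convert at the same true rate (an A/A test)
N_FINAL = 1000    # planned visitors per group
LOOK_EVERY = 100  # peek after every 100 visitors per group
TRIALS = 500

peeking_hits = 0  # a "winner" was declared at any look
fixed_hits = 0    # a "winner" was declared at the planned final look only

for _ in range(TRIALS):
    c_a = c_b = 0
    any_look = False
    significant_now = False
    for visitor in range(1, N_FINAL + 1):
        c_a += rng.random() < RATE
        c_b += rng.random() < RATE
        if visitor % LOOK_EVERY == 0:
            significant_now = is_significant(visitor, c_a, c_b)
            any_look = any_look or significant_now
    peeking_hits += any_look
    fixed_hits += significant_now  # the last look is the planned final analysis

peeking_rate = peeking_hits / TRIALS
fixed_rate = fixed_hits / TRIALS
print(f"false positives, one planned look : {fixed_rate:.1%}")
print(f"false positives, stop at any peek : {peeking_rate:.1%}")
```

The single planned look should stay near the nominal 5%, while the peeking strategy can only match or exceed it, because every extra look is another chance for noise to cross the threshold.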
A non-significant result does not mean the treatment has no effect; it means the data are insufficient to distinguish the observed difference from noise at the specified power level. The correct interpretation depends entirely on the statistical power of the test — a well-powered flat result is evidence of no meaningful effect; an underpowered flat result is inconclusive.
Novelty effect is the temporary engagement spike users show with any change simply because it is new; primacy effect is the temporary dip when users resist a change to a familiar interface. Both cause the short-term treatment effect to differ materially from the long-term steady-state effect.