How do you design an A/B test from scratch?

A rigorous A/B test requires a pre-registered hypothesis, a single primary metric, sample size calculated before launch, random unit-level assignment, and a fixed runtime. Skipping any of these steps opens the door to false positives and post-hoc rationalization.

How long should you run an A/B test?

Run until you reach the pre-calculated sample size, and plan the runtime to cover at least one full weekly cycle so that day-of-week effects average out. Stopping early because results look good and extending because they do not both inflate error rates.

An A/B test comes back non-significant. How do you interpret and communicate that result?

A non-significant result does not mean the treatment has no effect; it means the data are insufficient to distinguish the observed difference from noise at the chosen significance level. The correct interpretation depends on the statistical power of the test: a well-powered flat result is evidence against any effect as large as the one the test was sized to detect, while an underpowered flat result is inconclusive.

How do you decide if a new model is actually better in production?

Offline metrics often don't predict business impact, so you run a controlled online experiment: split live traffic between the current champion and the new challenger and compare a pre-registered business metric with a statistical significance test. You size the test for adequate power, watch guardrail metrics like latency and errors, and only ship if the lift is statistically and practically significant. Variance-reduction techniques like CUPED let you reach the same power with less traffic, and so get to a decision sooner.
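To make the CUPED remark concrete, here is a minimal sketch on synthetic data. It assumes each user has a pre-experiment covariate (here a made-up "spend before the test") that correlates with the in-experiment metric. The function name `cuped_adjust` and the variables `pre` and `post` are hypothetical, and the numbers are generated purely for illustration.

```python
import random
import statistics

def cuped_adjust(metric, covariate):
    """Return CUPED-adjusted values: y - theta * (x - mean(x)).

    theta = cov(x, y) / var(x) is the slope that removes the part of the
    metric that the pre-experiment covariate already explains.
    """
    mean_x = statistics.fmean(covariate)
    mean_y = statistics.fmean(metric)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(covariate, metric))
    var_x = sum((x - mean_x) ** 2 for x in covariate)
    theta = cov_xy / var_x
    return [y - theta * (x - mean_x) for x, y in zip(covariate, metric)]

# Synthetic data (an assumption for illustration, not real measurements).
rng = random.Random(0)
pre = [rng.gauss(50, 10) for _ in range(2000)]    # pre-experiment spend
post = [0.8 * x + rng.gauss(0, 5) for x in pre]   # in-experiment spend

adjusted = cuped_adjust(post, pre)
print(f"variance before adjustment: {statistics.pvariance(post):.1f}")
print(f"variance after adjustment : {statistics.pvariance(adjusted):.1f}")
```

The adjustment leaves the mean unchanged and only removes variance, which is why the same lift becomes easier to detect. In a real experiment theta would typically be estimated once on data pooled across both arms and then applied to each arm.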

A/B Testing — Did That Change Actually Work? — Business Analytics

Optimization was only as trustworthy as the numbers we fed it: the $40 margin, the conversion rate, every input taken on faith. So before any of those numbers earns trust, we need one discipline: when a change looks like an improvement, how do you prove the improvement is real and not a lucky accident?

Your design team shipped a new checkout button. Over two weeks you collected data: 1,000 visitors saw the old button, 100 completed a purchase (10%). Another 1,000 visitors saw the new button, 130 completed a purchase (13%). The new button looks better — but is it actually better, or did randomness hand you a flattering result by chance?

That question is what A/B testing was built to answer.

What an A/B test is

An A/B test is a controlled experiment: you randomly split your users into two groups, show each group exactly one version of whatever you are testing, and then compare a metric across the groups.

Group A — the control — sees the current version (the old button). “Control” means the baseline you are measuring against.
Group B — the treatment — sees the new version (the new button). “Treatment” is the change you are evaluating.
The metric is the number you care about. Here it is conversion rate — the share of visitors who complete a purchase. Group A converted 100 out of 1,000 visitors, so its conversion rate is 10%. Group B converted 130 out of 1,000, so its conversion rate is 13%.

The critical ingredient is randomization: assigning visitors to groups at random. Because neither group self-selected, the two groups are comparable on average on every other dimension (age, device, location, time of day), and any leftover imbalance is itself just chance. The only systematic difference between them is which button they saw. That is what makes the comparison fair, and it is the property that distinguishes an A/B test from simply comparing Monday traffic to Tuesday traffic.
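One common way to implement random assignment is to hash a stable user identifier, so that the same visitor always lands in the same group. The sketch below assumes string user IDs and a 50/50 split; the function name `assign_group` and the experiment label `"checkout-button"` are hypothetical.

```python
import hashlib

def assign_group(user_id: str, experiment: str = "checkout-button") -> str:
    """Deterministically map a user to 'A' (control) or 'B' (treatment)."""
    # Including the experiment name means a user's bucket in one test
    # does not determine their bucket in another test.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100          # a number from 0 to 99
    return "A" if bucket < 50 else "B"

# Check that the split comes out roughly even on made-up user IDs.
counts = {"A": 0, "B": 0}
for i in range(10_000):
    counts[assign_group(f"user-{i}")] += 1
print(counts)
```

Because the assignment depends only on the ID and the experiment name, it needs no stored state and a returning visitor never switches buttons mid-test.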

The problem of noise

Even if the two buttons were completely identical, you would almost never observe exactly 10% in both groups. Random sampling produces variation. On any given day, Group A might land at 9.4% and Group B at 10.7% purely by chance — no difference in the buttons, just the luck of which visitors happened to show up.

So when you observe 10% versus 13%, you face a fundamental ambiguity: is the 3-point gap a real effect of the new button, or is it within the normal range of random variation you would expect even with identical buttons?

Statistics separates signal from noise. The logic runs like this: if A and B were truly identical, how often would random chance alone produce a gap as large as 3 percentage points (or larger) on samples of 1,000 each? If that happens very rarely — say, less than 5% of the time — then the gap is unlikely to be pure luck, and we credit the new button.
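The thought experiment above can be run directly as a simulation. This sketch assumes both buttons share one true conversion rate of 11.5% (the pooled rate from the example, 230 purchases out of 2,000 visitors) and counts how often two groups of 1,000 still end up 3 or more points apart. The function name and the trial count are illustrative choices.

```python
import random

def simulate_gap_frequency(n=1000, rate=0.115, min_gap=30, trials=2000, seed=42):
    """Share of identical-button experiments whose conversion counts
    differ by at least `min_gap` (30 conversions = 3 points at n=1000)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Each visitor converts independently with the same probability.
        conv_a = sum(rng.random() < rate for _ in range(n))
        conv_b = sum(rng.random() < rate for _ in range(n))
        if abs(conv_b - conv_a) >= min_gap:
            hits += 1
    return hits / trials

freq = simulate_gap_frequency()
print(f"identical buttons, gap of 3+ points: {freq:.1%} of runs")
```

The exact share moves a little with the seed and the number of trials, but it should land in the low single-digit percent range: a gap this large is unusual when nothing has actually changed.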

Statistical significance and the p-value

The tool that formalizes this reasoning is the p-value — the probability of seeing a gap this big or bigger, purely by chance, if there were truly no difference between A and B.

A small p-value means: “luck alone would rarely produce a gap this large.” Conventionally, if the p-value is below 0.05 (5%), we call the result statistically significant — meaning the evidence against the “it’s all luck” explanation is strong enough that we reject it and credit the treatment.

The snippet below computes everything from the raw numbers.

import math

# A = control (current button), B = treatment (new button)
nA, cA = 1000, 100   # A: visitors, conversions  -> 10.0%
nB, cB = 1000, 130   # B: visitors, conversions  -> 13.0%

pA, pB = cA / nA, cB / nB
pool = (cA + cB) / (nA + nB)                       # conversion if A and B were the same
se = math.sqrt(pool * (1 - pool) * (1 / nA + 1 / nB))
z = (pB - pA) / se                                 # how many standard errors apart
# two-sided p-value from the normal approximation
pval = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

print(f"A (old) conversion : {pA:.1%}")
print(f"B (new) conversion : {pB:.1%}")
print(f"absolute lift      : +{(pB - pA) * 100:.0f} points")
print(f"relative lift      : +{(pB - pA) / pA * 100:.0f}%")
print(f"z-score            : {z:.2f}")
print(f"p-value            : {pval:.4f}")
print(f"significant at 0.05: {pval < 0.05}")

A (old) conversion : 10.0%
B (new) conversion : 13.0%
absolute lift      : +3 points
relative lift      : +30%
z-score            : 2.10
p-value            : 0.0355
significant at 0.05: True

Breaking it down:

Absolute lift is the raw difference in conversion rates: 13% minus 10% equals +3 percentage points.
Relative lift expresses that gain as a share of the baseline: 3 divided by 10 equals +30%. This is the number marketing tends to quote (“30% uplift!”) — but the absolute lift is what determines how many additional customers you actually get.
Z-score (2.10) is the gap measured in units of the standard error, a measure of how much random variation you would expect in samples of this size. A z-score beyond about 1.96 in either direction corresponds to p below 0.05 for a two-sided test.
P-value (0.0355) is below 0.05, so the result is statistically significant. A gap of 3 points or more would occur by luck alone only about 3.6% of the time if the buttons were identical. That is rare enough to credit the new button.

B converts 13% versus A’s 10% — a +3-point (+30% relative) lift. With a p-value of 0.0355, the result clears the 0.05 threshold. The new button likely does convert better.

Why sample size matters

Notice that this test had 1,000 visitors per group, and that was enough for a 3-point gap to clear the threshold. The denominator of the z-score is the standard error, the typical size of random fluctuation. Larger samples shrink the standard error, so the same observed gap produces a larger z-score and a smaller p-value.

To see this, re-run the same calculation with only 50 visitors per group — nA, cA = 50, 5 and nB, cB = 50, 7, an even bigger gap of 10% vs 14%:

A (old) conversion : 10.0%
B (new) conversion : 14.0%
absolute lift      : +4 points
relative lift      : +40%
z-score            : 0.62
p-value            : 0.5383
significant at 0.05: False

The lift is now larger — +4 points, +40% relative — yet the p-value has jumped from 0.0355 all the way to 0.54. On just 50 visitors each, a gap this size happens by luck more than half the time, so it tells you nothing. The effect didn’t shrink; the noise grew. This is why you must decide on a target sample size before starting the test — not after you peek at early results and like what you see.
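Deciding the sample size up front can be sketched with the standard normal-approximation formula for comparing two proportions. The inputs here are assumptions you would choose yourself: a 10% baseline, a 13% target, a 0.05 significance level, and 80% power. The function name `sample_size_per_group` is hypothetical.

```python
import math
from statistics import NormalDist

def sample_size_per_group(p_base, p_target, alpha=0.05, power=0.80):
    """Visitors needed in EACH group for a two-sided two-proportion z-test
    (normal approximation) to detect p_base vs p_target."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    p_bar = (p_base + p_target) / 2
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_power * math.sqrt(p_base * (1 - p_base) + p_target * (1 - p_target))
    ) ** 2
    return math.ceil(numerator / (p_target - p_base) ** 2)

n = sample_size_per_group(0.10, 0.13)
print(f"visitors needed per group: {n}")
```

Under these assumptions the formula asks for roughly 1,800 visitors per group, more than the 1,000 in the example. Requiring an 80% chance of detecting a true 3-point lift is a stricter bar than clearing p < 0.05 in one particular run, and smaller target lifts push the requirement up quickly.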

Watch out

Four things people get wrong about p-values and significance.

First: the p-value is NOT “the probability that B is better.” It is computed under the assumption that A and B are identical, so it cannot tell you how likely B is to be superior.

Second: the p-value is NOT “the probability the result is due to chance.” It is the probability of the observed gap (or a bigger one) assuming A and B are truly equal — a subtle but critical difference.

Third: never “peek” and stop the test the moment it looks significant. If you check your results daily and stop on the first day p crosses 0.05, you will manufacture false winners regularly — peeking repeatedly inflates your actual false-positive rate far above 5%. Fix your sample size up front and stop exactly once.
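The cost of peeking can be illustrated with a simulation in which the two versions are truly identical, so every "significant" result is a false positive. The setup is an assumption for illustration: 14 daily checks, 100 new visitors per group per day, and a 10% conversion rate for both groups. The function names are hypothetical.

```python
import math
import random

def p_value(c_a, n_a, c_b, n_b):
    """Two-sided p-value from the pooled two-proportion z-test."""
    pool = (c_a + c_b) / (n_a + n_b)
    se = math.sqrt(pool * (1 - pool) * (1 / n_a + 1 / n_b))
    if se == 0:                       # no variation at all, nothing to test
        return 1.0
    z = (c_b / n_b - c_a / n_a) / se
    return math.erfc(abs(z) / math.sqrt(2))

def false_positive_rates(days=14, daily_n=100, rate=0.10, trials=1000, seed=7):
    """Return (rate when stopping at the first p < 0.05, rate when testing once at the end)."""
    rng = random.Random(seed)
    peek_hits = fixed_hits = 0
    for _ in range(trials):
        c_a = c_b = n = 0
        crossed = False
        for _day in range(days):
            c_a += sum(rng.random() < rate for _ in range(daily_n))
            c_b += sum(rng.random() < rate for _ in range(daily_n))
            n += daily_n
            p = p_value(c_a, n, c_b, n)
            if p < 0.05:
                crossed = True        # a peeker would have stopped and shipped here
        peek_hits += crossed
        fixed_hits += p < 0.05        # p is now the final-day value
    return peek_hits / trials, fixed_hits / trials

peek_rate, fixed_rate = false_positive_rates()
print(f"false positives when peeking daily: {peek_rate:.1%}")
print(f"false positives when testing once : {fixed_rate:.1%}")
```

The single end-of-test check should stay near the nominal 5%, while the peeking rate comes out several times higher. Exact figures depend on the seed and on how many looks you take.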

Finally, statistical significance is not business significance. A +0.1-point lift on a low-margin product might be statistically significant on a large enough sample but not worth the engineering cost to ship.

What the z-score measures

The z-score (here, 2.10) counts how many standard errors separate the two conversion rates. It is purely a normalized way to place the observed gap on the normal distribution so we can look up (or compute) the p-value. A z-score of 2.10 means the gap is 2.10 times the size of the typical random fluctuation we expect at this sample size — large enough to be convincing.

You do not need to memorize z-score cutoffs. The only number you need to compare to a threshold is the p-value itself.
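For readers who want to see the lookup itself, the conversion from a z-score to a two-sided p-value is a one-liner with the standard library. This is a minimal sketch; the function name `two_sided_p` is hypothetical.

```python
from statistics import NormalDist

def two_sided_p(z: float) -> float:
    """Probability of a standard normal value at least |z| away from 0."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

for z in (1.0, 1.96, 2.10, 3.0):
    print(f"z = {z:.2f} -> p = {two_sided_p(z):.4f}")
```

The sign of z only says which group is ahead; the p-value depends on its size, which is why the function takes the absolute value.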

In one breath

An A/B test randomly splits users into a control (old) and a treatment (new), so the two groups are comparable on average on everything except the change; that randomization is what makes the comparison fair. Because random sampling alone produces variation, the question is whether an observed gap (10% vs 13%) is real or luck. The p-value answers it: the probability of seeing a gap this big or bigger if the two versions were truly identical. Below 0.05 we call it statistically significant. Here p = 0.0355, so the button likely wins; but on only 50 visitors each, an even larger lift of 10% vs 14% gives p = 0.54, because sample size controls how much noise to expect. Guard against the classic errors: a p-value is not the chance B is better or the chance the result is luck; never peek and stop the moment it crosses 0.05; and statistical significance is not business significance.

Practice

Quick check

Q1. An A/B test on a new email subject line has 5,000 recipients in each group, open rates of 22% (control) vs 23.8% (treatment), and a p-value of about 0.032. A colleague says, 'Great, there's only a 3.2% chance the result is due to luck.' What is wrong with that interpretation?

Q2. A product team runs an A/B test. After day 3, they check the results, see p = 0.04, stop the test, and ship the new feature. What is the likely problem?

Q3. A startup tests a new onboarding flow. Control: 200 users, 20 convert (10%). Treatment: 200 users, 26 convert (13%). The p-value is about 0.35. An investor points out that 13% vs 10% is the same +30% relative lift you saw in the lesson, yet the result is not significant. What best explains the difference?

A question to carry forward

You now have a statistically significant result: p = 0.0355, and the evidence favors the new button. But sit in the meeting where you present it. You say, “The p-value is 0.0355, which clears our 0.05 threshold for the two-sided z-test,” and you watch the room’s eyes glaze. A correct number, delivered like that, changes nothing.

So the final question of this whole section is the one that decides whether any of this analysis ever matters: how do you turn a real finding into a decision the room actually acts on? The last lesson is storytelling with data — taking the significant result, the forecast, the optimal mix, the unit economics, and shaping them into a story that moves people from “interesting” to “let’s do it.”
