datarekha

A/B Testing — Did That Change Actually Work?

The new checkout button converted 13% of visitors versus the old one's 10%. Did the new button really win, or did we just get lucky with 1,000 visitors each? A/B testing gives a rigorous answer.

8 min read Intermediate Business Analytics Lesson 20 of 21

What you'll learn

  • What an A/B test is and why randomization makes the comparison fair
  • Why random noise means a gap could be luck — and how statistics separates signal from noise
  • What a p-value actually is (and the two things it is NOT)
  • Why sample size determines whether a real difference is detectable
  • Why statistical significance is not the same as business significance

Before you start

Your design team shipped a new checkout button. Over two weeks you collected data: 1,000 visitors saw the old button and 100 of them completed a purchase (10%); another 1,000 visitors saw the new button and 130 of them completed a purchase (13%). The new button looks better, but is it actually better, or did randomness hand you a flattering result by chance?

That question is what A/B testing was built to answer.

What an A/B test is

An A/B test is a controlled experiment: you randomly split your users into two groups, show each group exactly one version of whatever you are testing, and then compare a metric across the groups.

  • Group A — the control — sees the current version (the old button). “Control” means the baseline you are measuring against.
  • Group B — the treatment — sees the new version (the new button). “Treatment” is the change you are evaluating.
  • The metric is the number you care about. Here it is conversion rate — the share of visitors who complete a purchase. Group A converted 100 out of 1,000 visitors, so its conversion rate is 10%. Group B converted 130 out of 1,000, so its conversion rate is 13%.

The critical ingredient is randomization: assigning visitors to groups at random. Because neither group self-selected, the two groups are comparable on average on every other dimension (age, device, location, time of day), and any imbalance that remains is itself just chance. The only systematic difference between them is which button they saw. That is what makes the comparison fair, and it is the property that distinguishes an A/B test from simply comparing Monday traffic to Tuesday traffic.
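To see why randomization balances the groups, here is a minimal Python sketch. The visitor list, the 60% mobile share, and the helper mobile_share are all hypothetical and only serve to illustrate the idea: a coin flip per visitor should leave both groups with roughly the same device mix.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical visitors: each has a device type that is unrelated to the test.
visitors = [{"id": i, "mobile": rng.random() < 0.6} for i in range(10_000)]

# Randomization: a coin flip per visitor decides the group.
for v in visitors:
    v["group"] = "A" if rng.random() < 0.5 else "B"

def mobile_share(group):
    """Share of mobile visitors inside one group."""
    members = [v for v in visitors if v["group"] == group]
    return sum(v["mobile"] for v in members) / len(members)

share_a = mobile_share("A")
share_b = mobile_share("B")
print(f"mobile share in A: {share_a:.1%}")
print(f"mobile share in B: {share_b:.1%}")
```

The two shares should come out close to each other, even though device type was never used to assign anyone to a group.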

The problem of noise

Even if the two buttons were completely identical, you would almost never observe exactly 10% in both groups. Random sampling produces variation. On any given day, Group A might land at 9.4% and Group B at 10.7% purely by chance — no difference in the buttons, just the luck of which visitors happened to show up.

So when you observe 10% versus 13%, you face a fundamental ambiguity: is the 3-point gap a real effect of the new button, or is it within the normal range of random variation you would expect even with identical buttons?

Statistics separates signal from noise. The logic runs like this: if A and B were truly identical, how often would random chance alone produce a gap as large as 3 percentage points (or larger) on samples of 1,000 each? If that happens very rarely — say, less than 5% of the time — then the gap is unlikely to be pure luck, and we credit the new button.
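That "how often would chance alone do this?" question can be answered by brute force. The sketch below is an illustration, not the lesson's own method: it simulates many tests in which both buttons share one true conversion rate (assumed here to be 11.5%, the combined rate of 230 purchases in 2,000 visits) and counts how often the two groups end up at least 30 conversions apart, which is 3 percentage points on 1,000 visitors. The function name and its defaults are hypothetical.

```python
import random

def simulate_aa_gap_rate(n=1000, rate=0.115, min_gap=30, trials=2000, seed=0):
    """Fraction of identical-button experiments whose conversion counts
    differ by at least `min_gap` (30 out of 1,000 = 3 percentage points)."""
    rng = random.Random(seed)
    extreme = 0
    for _ in range(trials):
        # Both groups draw from the same true rate: any gap is pure noise.
        conv_a = sum(rng.random() < rate for _ in range(n))
        conv_b = sum(rng.random() < rate for _ in range(n))
        if abs(conv_b - conv_a) >= min_gap:
            extreme += 1
    return extreme / trials

share = simulate_aa_gap_rate()
print(f"identical buttons, gap >= 3 points: {share:.1%} of simulated tests")
```

The printed share should be a few percent, under the 5% bar used above. Counting gaps in either direction matches a two-sided test.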

Statistical significance and the p-value

The tool that formalizes this reasoning is the p-value — the probability of seeing a gap this big or bigger, purely by chance, if there were truly no difference between A and B.

A small p-value means: “luck alone would rarely produce a gap this large.” Conventionally, if the p-value is below 0.05 (5%), we call the result statistically significant — meaning the evidence against the “it’s all luck” explanation is strong enough that we reject it and credit the treatment.

The playground below computes everything from the raw numbers. Run it and read the output.
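A minimal Python sketch of such a calculation follows, assuming a pooled two-proportion z-test with a two-sided p-value (the lesson does not name the exact variant). The names nA, cA, nB, cB are the ones the lesson refers to; the function name two_proportion_ztest is hypothetical.

```python
from math import sqrt, erfc

def two_proportion_ztest(nA, cA, nB, cB):
    """Pooled two-proportion z-test; returns (pA, pB, z, two-sided p-value)."""
    pA, pB = cA / nA, cB / nB
    pooled = (cA + cB) / (nA + nB)                      # shared rate if A and B are identical
    se = sqrt(pooled * (1 - pooled) * (1 / nA + 1 / nB))  # standard error of the gap
    z = (pB - pA) / se
    p_value = erfc(abs(z) / sqrt(2))                    # two-sided normal tail probability
    return pA, pB, z, p_value

nA, cA = 1000, 100   # visitors and purchases, old button
nB, cB = 1000, 130   # visitors and purchases, new button
pA, pB, z, p_value = two_proportion_ztest(nA, cA, nB, cB)

print(f"A (old) conversion : {pA:.1%}")
print(f"B (new) conversion : {pB:.1%}")
print(f"absolute lift      : {(pB - pA) * 100:+.0f} points")
print(f"relative lift      : {(pB - pA) / pA:+.0%}")
print(f"z-score            : {z:.2f}")
print(f"p-value            : {p_value:.4f}")
print(f"significant at 0.05: {p_value < 0.05}")
```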

The output:

A (old) conversion : 10.0%
B (new) conversion : 13.0%
absolute lift      : +3 points
relative lift      : +30%
z-score            : 2.10
p-value            : 0.0355
significant at 0.05: True

Breaking it down:

  • Absolute lift is the raw difference in conversion rates: 13% minus 10% equals +3 percentage points.
  • Relative lift expresses that gain as a share of the baseline: 3 divided by 10 equals +30%. This is the number marketing tends to quote (“30% uplift!”) — but the absolute lift is what determines how many additional customers you actually get.
  • Z-score (2.10) is the gap measured in units of the standard error, the typical size of the random variation you would expect in samples of this size. A z-score above about 1.96 in absolute value corresponds to p below 0.05 for a two-sided test.
  • P-value (0.0355) is below 0.05, so the result is statistically significant. A gap of 3 points or more would occur by luck alone only about 3.6% of the time if the buttons were identical. That is rare enough to credit the new button.

B converts 13% versus A’s 10% — a +3-point (+30% relative) lift. With a p-value of 0.0355, the result clears the 0.05 threshold. The new button likely does convert better.

Why sample size matters

Notice that 1,000 visitors per group was only just enough to detect this difference: the z-score of 2.10 barely clears the 1.96 cutoff. The denominator of the z-score is the standard error, the typical size of random fluctuation. For the same observed gap, larger samples shrink the standard error, making the z-score larger and the p-value smaller.

Try editing the playground: change nA, cA = 1000, 100 to nA, cA = 50, 5 and nB, cB = 1000, 130 to nB, cB = 50, 7. That gives 10% vs 14% on 50 visitors each, the closest you can get to the original 10% vs 13% with whole conversions. The p-value jumps to about 0.54, far above 0.05: a lift of roughly the same size is no longer detectable because the sample is too small for the noise to settle.
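The same effect can be seen across several sample sizes at once. This is a sketch under the same assumption as before (a pooled two-proportion z-test, two-sided); the helper name p_value is hypothetical, and the rows keep the rates near 10% vs 13% while only the group size changes.

```python
from math import sqrt, erfc

def p_value(nA, cA, nB, cB):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (cA + cB) / (nA + nB)
    se = sqrt(pooled * (1 - pooled) * (1 / nA + 1 / nB))
    z = (cB / nB - cA / nA) / se
    return erfc(abs(z) / sqrt(2))

# (visitors per group, purchases in A, purchases in B)
scenarios = [(50, 5, 7), (200, 20, 26), (1000, 100, 130), (5000, 500, 650)]

for n, cA, cB in scenarios:
    print(f"n = {n:>5} per group: p-value = {p_value(n, cA, n, cB):.4f}")
```

Reading down the output, the p-value falls as the groups grow, even though the observed rates barely move.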

This is why you must decide on a target sample size before starting the test — not after you peek at early results and like what you see.
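One common way to pick that target is a power calculation. The sketch below uses the standard normal-approximation formula for comparing two proportions. The 80% power default and the function name required_n_per_group are assumptions for illustration; the lesson itself does not specify how to choose the sample size.

```python
from math import sqrt, ceil
from statistics import NormalDist

def required_n_per_group(p_base, p_new, alpha=0.05, power=0.80):
    """Approximate visitors needed per group to detect p_base vs p_new
    with a two-sided test at level `alpha` and the given power."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    p_bar = (p_base + p_new) / 2
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_power * sqrt(p_base * (1 - p_base) + p_new * (1 - p_new))
    ) ** 2
    return ceil(numerator / (p_new - p_base) ** 2)

n_needed = required_n_per_group(0.10, 0.13)
print(f"visitors needed per group: {n_needed}")
```

The inputs are the baseline rate and the smallest lift you care about detecting, both of which have to be chosen before any data comes in.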

What the z-score measures

The z-score (here, 2.10) counts how many standard errors separate the two conversion rates. It is purely a normalized way to place the observed gap on the normal distribution so we can look up (or compute) the p-value. A z-score of 2.10 means the gap is 2.10 times the size of the typical random fluctuation we expect at this sample size — large enough to be convincing.
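Written out for the checkout example, and assuming the pooled two-proportion z-test (the lesson does not state which variant the playground uses, but this one reproduces z = 2.10), the arithmetic is:

```latex
\begin{align*}
\hat{p} &= \frac{100 + 130}{1000 + 1000} = 0.115 \\
\mathrm{SE} &= \sqrt{\hat{p}\,(1-\hat{p})\left(\frac{1}{1000} + \frac{1}{1000}\right)}
             = \sqrt{0.115 \times 0.885 \times 0.002} \approx 0.01427 \\
z &= \frac{0.13 - 0.10}{\mathrm{SE}} \approx \frac{0.03}{0.01427} \approx 2.10
\end{align*}
```

Here \hat{p} is the combined conversion rate of both groups, which is the best estimate of the shared rate if the buttons were identical.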

You do not need to memorize z-score cutoffs. The only number you need to compare to a threshold is the p-value itself.


Quick check

Q1. A/B test on a new email subject line: 5,000 recipients in each group, open rate 22% (control) vs 23.8% (treatment), p-value = 0.032. A colleague says 'great, there's only a 3.2% chance the result is due to luck.' What is wrong with that interpretation?
Q2. A product team runs an A/B test. After day 3, they check results and see p = 0.04. They stop the test and ship the new feature. What is the likely problem?
Q3. A startup tests a new onboarding flow. Control: 200 users, 20 convert (10%). Treatment: 200 users, 26 convert (13%). The p-value is about 0.35. An investor points out that 13% vs 10% is the same +30% relative lift you saw in the lesson, yet the result is not significant. What best explains the difference?

Next

Storytelling with data — turning a significant result into a decision the room actually acts on.
