An A/B test comes back non-significant. How do you interpret and communicate that result?
A non-significant result does not mean the treatment has no effect; it means the data are insufficient to distinguish the observed difference from noise at the chosen significance level. The correct interpretation depends on the statistical power of the test: a well-powered flat result is evidence of no meaningful effect; an underpowered flat result is inconclusive.
How to think about it
Failing to reject is not the same as accepting the null
Treating a failure to reject as proof of the null is the most common misinterpretation in practice. A p-value of 0.3 does not mean the treatment effect is zero. It means that, if there were truly no effect, a difference at least as large as the one observed would occur about 30 % of the time. That is consistent with a true effect of zero, but also with a true effect too small to detect given the sample size.
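To make the sample-size dependence concrete, here is a minimal sketch of a pooled two-proportion z-test using only the Python standard library. The function name and all counts are hypothetical and chosen purely for illustration; the same observed lift yields a p-value near 0.3 on a small sample and a very small p-value on a sample ten times larger.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for H0: both arms share one conversion rate (pooled z-test)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Probability, under H0, of a difference at least this extreme in either direction.
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical data: 10.00 % vs 10.62 % conversion, 5,000 users per arm.
p_small = two_proportion_p_value(500, 5_000, 531, 5_000)      # roughly 0.31

# Same observed rates, ten times the traffic.
p_large = two_proportion_p_value(5_000, 50_000, 5_310, 50_000)  # roughly 0.001
```

The observed lift is identical in both calls; only the amount of data changes. A p-value around 0.3 therefore says little about whether the effect is zero or merely too small for the sample at hand.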
The right diagnostic framework
- Check power. Did the test accumulate the pre-specified sample size? If yes and power was 80 %, a non-significant result is moderately strong evidence against effects larger than the minimum detectable effect (MDE). If the sample fell short (e.g., you reached only 40 % of the target), the test is underpowered and you cannot conclude much either way.
- Report the confidence interval, not just the p-value. A 95 % CI of [-0.1 %, +0.2 %] for the change in conversion rate tells a very different story from a CI of [-2 %, +2 %]. The former rules out practically meaningful effects; the latter does not.
- Distinguish three outcomes:
  - Equivalence: the CI is narrow and excludes effects as large as the MDE in both directions. Strong evidence that any effect is too small to matter.
  - Inconclusive: the CI is wide and includes both practically meaningful and null effects. The test lacked power; no conclusion is warranted.
  - Directional signal below the significance threshold: the point estimate is in the expected direction but the CI crosses zero. Decide whether to run a follow-up test with a larger sample or to treat the result as a null.
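The three diagnostic steps above can be sketched in a few lines of standard-library Python. Everything here is illustrative: the helper names, the counts, and the 0.5 pp MDE are assumptions, the sample-size formula is the usual normal approximation, and the interval is a simple Wald interval. The text does not define a cut-off between "inconclusive" and "directional signal", so the rule in `classify` (positive point estimate, with harm as large as the MDE ruled out) is one possible convention, not the only one.

```python
from math import ceil, sqrt
from statistics import NormalDist

Z = NormalDist()


def required_n_per_arm(base_rate, mde, alpha=0.05, power=0.80):
    """Approximate per-arm sample size to detect an absolute lift of `mde`."""
    p1, p2 = base_rate, base_rate + mde
    z_alpha = Z.inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = Z.inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / mde ** 2)


def diff_ci(conv_a, n_a, conv_b, n_b, level=0.95):
    """Wald confidence interval for the absolute difference p_b - p_a."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    half_width = Z.inv_cdf(0.5 + level / 2) * se
    diff = p_b - p_a
    return diff - half_width, diff + half_width


def classify(ci_low, ci_high, mde):
    """Map a CI onto the outcomes above (assumes the expected direction is positive)."""
    if ci_low > 0 or ci_high < 0:
        return "significant"  # CI excludes zero; outside the case discussed here
    if ci_low > -mde and ci_high < mde:
        return "equivalence"
    point = (ci_low + ci_high) / 2  # Wald interval is symmetric around the estimate
    if point > 0 and ci_low > -mde:
        return "directional signal"  # assumed convention, see lead-in
    return "inconclusive"


MDE = 0.005  # hypothetical 0.5 pp threshold on a 10 % base rate
target_n = required_n_per_arm(0.10, MDE)

narrow = classify(*diff_ci(20_000, 200_000, 20_100, 200_000), MDE)
wide = classify(*diff_ci(200, 2_000, 202, 2_000), MDE)
leaning = classify(*diff_ci(2_000, 20_000, 2_100, 20_000), MDE)
```

With these hypothetical counts, the 200,000-per-arm test lands in "equivalence", the 2,000-per-arm test in "inconclusive", and the 20,000-per-arm test in "directional signal". Comparing the achieved sample size with `target_n` is the power check from the first step.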
Communicating to stakeholders
Frame it with the confidence interval and the power check: “The test was fully powered to detect a 0.5 pp lift with 80 % probability. The observed lift was 0.1 pp (95 % CI: -0.3 to +0.5 pp). This provides good evidence that the treatment, if it has any effect, moves conversion by no more than about 0.5 pp, the threshold we considered business-meaningful.” That is actionable. “The result was not significant” is not.
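If the confidence interval is wide and power was low, the correct recommendation is to run a larger follow-up experiment. If you need to affirmatively declare the treatment negligible for business or regulatory purposes, run a formal equivalence test (TOST, two one-sided tests) against a pre-specified margin. On underpowered data such a test will usually fail to establish equivalence, so it complements a larger sample and does not replace it.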
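A minimal TOST sketch for two conversion rates, again using only the standard library. The function name, the unpooled normal approximation, the counts, and the 0.5 pp margin are all assumptions made for illustration; a production analysis would more likely rely on a vetted statistics package.

```python
from math import sqrt
from statistics import NormalDist


def tost_two_proportions(conv_a, n_a, conv_b, n_b, margin, alpha=0.05):
    """TOST for equivalence of two rates within +/- `margin` (absolute difference)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    norm = NormalDist()
    # First one-sided test, H0: diff <= -margin.
    p_lower = 1 - norm.cdf((diff + margin) / se)
    # Second one-sided test, H0: diff >= +margin.
    p_upper = norm.cdf((diff - margin) / se)
    # Equivalence is declared only if BOTH one-sided nulls are rejected.
    p_tost = max(p_lower, p_upper)
    return p_tost, p_tost < alpha


MARGIN = 0.005  # hypothetical equivalence margin of 0.5 pp

# Large hypothetical test: 10.00 % vs 10.05 % on 200,000 users per arm.
p_big, equivalent_big = tost_two_proportions(20_000, 200_000, 20_100, 200_000, MARGIN)

# Small hypothetical test: 10.0 % vs 10.1 % on 2,000 users per arm.
p_tiny, equivalent_tiny = tost_two_proportions(200, 2_000, 202, 2_000, MARGIN)
```

Under these assumptions the large test declares equivalence and the small one does not, even though both observed lifts are tiny. At `alpha = 0.05`, the procedure is equivalent to checking that the 90 % confidence interval lies entirely inside the margin.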