datarekha
Statistics & Probability Medium Asked at MetaAsked at GoogleAsked at AmazonAsked at Booking

An A/B test comes back non-significant. How do you interpret and communicate that result?

The short answer

A non-significant result does not mean the treatment has no effect; it means the data are insufficient to distinguish the observed difference from noise at the specified power level. The correct interpretation depends entirely on the statistical power of the test — a well-powered flat result is evidence of no meaningful effect; an underpowered flat result is inconclusive.

How to think about it

Failing to reject is not the same as accepting the null

This is the most common misinterpretation in practice. A p-value of 0.3 does not mean the treatment effect is zero. It means: given the observed data, if there were truly no effect, this result would occur 30 % of the time. That is consistent with a true effect of zero, but also with a true effect too small to detect given the sample size.

The right diagnostic framework

  1. Check power. Did the test accumulate the pre-specified sample size? If yes and power was 80 %, a non-significant result is moderately strong evidence against effects larger than the MDE. If sample size fell short (e.g., you hit 40 % of the target), the test is underpowered and you cannot conclude much either way.

  2. Report the confidence interval, not just the p-value. A 95 % CI of [-0.1 %, +0.2 %] on conversion rate tells a very different story from a CI of [-2 %, +2 %]. The former rules out practically meaningful effects; the latter does not.

  3. Distinguish three outcomes:

    • Equivalence: CI is narrow and excludes effects above the MDE in both directions. Strong evidence the treatment does not matter.
    • Inconclusive: CI is wide and includes both practically meaningful and null effects. The test lacked power; no conclusion is warranted.
    • Directional signal below significance threshold: point estimate is in the expected direction but CI crosses zero. Consider whether to run a follow-up test with larger sample or treat as a null.

Communicating to stakeholders

Frame it with the confidence interval and the power check: “The test was fully powered to detect a 0.5 pp lift with 80 % probability. The observed lift was 0.1 pp (95 % CI: -0.3 to +0.5 pp). This provides good evidence that the treatment, if it has any effect, moves conversion by less than 0.5 pp — below the threshold we considered business-meaningful.” That is actionable. “The result was not significant” is not.

If the confidence interval is wide and power was low, the correct recommendation is either to run a larger follow-up experiment or to formally run an equivalence test (TOST — two one-sided tests) if you need to affirmatively declare the treatment negligible for business or regulatory purposes.

Learn it properly A/B testing

Keep practising

All Statistics & Probability questions

Explore further

Skip to content