datarekha
Statistics & Probability | Medium | Asked at Google, Meta, Netflix, Airbnb

What is the difference between a p-value and an effect size, and why can a result be statistically significant but practically meaningless?

The short answer

A p-value measures the probability of seeing data at least as extreme as observed under the null hypothesis — it quantifies evidence against the null, not the magnitude of an effect. Effect size quantifies how large the difference or relationship is in meaningful units. With a large enough sample, a trivially small effect can produce an arbitrarily small p-value.

How to think about it

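Statistical significance and practical significance answer different questions: the first asks whether an effect can be distinguished from zero, the second whether it is large enough to matter. Reporting p-values without effect sizes is a widely cited contributor to the replication crisis in empirical science and a common mistake in A/B test reporting.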

What a p-value actually is

Given a null hypothesis H₀, a test statistic T, and its observed value t_observed, the two-sided p-value is:

p = P(|T| ≥ |t_observed| | H₀ true)

A p-value of 0.02 means: if H₀ were true, there is a 2% chance of observing a result this extreme or more extreme by chance alone. It does not mean:

  • There is a 2% probability that H₀ is true.
  • The effect is large or important.
  • The result will replicate.
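
As a minimal sketch of the definition above, the two-sided p-value can be computed directly when the test statistic is approximately standard normal under H₀. The function name `two_sided_p_value` and the example z-score are illustrative assumptions, not part of the original answer.

```python
from statistics import NormalDist


def two_sided_p_value(z: float) -> float:
    """Return P(|Z| >= |z|) for a standard normal Z.

    Assumes the test statistic is (approximately) standard normal
    when the null hypothesis is true.
    """
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


# A z-score of about 2.33 corresponds to a p-value of roughly 0.02.
print(round(two_sided_p_value(2.326), 3))
```

The function only says how surprising the statistic is under H₀. Nothing in it depends on how large the underlying effect is in business units.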

Why large n makes p-values misleading

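With n = 1 000 000 users per variant, the standard error of a mean difference is tiny: about 0.0014 times the standard deviation of the metric. If average order value has a standard deviation of around $2, a difference of $0.01 between two product page variants reaches p < 0.001, even if the business cares only about differences above $2.00. A noisier metric needs more data to get there, but with enough users any nonzero difference eventually becomes significant.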
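
The effect of sample size can be sketched by holding the difference fixed and letting n grow. The standard deviation of $2.00 and the equal group sizes below are assumptions made for illustration, and `p_value_mean_diff` is a hypothetical helper using a normal approximation.

```python
import math
from statistics import NormalDist


def p_value_mean_diff(diff: float, sd: float, n_per_group: int) -> float:
    """Two-sided p-value for a difference in means (normal approximation).

    Assumes two equal-sized groups that share the same standard deviation.
    """
    se = sd * math.sqrt(2.0 / n_per_group)  # standard error of the difference
    z = diff / se
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


# Same $0.01 difference, same assumed spread, increasing sample size.
for n in (1_000, 10_000, 100_000, 1_000_000):
    print(f"n = {n:>9,}  p = {p_value_mean_diff(0.01, 2.0, n):.4f}")
```

The difference never changes, only the amount of data does, yet the p-value falls from clearly non-significant to below 0.001.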

Common effect size measures

Context | Measure | Rule of thumb (Cohen)
Two means | Cohen’s d = (μ₁ - μ₂) / s_pooled | 0.2 small, 0.5 medium, 0.8 large
Correlation | Pearson r | 0.1 small, 0.3 medium, 0.5 large
Proportions | Odds ratio, relative risk | OR = 1 is null; OR > 2 often practically meaningful
ANOVA | η² = SS_effect / SS_total | 0.01 small, 0.06 medium, 0.14 large
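
For the two-means case, a minimal sketch of Cohen's d with a pooled standard deviation looks like this. The two samples are made-up toy data, and `cohens_d` is an illustrative helper rather than a library function.

```python
import math
from statistics import mean, variance


def cohens_d(a: list[float], b: list[float]) -> float:
    """Cohen's d: difference in means divided by the pooled standard deviation."""
    n1, n2 = len(a), len(b)
    # Pooled variance weights each sample variance by its degrees of freedom.
    pooled_var = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)


# Toy data (hypothetical): group means differ by 1, each sample variance is 2.5.
group_a = [5.0, 6.0, 7.0, 8.0, 9.0]
group_b = [4.0, 5.0, 6.0, 7.0, 8.0]
print(round(cohens_d(group_a, group_b), 3))
```

Unlike a p-value, d does not shrink or grow systematically with sample size. More data only makes the estimate more precise.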

Worked example

An A/B test on a checkout button colour runs for 30 days with n = 45 000 000 users per variant. Results:

  • Control conversion rate: 3.210%
  • Variant conversion rate: 3.221%
  • Absolute difference: 0.011 percentage points
  • p-value: 0.003 (highly significant)
  • Relative lift: 0.34%

The result is statistically significant but whether 0.011 pp is practically meaningful depends on revenue impact, engineering cost, and strategic priorities — not on the p-value.
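
The numbers above can be checked with a pooled two-proportion z-test. The sample size in this sketch is an assumption: with these conversion rates, a two-sided p-value near 0.003 needs roughly 45 million users per variant. `two_proportion_z_test` is an illustrative helper, not a library call.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(p1: float, n1: int, p2: float, n2: int) -> tuple[float, float]:
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    z = (p2 - p1) / se
    return z, 2.0 * (1.0 - NormalDist().cdf(abs(z)))


N_PER_VARIANT = 45_000_000  # assumed sample size per variant
CONTROL, VARIANT = 0.03210, 0.03221

z, p = two_proportion_z_test(CONTROL, N_PER_VARIANT, VARIANT, N_PER_VARIANT)
relative_lift = (VARIANT - CONTROL) / CONTROL

print(f"z = {z:.2f}, p = {p:.4f}")
print(f"absolute difference = {(VARIANT - CONTROL) * 100:.3f} pp")
print(f"relative lift = {relative_lift:.2%}")
```

The test confirms the difference is unlikely to be zero. It says nothing about whether 0.011 pp is worth shipping.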

Best practices

  1. Always report the effect size and its confidence interval alongside the p-value.
  2. Pre-specify the minimum detectable effect (MDE) before running the test based on business significance, not statistical power alone.
  3. For large-n settings, state the practical significance threshold explicitly and check whether the confidence interval lies entirely above it (a meaningful effect) or entirely below it (a negligible one).
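
One way to put the third practice into code is to build a confidence interval for the difference and compare it with a pre-specified threshold. In this sketch the threshold of 0.05 percentage points and the sample size of 45 million per variant are assumptions, and both helper functions are hypothetical names.

```python
import math
from statistics import NormalDist


def diff_confidence_interval(p1: float, n1: int, p2: float, n2: int,
                             level: float = 0.95) -> tuple[float, float]:
    """Wald confidence interval for p2 - p1, using the unpooled standard error."""
    se = math.sqrt(p1 * (1.0 - p1) / n1 + p2 * (1.0 - p2) / n2)
    z_crit = NormalDist().inv_cdf(0.5 + level / 2.0)
    diff = p2 - p1
    return diff - z_crit * se, diff + z_crit * se


def practical_verdict(ci: tuple[float, float], threshold: float) -> str:
    """Compare a CI with a positive practical-significance threshold."""
    lo, hi = ci
    if lo >= threshold:
        return "practically meaningful"
    if hi < threshold:
        return "below the practical threshold"
    return "inconclusive"


N = 45_000_000               # assumed users per variant
PRACTICAL_THRESHOLD = 0.0005  # assumed minimum worthwhile lift: 0.05 pp

ci = diff_confidence_interval(0.03210, N, 0.03221, N)
print(f"95% CI: {ci[0] * 100:.4f} pp to {ci[1] * 100:.4f} pp")
print(practical_verdict(ci, PRACTICAL_THRESHOLD))
```

Here the interval excludes zero, so the result is statistically significant, but it also sits entirely below the assumed threshold, which is the "significant but practically meaningless" case in one line of output.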