What is the difference between a p-value and an effect size, and why can a result be statistically significant but practically meaningless?
A p-value measures the probability of seeing data at least as extreme as observed under the null hypothesis — it quantifies evidence against the null, not the magnitude of an effect. Effect size quantifies how large the difference or relationship is in meaningful units. With a large enough sample, a trivially small effect can produce an arbitrarily small p-value.
How to think about it
Statistical significance and practical significance are independent. Reporting only p-values without effect sizes is a primary driver of the replication crisis in empirical science and a common mistake in A/B test reporting.
What a p-value actually is
Given null hypothesis H₀ and observed test statistic T:
p = P(|T| ≥ |t_observed| | H₀ true)
A p-value of 0.02 means: if H₀ were true, there is a 2% chance of observing a result this extreme or more extreme by chance alone. It does not mean:
- There is a 2% probability that H₀ is true.
- The effect is large or important.
- The result will replicate.
Why large n makes p-values misleading
With n = 1 000 000, the standard error of a mean difference is very small. A difference of $0.01 in average order value between two product page variants can achieve p < 0.001 even if the business cares only about differences above $2.00.
Common effect size measures
| Context | Measure | Rule of thumb (Cohen) |
|---|---|---|
| Two means | Cohen’s d = (μ₁ - μ₂) / sᴅ | 0.2 small, 0.5 medium, 0.8 large |
| Correlation | Pearson r | 0.1 small, 0.3 medium, 0.5 large |
| Proportions | Odds ratio, relative risk | OR = 1 is null; OR > 2 often practically meaningful |
| ANOVA | η² = SSᴇᴏᴈᴊᴀᴋ / SSᴛᴂᴛᴀᴍ | 0.01, 0.06, 0.14 small/medium/large |
Worked example
An A/B test on a checkout button colour runs for 30 days with n = 500 000 users per variant. Results:
- Control conversion rate: 3.210%
- Variant conversion rate: 3.221%
- Absolute difference: 0.011 percentage points
- p-value: 0.003 (highly significant)
- Relative lift: 0.34%
The result is statistically significant but whether 0.011 pp is practically meaningful depends on revenue impact, engineering cost, and strategic priorities — not on the p-value.
Best practices
- Always report the effect size and its confidence interval alongside the p-value.
- Pre-specify the minimum detectable effect (MDE) before running the test based on business significance, not statistical power alone.
- For large-n settings, consider reporting the practical significance threshold explicitly and testing whether the CI excludes it.