Statistics & Probability · Medium · Asked at Google, Meta, Netflix, Airbnb

What is the difference between a p-value and an effect size, and why can a result be statistically significant but practically meaningless?

For: Data Scientist, Data Analyst, ML Engineer, AI / LLM Engineer

The short answer

A p-value measures the probability of seeing data at least as extreme as observed under the null hypothesis — it quantifies evidence against the null, not the magnitude of an effect. Effect size quantifies how large the difference or relationship is in meaningful units. With a large enough sample, a trivially small effect can produce an arbitrarily small p-value.

How to think about it

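Statistical significance and practical significance answer different questions: the first asks whether an effect is distinguishable from noise, the second whether it is large enough to matter. Reporting p-values without effect sizes is widely cited as a contributor to the replication crisis in empirical science and is a common mistake in A/B test reporting.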

What a p-value actually is

Given null hypothesis H₀ and observed test statistic T:

p = P(|T| ≥ |t_observed| | H₀ true)

A p-value of 0.02 means: if H₀ were true, there is a 2% chance of observing a result this extreme or more extreme by chance alone. It does not mean:

There is a 2% probability that H₀ is true.
The effect is large or important.
The result will replicate.
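
To make the definition concrete, here is a minimal sketch of the two-sided p-value computation. It assumes the test statistic is approximately standard normal under H₀ (a z-test); the function name is illustrative.

```python
import math

def two_sided_p_value(z: float) -> float:
    """Two-sided p-value for a standard-normal test statistic z.

    P(|Z| >= |z|) = 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    """
    return math.erfc(abs(z) / math.sqrt(2))

# A statistic about 2.33 standard errors from zero gives p close to 0.02.
print(round(two_sided_p_value(2.326), 3))
```

The function takes only the standardized statistic as input. It carries no information about how large the underlying effect is in real units.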

Why large n makes p-values misleading

With n = 1 000 000, the standard error of a mean difference is very small, because it shrinks in proportion to 1/√n. If per-user variance is low enough, a difference of $0.01 in average order value between two product page variants can achieve p < 0.001, even if the business cares only about differences above $2.00.
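
The dependence on n can be sketched numerically. The example below assumes a z-test with equal group sizes, treats n as the per-group sample size, and uses a hypothetical per-user standard deviation of $2.00 (the source gives no variance). With a larger standard deviation, more data would be needed to reach the same p-value.

```python
import math

def p_value_for_mean_difference(diff: float, sd: float, n_per_group: int) -> float:
    """Two-sided z-test p-value for a difference in means (equal n and SD per group)."""
    se = sd * math.sqrt(2.0 / n_per_group)  # standard error of the difference
    z = diff / se
    return math.erfc(abs(z) / math.sqrt(2))

DIFF = 0.01        # observed difference in average order value, dollars
ASSUMED_SD = 2.0   # hypothetical per-user standard deviation, dollars

# The effect stays fixed at $0.01; only the sample size changes.
for n in (1_000, 100_000, 1_000_000):
    p = p_value_for_mean_difference(DIFF, ASSUMED_SD, n)
    print(f"n per group = {n:>9,}  p = {p:.4g}")
```

Under these assumptions the same $0.01 difference is nowhere near significant at n = 1 000 and falls below p = 0.001 at n = 1 000 000.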

Common effect size measures

Context	Measure	Rule of thumb (Cohen)
Two means	Cohen’s d = (μ₁ - μ₂) / s_pooled	0.2 small, 0.5 medium, 0.8 large
Correlation	Pearson r	0.1 small, 0.3 medium, 0.5 large
Proportions	Odds ratio, relative risk	OR = 1 is null; OR > 2 often practically meaningful
ANOVA	η² = SS_between / SS_total	0.01 small, 0.06 medium, 0.14 large
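
Two of these measures can be computed with the standard library alone. This is a minimal sketch: it assumes Cohen's d is standardized by the pooled sample standard deviation, and the function names and toy data are illustrative.

```python
import math
import statistics

def cohens_d(a, b):
    """Cohen's d for two independent samples, using the pooled standard deviation."""
    n1, n2 = len(a), len(b)
    pooled_var = (
        (n1 - 1) * statistics.variance(a) + (n2 - 1) * statistics.variance(b)
    ) / (n1 + n2 - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled_var)

def odds_ratio(events_a, total_a, events_b, total_b):
    """Odds ratio of group A relative to group B from event counts and group sizes."""
    odds_a = events_a / (total_a - events_a)
    odds_b = events_b / (total_b - events_b)
    return odds_a / odds_b

# Toy data, chosen only so the arithmetic is easy to follow by hand.
treatment = [2.0, 4.0, 6.0]
control = [1.0, 3.0, 5.0]
print(cohens_d(treatment, control))   # means differ by 1, pooled SD is 2
print(odds_ratio(30, 100, 20, 100))   # (30/70) / (20/80)
```

Neither function takes the sample size as a scaling factor, so these values do not shrink or grow just because more data is collected.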

Worked example

An A/B test on a checkout button colour runs for 30 days with n = 45 000 000 users per variant (a difference this small needs a sample of roughly this size to reach p ≈ 0.003). Results:

Control conversion rate: 3.210%
Variant conversion rate: 3.221%
Absolute difference: 0.011 percentage points
p-value: 0.003 (highly significant)
Relative lift: 0.34%

The result is statistically significant, but whether a difference of 0.011 percentage points is practically meaningful depends on revenue impact, engineering cost, and strategic priorities, not on the p-value.
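
The arithmetic behind an example like this can be checked with a pooled two-proportion z-test. The sketch below assumes equal group sizes and a normal approximation. The function name and the list of sample sizes are illustrative, chosen to show how the same conversion rates behave as n grows.

```python
import math

def two_proportion_z_test(p_control, p_variant, n_per_variant):
    """Pooled two-proportion z-test with equal group sizes.

    Returns (z, two-sided p-value).
    """
    # With equal n, the pooled rate is the simple average of the two rates.
    pooled = (p_control + p_variant) / 2
    se = math.sqrt(pooled * (1 - pooled) * 2 / n_per_variant)
    z = (p_variant - p_control) / se
    return z, math.erfc(abs(z) / math.sqrt(2))

P_CONTROL, P_VARIANT = 0.03210, 0.03221

print(f"relative lift = {(P_VARIANT - P_CONTROL) / P_CONTROL:.2%}")
for n in (500_000, 5_000_000, 45_000_000):
    z, p = two_proportion_z_test(P_CONTROL, P_VARIANT, n)
    print(f"n per variant = {n:>10,}  z = {z:.2f}  p = {p:.3f}")
```

The relative lift is identical in every row. Only the p-value moves with the sample size, so the p-value alone cannot say whether the lift is worth shipping.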

Best practices

Always report the effect size and its confidence interval alongside the p-value.
Pre-specify the minimum detectable effect (MDE) before running the test based on business significance, not statistical power alone.
For large-n settings, consider stating the practical significance threshold explicitly and checking whether the confidence interval for the effect lies entirely above or entirely below it.
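
The last practice can be sketched as a comparison between a 95% confidence interval and a pre-specified threshold. This is one possible decision rule, not a standard API: the function name, the verdict strings, and the example inputs (a $0.01 difference, an assumed per-user standard deviation of $2.00, n = 1 000 000 per group, and a $2.00 threshold) are illustrative assumptions. It handles only effects where larger is better.

```python
import math

Z_95 = 1.959964  # normal quantile for a two-sided 95% interval

def compare_ci_to_threshold(diff, se, practical_threshold):
    """Return (lower, upper, verdict) for a 95% CI on a positive-direction effect."""
    lower = diff - Z_95 * se
    upper = diff + Z_95 * se
    if lower >= practical_threshold:
        verdict = "exceeds the practical threshold"
    elif upper < practical_threshold:
        # The whole interval sits below what the business cares about.
        if lower > 0:
            verdict = "statistically significant but below the practical threshold"
        else:
            verdict = "below the practical threshold"
    else:
        verdict = "inconclusive: interval contains the practical threshold"
    return lower, upper, verdict

# Hypothetical inputs: assumed SD of $2.00 and 1 000 000 users per group.
se = 2.0 * math.sqrt(2 / 1_000_000)
lower, upper, verdict = compare_ci_to_threshold(0.01, se, 2.00)
print(f"95% CI = (${lower:.4f}, ${upper:.4f}): {verdict}")
```

Reporting the interval against the threshold turns "p < 0.001" into a statement a decision-maker can act on: the effect is real, but it is far smaller than the amount that would justify a change.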
