p-values are not the probability you are right

In 2014 a team at a large e-commerce company shipped a redesigned checkout button. Their A/B test hit p = 0.03 after four days. They shipped. Conversion dropped 2.1 percent over the following quarter. The number that felt like certainty was, in a meaningful sense, noise wearing a suit.

The p-value is the most consequential number in applied analytics that almost nobody defines correctly. Not because it is obscure — it appears on every experiment dashboard, in every data science interview, in regulatory filings across pharma, finance, and tech. Because its plain-English meaning and its mathematical definition are nearly opposite, and the distance between them is exactly where bad decisions live.

What it actually says

Here is the definition, stated once without softening: a p-value is the probability of observing data at least as extreme as what you collected, assuming the null hypothesis is true.

The null hypothesis is the boring one — no effect, no difference, nothing is happening. The p-value does not tell you the probability the null is true. It tells you how weird your data would be in a world where the null were true.

A p-value of 0.04 means: if there really were no difference between A and B, you would see an outcome at least this extreme about 4 percent of the time just from random sampling variation. That is it. The number lives entirely inside the null world. It says nothing about what happens outside it.
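To make the definition concrete, here is a minimal sketch of a two-sided, two-proportion z-test, one common way an A/B dashboard might arrive at its p-value. The function name and the conversion counts are hypothetical, and the normal approximation is an assumption that only holds for reasonably large samples.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for H0: both arms share one conversion rate."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Under the null, both arms are draws from a single pooled rate.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Probability, inside the null world, of a |z| at least this large.
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical counts: 500/10,000 conversions in A, 565/10,000 in B.
p = two_proportion_p_value(500, 10_000, 565, 10_000)
print(f"p = {p:.3f}")
```

With these made-up counts the result lands at roughly 0.04. Notice that nothing in the calculation refers to how plausible the change was before the test ran; every quantity is computed under the assumption that the null is true.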

This is not a pedantic distinction. It changes every decision downstream.

The inversion people make

Most analysts read p = 0.04 and think: “there is a 96 percent chance this effect is real.” That thought is an inversion of the conditional. Mathematically, the p-value is P(data at least this extreme | null is true). What people want is P(null is true | data): how likely the null is, given what we saw. These two quantities are not equal, and in practice they can be far apart.

To get from one to the other you need Bayes’ theorem (a rule for updating probabilities when you see new evidence). You need a prior — your honest estimate of how likely the effect was to exist before you ran the test. And priors are not free. In most online experiments, a randomly chosen change to a product has maybe a 10 to 30 percent chance of producing a real lift. That is not pessimism; it is the empirical base rate from companies that track it.

Walk through the arithmetic. Say your team runs 100 experiments. Twenty of them have a real effect (a 20 percent prior). Assume each test has 80 percent power (the test’s sensitivity, its chance of catching a true effect) and a significance threshold of 0.05, which flags 5 percent of null experiments as positive (that is exactly what alpha = 0.05 means). So out of 100 experiments you get: 16 true positives, 4 false positives (5 percent of the 80 null experiments), and 4 missed true effects. Of your 20 significant results, 4 are false. That is a 20 percent false discovery rate (the share of declared winners that are not real), even with p less than 0.05 on all of them.
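The same bookkeeping fits in a few lines of code. This is a sketch of the expected-value arithmetic only, with a hypothetical function name; the prior, power, and alpha are inputs you have to supply, because the test itself cannot tell you any of them.

```python
def share_of_false_wins(prior, power=0.80, alpha=0.05, n_experiments=100):
    """Expected split of significant results into true and false positives."""
    real = n_experiments * prior    # experiments with a real effect
    null = n_experiments - real     # experiments with no effect
    true_pos = real * power         # real effects the test catches
    false_pos = null * alpha        # null experiments flagged anyway
    return true_pos, false_pos, false_pos / (true_pos + false_pos)


true_pos, false_pos, false_share = share_of_false_wins(prior=0.20)
print(f"{true_pos:.1f} true, {false_pos:.1f} false, "
      f"{false_share:.0%} of wins are false")
# prints: 16.0 true, 4.0 false, 20% of wins are false
```

Lowering `prior` while leaving `alpha` untouched raises the false share, which is the whole point: the threshold on the dashboard does not move, but the reliability of a "win" does.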

Drop the prior to 10 percent and run the same math. You get 8 true positives (80 percent of 10 real effects) and 4.5 false positives (5 percent of 90 nulls). Now 36 percent of your wins are wrong. The p-value never changed. The dashboard always said green.

The p-value lives entirely inside the null-hypothesis world. The number people want requires a prior probability that the test never touches.

The base-rate trap in practice

The base-rate trap (the phenomenon where testing hypotheses with a low prior probability produces a large share of false positives, even at small p-values) is not a theoretical curiosity. It is the dominant failure mode in growth analytics, clinical trial replication, and financial factor research.

In pharmaceutical trials it is called the “winner’s curse”: promising early trials with small samples and low-prior hypotheses look spectacular, then fail to replicate at scale. In growth teams it shows up as features that “win” in testing and underperform in production. In quantitative finance it is why most published anomalies (price patterns that supposedly beat the market) decay or disappear entirely once they are known.

The fix is not to distrust experiments. It is to reason about your prior before reading the result. A company testing a button color change has a low prior that color alone drives revenue. A company testing a redesigned pricing page that removes friction has a much higher prior. Same p = 0.04, different interpretations, different actions.

Bayesian A/B testing frameworks make this explicit by requiring you to state a prior distribution over effect sizes and updating it with observed data. The result is not a binary yes/no but a posterior distribution — your updated belief about how large the effect probably is. This is harder to sell to stakeholders who want a green checkmark, which is why frequentist p-values persist. But knowing the limitation lets you compensate: replicate before shipping, require pre-registration of hypotheses (committing to the hypothesis before seeing data, preventing post-hoc rationalization), and treat p-values as a first filter, not a verdict.
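As a rough illustration of what "updating a prior with observed data" looks like, here is a minimal Beta-Binomial sketch. It is a simplification, not what any particular framework does: it places an independent Beta prior on each arm's conversion rate rather than a prior over effect sizes, and the function name, counts, and default Beta(1, 1) prior are all hypothetical.

```python
import random


def prob_b_beats_a(conv_a, n_a, conv_b, n_b,
                   prior_alpha=1.0, prior_beta=1.0,
                   draws=20_000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A | data) under Beta priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for each arm: Beta(prior_alpha + conversions,
        #                              prior_beta + non-conversions).
        rate_a = rng.betavariate(prior_alpha + conv_a,
                                 prior_beta + n_a - conv_a)
        rate_b = rng.betavariate(prior_alpha + conv_b,
                                 prior_beta + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws


# Hypothetical counts: 500/10,000 conversions in A, 565/10,000 in B.
posterior_prob = prob_b_beats_a(500, 10_000, 565, 10_000)
print(f"P(B beats A) is about {posterior_prob:.2f}")
```

One caution about this sketch: with a flat prior like Beta(1, 1), the answer tends to track one minus the one-sided p-value, so it does not encode any of the base-rate skepticism discussed above. That skepticism only enters when the prior is chosen to reflect how rarely similar changes have worked.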

Statistical significance is not practical significance

Even when a p-value is meaningful, with a decent prior and a clean test, statistical significance (whether the p-value falls below the chosen threshold) and practical significance (whether the effect is large enough to matter) are separate axes entirely.

With a large enough sample you can detect effects that are statistically real and operationally irrelevant. Run a search-ranking experiment on a hundred-million-query corpus and you can confidently detect a change in click-through rate of a few hundredths of a percentage point. The p-value can be vanishingly small. The business impact will be indistinguishable from noise against quarterly revenue. Publishing that as a win is technically accurate and managerially useless.

The relevant question is not “is the effect nonzero?” but “is the effect large enough to justify shipping this code, absorbing this complexity, and accepting this maintenance burden?” That question cannot be answered by a p-value alone. It requires an effect size estimate (the magnitude of the difference, not just its direction) and a minimum detectable effect (the smallest effect the test is powered to detect, which should be set before the test runs to the smallest effect your team would actually care about).

Industry practice at companies with mature experimentation platforms (the Netflixes, Airbnbs, and Booking.coms of the world) is to power experiments for a minimum detectable effect defined by the product team, not the statistician. If only a conversion lift of 2 percent or more would change a product decision, the experiment is designed to have power against 2 percent, and a 0.5 percent lift is treated as not actionable regardless of its p-value.
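Powering a test for a chosen minimum detectable effect comes down to a sample-size calculation. The sketch below uses the standard normal-approximation formula for a two-sided two-proportion test. The function name and the 5 percent baseline rate are hypothetical, and the lifts are read here as relative lifts, which is an assumption rather than something stated above.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(baseline, relative_lift, alpha=0.05, power=0.80):
    """Approximate users per arm for a two-sided two-proportion z-test."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 at alpha 0.05
    z_power = NormalDist().inv_cdf(power)           # about 0.84 at 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


# Hypothetical 5% baseline conversion rate.
for lift in (0.02, 0.005):
    n = sample_size_per_arm(0.05, lift)
    print(f"{lift:.1%} relative lift: {n:,} users per arm")
```

Because the required sample grows with the inverse square of the effect, shrinking the target from 2 percent to 0.5 percent costs roughly sixteen times as many users. That is why the threshold has to come from the product decision: chasing effects nobody would act on is expensive.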

Peeking destroys the guarantee

Here is a failure mode that is technically subtle but operationally ubiquitous: you start an experiment, you check results on day two, the p-value is 0.06, you wait, on day four it drops to 0.04, you call it. This is called peeking, and it silently destroys the statistical guarantee you thought you had.

The mathematics of a fixed-horizon frequentist test assumes the sample size is fixed in advance and the test is run once. When you check early and stop if the number looks good, you are running multiple tests on the same accumulating data. Each check is a chance to cross the threshold by luck. If you check ten times at alpha = 0.05, your actual false-positive rate climbs to around 19 percent, nearly four times what the dashboard implies. Check twenty times and you are at roughly 25 percent, and the rate keeps climbing with every additional look.
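You can check the inflation yourself with a short simulation. This sketch assumes equally spaced looks and a z-test with known variance, and it simulates only experiments where the null is true, so every "significant" result is a false alarm. The function name and defaults are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist


def peeking_false_positive_rate(looks, alpha=0.05,
                                simulations=20_000, seed=0):
    """Share of null experiments declared significant at any of `looks` checks."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    false_alarms = 0
    for _ in range(simulations):
        running_sum = 0.0
        for look in range(1, looks + 1):
            # Each equal-sized batch adds one standardized increment;
            # with no real effect these are independent N(0, 1) draws.
            running_sum += rng.gauss(0.0, 1.0)
            z = running_sum / sqrt(look)   # z-statistic on all data so far
            if abs(z) > z_crit:
                false_alarms += 1          # stop at the first "significant" look
                break
    return false_alarms / simulations


for looks in (1, 10):
    rate = peeking_false_positive_rate(looks)
    print(f"{looks:>2} look(s): false-positive rate about {rate:.3f}")
```

A single look should come out near the nominal 5 percent, and ten looks near the 19 percent quoted above. The looks are not independent tests, since each one reuses all earlier data, which is why the inflation is smaller than naive multiplication would suggest but still large.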

Peeking is almost universal in practice because dashboards exist and humans look at them. The solutions are structural. Sequential testing methods (tests designed to allow continuous monitoring without inflating error rates, such as SPRT and its relatives) let you look whenever you want, typically at the cost of a larger maximum sample size. Fixed-horizon testing with a blackout period (literally hiding the result until the pre-specified sample size is reached) works if you can enforce it culturally. Bayesian testing reframes the problem because it does not make a binary claim at a fixed threshold: you can look at any time and read the posterior, which is a state of belief rather than a hypothesis test. The caveat is that stopping the moment the posterior crosses a decision threshold is still optional stopping, and it can still inflate the rate of false wins.
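For a sense of how a sequential test builds the stopping rule into the math, here is a minimal sketch of Wald's SPRT in its simplest form: a single stream of 0/1 outcomes and two fixed candidate rates. This is a deliberately simplified illustration. A real A/B test would need a two-sample variant, and the function name, rates, and simulated stream are hypothetical.

```python
import random
from math import log


def sprt_bernoulli(outcomes, p0, p1, alpha=0.05, beta=0.20):
    """Wald's SPRT for H0: rate = p0 versus H1: rate = p1 on a 0/1 stream.

    Returns (decision, observations_used).
    """
    upper = log((1 - beta) / alpha)    # crossing this boundary accepts H1
    lower = log(beta / (1 - alpha))    # crossing this boundary accepts H0
    llr = 0.0                          # running log-likelihood ratio
    used = 0
    for used, converted in enumerate(outcomes, start=1):
        if converted:
            llr += log(p1 / p0)
        else:
            llr += log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "accept H1", used
        if llr <= lower:
            return "accept H0", used
    return "continue", used            # data ran out before a boundary was hit


# Hypothetical stream whose true conversion rate is 6 percent.
rng = random.Random(0)
stream = (rng.random() < 0.06 for _ in range(200_000))
print(sprt_bernoulli(stream, p0=0.05, p1=0.06))
```

The design choice that matters is that the boundaries are fixed before any data arrives and are derived from the error rates you are willing to accept. Checking after every observation is then part of the procedure rather than a violation of it.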

A p-value in a live experiment is a random walk. Stopping when it crosses 0.05 is not the same as running a single clean test — it is multiple tests on growing data.

The number is not corrupt — the interpretation is

None of this means p-values are useless. They are excellent at one specific task: quantifying how incompatible your data is with the null hypothesis, measured on a scale everyone has agreed to use. That is genuinely valuable. The problem is not the p-value; the problem is the cargo-cult interpretation that treats a number designed to control long-run error rates as a statement about the probability of any individual claim being true.

The American Statistical Association published a statement on this in 2016. The replication crisis in psychology and social science forced the conversation into the open. In 2019 an editorial in the ASA’s journal, The American Statistician, followed up by recommending that the language of “statistically significant” be retired entirely, replaced by effect sizes, confidence intervals, and explicit reasoning about prior plausibility. That shift has been slow in industry (dashboards were built around the green checkmark, and product teams are not reading ASA statements), but the underlying understanding is spreading among practitioners who have been burned.

The signal of a mature data organization is not the sophistication of its models. It is whether analysts instinctively ask “what was the prior?” before reading an experiment result, whether minimum detectable effects are defined before launch, and whether sequential testing is the default rather than the exception.

What to actually do

Do not stop reporting p-values; your stakeholders expect them and they carry real information. Do stop treating them as the answer. Before you run an experiment, state the minimum effect that would actually change a product decision; if the test cannot detect that effect at 80 percent power, do not run it at the current sample size. Form a prior. If only about 1 in 10 similar ideas has held up in the past, a p = 0.04 result still deserves skepticism, not celebration.

Read confidence intervals (the range of effect sizes consistent with your data at a given level of confidence) as carefully as you read the p-value. A significant result with a confidence interval that spans +0.1 percent to +8 percent conversion is a very different decision from one that spans +1.9 percent to +2.1 percent. The p-value hides that difference entirely.
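Computing the interval takes about as much effort as computing the p-value. Here is a minimal sketch using a Wald (normal-approximation) interval for the absolute difference in conversion rate. The function name and counts are hypothetical, and other interval methods exist that behave better with small samples or rates near zero.

```python
from math import sqrt
from statistics import NormalDist


def lift_confidence_interval(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Wald interval for the absolute difference in conversion rate (B - A)."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Unpooled standard error: the interval does not assume the null is true.
    std_err = sqrt(rate_a * (1 - rate_a) / n_a
                   + rate_b * (1 - rate_b) / n_b)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    diff = rate_b - rate_a
    return diff - z * std_err, diff + z * std_err


# Hypothetical counts: 500/10,000 conversions in A, 565/10,000 in B.
low, high = lift_confidence_interval(500, 10_000, 565, 10_000)
print(f"95% CI for the lift: {low:+.2%} to {high:+.2%} (absolute)")
```

With these made-up counts the interval excludes zero, so the result is "significant," yet its lower end sits barely above zero while its upper end is well over a full percentage point. Reporting only the p-value would throw that width away.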

And when your dashboard tempts you to peek — when the result looks close and shipping day is next Tuesday — remember that the guarantee you thought you bought expires the moment you look early and act on what you see.

The p-value is a narrow, precisely defined tool. It survived decades of misuse because it is easy to compute, easy to report, and produces a number small enough to feel decisive. Its actual meaning is stranger and less comfortable: it tells you how often a world with no real effect would trick you, not how often you are right. Keeping those two things separate is the whole job.