Correlation isn't causation — but here's what it actually is

In 1973, Francis Anscombe constructed four datasets. Each had eleven points. Each had an r of 0.816, a slope of 0.5, and an intercept of 3.0. He published them to make one point: the summary statistic you trust so completely was designed for exactly one shape of relationship, and the world does not care about your design constraints.

That warning is half a century old. We still talk about correlation as though it were a dial that goes from “no relationship” to “strong relationship.” It isn’t. It is a dial that goes from “perfect negative linear relationship” to “perfect positive linear relationship,” with a specific, loaded meaning for every stop in between.

Getting that definition right changes how you read every scatter plot, every coefficient table, and every claim that begins with “studies show a strong correlation between…”

What the number actually encodes

Pearson’s r, the standard correlation coefficient you encounter unless told otherwise, has a formula built from z-scores. Take each observation of X, subtract the mean of X, and divide by the standard deviation of X. Do the same for Y. Multiply the two z-scores together for each point. Average across all points: divide the sum by n if you used the population standard deviation, or by n - 1 if you used the sample standard deviation. That average is r.

The division by standard deviation is the key move. It strips away units. It does not matter whether X is measured in dollars or kilograms or milliseconds. The z-scores live on a universal scale, and their product captures only the direction and tightness of co-movement. When X and Y both tend to be above their own means at the same time, those products are positive and r is positive. When one tends to be high while the other is low, r is negative. When the two are independent in the linear sense, the positive and negative products cancel and r lands near zero.
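The z-score recipe above can be sketched in a few lines of Python. This is a minimal illustration, not a library implementation: the function name pearson_r and the hours/score numbers are made up for the example, and it uses the population convention (divide by n throughout).

```python
import math

def pearson_r(xs, ys):
    """Pearson's r as the average product of z-scores (population convention)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Population standard deviations: divide by n, matching the plain average below.
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    z_x = [(x - mean_x) / sd_x for x in xs]
    z_y = [(y - mean_y) / sd_y for y in ys]
    return sum(a * b for a, b in zip(z_x, z_y)) / n

# Hypothetical measurements, chosen only for illustration.
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
score = [52.0, 55.0, 61.0, 60.0, 68.0]

r = pearson_r(hours, score)
# Changing units (hours -> minutes) rescales X but leaves r unchanged.
r_minutes = pearson_r([h * 60 for h in hours], score)
print(round(r, 3), round(r_minutes, 3))
```

The two printed values agree because the division by standard deviation cancels the change of units.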

The formula forces r into [-1, 1]. An r of 1 means the data lies exactly on an upward-sloping line. An r of -1 means exactly a downward-sloping line. An r of 0 means the linear component of the relationship is zero — which is not the same thing as “no relationship.”

That last sentence deserves its own paragraph.

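Take a dataset where Y = X squared and the X values are spread symmetrically around zero. Its r is 0. The relationship is perfect, deterministic, and completely invisible to r. Every point on the right arm of the parabola has a mirror image on the left arm with the same Y and the opposite X, so their z-score products are equal and opposite and annihilate each other in pairs.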
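A quick check of the parabola claim, as a sketch on made-up integer X values. It also computes r on the right arm alone, to show that the cancellation depends on X being centered at zero.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r from sums of centered products."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# X symmetric about zero, Y fully determined by X.
xs = list(range(-5, 6))
ys = [x * x for x in xs]
r_full = pearson_r(xs, ys)

# Keep only the right arm: the same rule Y = X^2 now looks almost linear to r.
xs_right = [x for x in xs if x >= 0]
ys_right = [x * x for x in xs_right]
r_right = pearson_r(xs_right, ys_right)

print(r_full, round(r_right, 3))
```

The rule linking Y to X is identical in both cases. Only the range of X changed, and r moved from zero to nearly one.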

This is not a pathological edge case. It is the natural behavior of a statistic that was engineered for lines. The moment a relationship curves, r loses resolution. The moment a relationship involves clusters, outliers, or any structure that is not a single elliptical cloud, r misreports.

Anscombe’s four datasets are the canonical demonstration. The first is the well-behaved cloud r was designed for. The second is a smooth parabola that a straight line fits poorly, yet its r is the same 0.816. The third is a perfect line except for one outlier, which pulls r down from 1. The fourth has all points at a single X value except one leverage point that singlehandedly determines the slope and thus the correlation. Same r, four completely different worlds.
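The quartet is small enough to check by hand. The sketch below uses the values as published by Anscombe and computes r, slope, and intercept for each dataset. The helper name summarize is just for this example.

```python
import math

# Anscombe's quartet (1973). Datasets I-III share the same X values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def summarize(xs, ys):
    """Return (r, slope, intercept) for a least-squares line of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    slope = sxy / sxx
    return sxy / math.sqrt(sxx * syy), slope, my - slope * mx

summaries = {name: summarize(xs, ys) for name, (xs, ys) in QUARTET.items()}
for name, (r, slope, intercept) in summaries.items():
    print(f"{name}: r={r:.3f} slope={slope:.3f} intercept={intercept:.2f}")
```

The summaries agree to about two decimal places. Telling the datasets apart requires plotting them.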

The practical implication: r is a screening tool, not a verdict. A high r tells you there is a linear signal worth examining. A low r tells you only that there is no linear signal — not that there is nothing.

Figure: The same r can mean radically different things. The rightmost panel has a strong, deterministic relationship that Pearson’s r cannot see.

Why correlation appears without causation

Suppose you observe that cities with more fire stations have more building fires. r is positive and fairly large. Should you conclude that fire stations cause fires? Of course not. Population size drives both: larger cities need more stations and also have more fires. Population is the confounder — a third variable that causes both X and Y independently, manufacturing a correlation between them that has no direct causal meaning.

Confounders are the main reason correlation diverges from causation. They are not exotic. They are the default. In almost any observational dataset — one where no one assigned treatments randomly — there are background variables that influenced both what was measured and what happened. Those variables braid together signals you wanted to keep separate.
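The fire-station story can be simulated. The sketch below is synthetic by construction: population, stations, and fires are standardized made-up quantities, and stations have no effect on fires at all. It then removes the linear contribution of population from both variables and correlates what is left (a partial correlation). That adjustment works here only because the confounder is measured and its effect is linear, which is an assumption of the simulation, not a general guarantee.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson's r from sums of centered products."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def residuals(ys, xs):
    """What is left of ys after subtracting a least-squares line on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [(y - my) - slope * (x - mx) for x, y in zip(xs, ys)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
n_cities = 2000
# Synthetic, standardized city sizes; population drives both other variables.
population = [rng.gauss(0, 1) for _ in range(n_cities)]
stations = [p + rng.gauss(0, 0.5) for p in population]
fires = [p + rng.gauss(0, 0.5) for p in population]  # no dependence on stations

r_raw = pearson_r(stations, fires)
r_adjusted = pearson_r(residuals(stations, population),
                       residuals(fires, population))
print(round(r_raw, 2), round(r_adjusted, 2))
```

The raw correlation is large even though stations never enter the formula for fires. Once population is accounted for, the remaining correlation is close to zero.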

Reverse causation is the second trap. You see that people who exercise more are happier. Does exercise cause happiness, or does being happier make you more likely to exercise? Both directions are plausible. The correlation cannot tell you which arrow is real.

And then there is coincidence. Tyler Vigen’s website catalogs hundreds of spurious correlations found by mining enough variables against each other: the per-capita consumption of mozzarella in the US tracks doctorates awarded in civil engineering at r above 0.95. No one believes mozzarella trains engineers. But if you test enough pairs, some will line up by chance, especially across time series where both variables share a common trend.
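The "test enough pairs" effect is easy to reproduce. The sketch below generates 100 unrelated series of pure noise, each ten points long (roughly the length of a decade of annual data), and searches all pairs for the largest absolute r. The sizes and the seed are arbitrary choices for illustration. The series have no trend, so this shows only the multiple-comparisons effect. A shared trend would inflate the correlations further.

```python
import itertools
import math
import random

def pearson_r(xs, ys):
    """Pearson's r from sums of centered products."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(1)  # fixed seed so the sketch is repeatable
n_series, n_years = 100, 10
# Every series is independent noise: no pair is related by construction.
series = [[rng.gauss(0, 1) for _ in range(n_years)] for _ in range(n_series)]

n_pairs = n_series * (n_series - 1) // 2
best = max(abs(pearson_r(a, b)) for a, b in itertools.combinations(series, 2))
print(n_pairs, round(best, 3))
```

With thousands of pairs and only ten points per series, the best match looks impressive despite meaning nothing.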

These three mechanisms — confounding, reverse causation, chance — can each produce a high r without any causal signal. This is why the slogan exists. What the slogan never explains is what you would need to move from correlation to a credible causal claim.

The causal ladder

Judea Pearl, the computer scientist whose work underpins much of modern causal inference, describes a hierarchy of questions. At the bottom: “What is the correlation between A and B?” At the middle: “What would happen to B if I intervened and changed A?” At the top: “What would have happened to B if A had been different, in this specific case?” The bottom rung can be answered from data alone. Each rung above it requires more than data; it requires assumptions about the structure of the world.

The cleanest path to the middle rung (intervention) is a randomized controlled experiment. Assign the treatment (A) randomly. Randomization breaks the link between A and every confounder, observed or not, because a coin flip cannot be influenced by anything about the subject. Now, apart from chance imbalances that shrink as the sample grows, A and B are associated only if A actually influences B. The correlation you observe maps to causation because you designed out the alternatives.
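A small simulation makes the contrast concrete. In this sketch the true effect of treatment A on outcome B is set to 1.0, and an unobserved confounder u pushes B up. In the observational arm u also makes treatment more likely. In the randomized arm a coin flip decides. All names and numbers (simulate, true_effect, the coefficient 2.0) are assumptions of the illustration.

```python
import random

def mean(values):
    return sum(values) / len(values)

def simulate(randomized, n=10000, true_effect=1.0, seed=0):
    """Difference in mean outcome B between treated and untreated units."""
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n):
        u = rng.gauss(0, 1)  # unobserved confounder
        if randomized:
            a = rng.random() < 0.5          # coin flip, unrelated to u
        else:
            a = u + rng.gauss(0, 1) > 0     # high-u units self-select into treatment
        b = true_effect * a + 2.0 * u + rng.gauss(0, 1)
        (treated if a else control).append(b)
    return mean(treated) - mean(control)

obs_diff = simulate(randomized=False)
rct_diff = simulate(randomized=True)
print(round(obs_diff, 2), round(rct_diff, 2))
```

The observational comparison overstates the effect severalfold because treated units also have higher u. The randomized comparison lands near the true value of 1.0, even though u is never measured.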

When randomization is impossible — you cannot randomly assign people to smoke, or to be born in poverty — you need to do the work of closing off alternative explanations one by one. That is what regression with controls, instrumental variables, difference-in-differences, and regression discontinuity designs are trying to do: argue that after accounting for this set of confounders, in this particular data-generating context, the remaining association is causal. Every such argument is a bet on an assumption. The assumption is usually not testable from the data itself.

This is the honest version of the story. Causal claims are not read off data. They are argued from data, plus assumptions about what the data-generating process looks like.

What a correlation actually tells you

Given all that, what is r good for?

Quite a lot, actually, as long as you respect what it measures. It tells you whether a linear predictive relationship exists between two variables. It tells you how much variance in one variable is explained by a linear model on the other: for a least-squares line with a single predictor, r^2 is that fraction exactly. An r of 0.7 means r^2 = 0.49, so 49 percent of the variance in Y is accounted for by the linear fit on X. The other 51 percent is noise, nonlinearity, or other variables.
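The identity between r^2 and explained variance can be verified numerically. The sketch below fits a least-squares line to a few made-up points, computes the fraction of variance in Y that the line accounts for, and compares it with r squared. The function name and the data are hypothetical.

```python
import math

def r_and_variance_explained(xs, ys):
    """Return (r, fraction of variance in ys explained by a least-squares line on xs)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)  # total variation in Y
    slope = sxy / sxx
    intercept = my - slope * mx
    # Residual variation: what the fitted line fails to account for.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy), 1.0 - ss_res / syy

# Hypothetical data, chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 2.9, 3.2, 4.8, 4.9, 6.5]

r, explained = r_and_variance_explained(xs, ys)
print(round(r ** 2, 6), round(explained, 6))
```

The two numbers match up to floating-point error. This holds for a straight-line fit with one predictor and does not carry over unchanged to other models.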

It is a useful screening statistic. In feature selection for a model, you might look for features whose r against the target exceeds some threshold in absolute value (a strongly negative r is as informative as a strongly positive one), as a quick filter before something more expensive. In quality control, a sudden change in the correlation structure between two process variables can signal that something broke upstream.

It is also a way to formalize intuitions about association. The question “do taller people tend to weigh more?” is a question about correlation. The answer, r around 0.6 in most adult samples, is both correct and immediately interpretable: there is a moderate positive linear tendency, but a lot of variation that height alone does not explain.

The problem is not the statistic. The problem is treating it as a universal tool for a question it was not built to answer.

Figure: Three routes produce correlation without causation. A causal claim requires closing all three doors, ideally by design, not by argument alone.

The real lesson in the slogan

“Correlation is not causation” has become a thought-terminating cliché. People deploy it to dismiss findings they dislike. That is not the right use.

The right use is as a checklist prompt. When you see a reported correlation, ask: what third variable could drive both of these? Which direction of causation is more plausible? How many other pairs were tested before this one was published? Those three questions cover the three failure modes above. If you have good answers — the confounder is implausible or has been measured and controlled for, the causal direction is supported by timing or mechanism, the analysis was pre-registered — then the correlation is doing real epistemic work.

The slogan should make you ask harder questions, not stop asking them.

r is a number from -1 to 1 that measures one specific kind of relationship: the linear co-movement of two variables. It is correct, well-defined, and often useful. It is also one of the most frequently misread numbers in public discourse. The misreading is not that people think it means causation. The misreading is that they think a high r means a strong relationship, when it means only a strong linear relationship — and that they forget what it would actually take to say something caused something else.

The next time you read that X correlates with Y, the right response is not skepticism. It is curiosity: what is the shape of this relationship? What else varies with X? What would you need to see to believe the arrow goes from X to Y? Those are harder questions than the slogan. They are also the only questions worth asking.