What is the difference between correlation and causation, and why does the distinction matter?
Correlation measures the strength of a linear relationship between two variables; it says nothing about why they move together. A shared cause, reverse causation, or coincidence can each produce a correlation even when the first variable has no effect on the second. Treating correlation as causation can lead to interventions that fail or cause harm.
How to think about it
Correlation quantifies co-movement; causation asserts that changing one variable produces a change in another. Conflating the two is one of the most common and consequential errors in applied analysis.
Why correlation is not causation
A Pearson correlation of r = 0.9 between X and Y means they move together strongly. It does not tell you:
- Whether X causes Y
- Whether Y causes X
- Whether a third variable Z drives both (confounding)
- Whether the relationship is a spurious coincidence
Classic example: Ice cream sales and drowning rates are positively correlated, with both peaking in summer. The confounder is hot weather, which drives both. Banning ice cream would not reduce drowning deaths.
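The ice cream example can be sketched as a small simulation. Everything here is an illustrative assumption rather than real data: the variable names, the coefficients, and the noise levels are made up so that temperature drives both series and ice cream has no effect on drownings. The sketch compares the raw correlation with the correlation that remains once the linear effect of temperature is removed from both series (a partial correlation).

```python
import random
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def residuals(ys, zs):
    """Residuals of ys after a least-squares line fit on zs."""
    mz, my = mean(zs), mean(ys)
    slope = (sum((z - mz) * (y - my) for z, y in zip(zs, ys))
             / sum((z - mz) ** 2 for z in zs))
    return [y - my - slope * (z - mz) for z, y in zip(zs, ys)]

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 5000

# Hypothetical data-generating process: temperature is the only shared driver.
temperature = [rng.gauss(20, 8) for _ in range(n)]
ice_cream = [2.0 * t + rng.gauss(0, 8) for t in temperature]
drownings = [0.5 * t + rng.gauss(0, 2) for t in temperature]

raw_r = pearson(ice_cream, drownings)
partial_r = pearson(residuals(ice_cream, temperature),
                    residuals(drownings, temperature))

print(f"raw correlation:               {raw_r:.2f}")
print(f"after removing temperature:    {partial_r:.2f}")
```

Under these assumptions the raw correlation is strong while the partial correlation is close to zero. This only works because the confounder is known and measured; adjusting for temperature says nothing about confounders that were never recorded.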
Three causal structures that all produce correlation
| Structure | Description |
|---|---|
| X → Y | Direct causation |
| Y → X | Reverse causation |
| Z → X and Z → Y | Common cause (confounding) |
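All three can produce the same observed correlation between X and Y. A correlation coefficient alone cannot distinguish them; telling them apart takes an intervention or additional assumptions about how the data were generated.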
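The three structures in the table can be simulated side by side. The sketch below is a toy model under stated assumptions: the structure names, the `simulate` helper, and all coefficients are hypothetical, chosen so that each structure yields a correlation of about 0.8. The optional `do_x` argument sets X from outside the system, which mimics an intervention: X no longer listens to its usual causes, but anything downstream of X still responds.

```python
import random
from statistics import mean

A = 0.8 ** 0.5  # weight on the common cause Z, so corr(X, Y) = A**2 = 0.8
B = 0.2 ** 0.5  # weight on independent noise, keeping unit variance

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def simulate(structure, n, rng, do_x=None):
    """Draw n (x, y) pairs; if do_x is given, X is set by intervention."""
    xs, ys = [], []
    for _ in range(n):
        if structure == "x_causes_y":
            x = rng.gauss(0, 1) if do_x is None else do_x(rng)
            y = 0.8 * x + 0.6 * rng.gauss(0, 1)
        elif structure == "y_causes_x":
            y = rng.gauss(0, 1)
            x = (0.8 * y + 0.6 * rng.gauss(0, 1)) if do_x is None else do_x(rng)
        elif structure == "common_cause":
            z = rng.gauss(0, 1)
            x = (A * z + B * rng.gauss(0, 1)) if do_x is None else do_x(rng)
            y = A * z + B * rng.gauss(0, 1)
        else:
            raise ValueError(f"unknown structure: {structure}")
        xs.append(x)
        ys.append(y)
    return xs, ys

rng = random.Random(1)  # fixed seed so the run is repeatable
observed, intervened = {}, {}
for s in ["x_causes_y", "y_causes_x", "common_cause"]:
    observed[s] = pearson(*simulate(s, 20000, rng))
    # Intervention: X is drawn independently of everything else.
    intervened[s] = pearson(*simulate(s, 20000, rng,
                                      do_x=lambda r: r.gauss(0, 1)))
    print(f"{s:13s} observed r = {observed[s]:.2f}, "
          f"r under do(X) = {intervened[s]:.2f}")
```

In this toy model the observed correlations are nearly identical across the three structures, while setting X by hand preserves the X-Y relationship only when X actually causes Y.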
How to establish causation
- Randomised controlled trial (RCT): randomly assign treatment so that confounders, measured or not, are balanced across groups in expectation.
- Natural experiment: exploit an exogenous shock (lottery, policy change) that mimics randomisation.
- Causal graph + do-calculus: encode assumptions explicitly and identify adjustment sets.
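- Granger causality: a time-series test of whether past X predicts future Y beyond what past Y alone predicts. It shows predictive precedence, not causation: an omitted variable that drives both series can produce the same result.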
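To illustrate the first item in the list above, here is a minimal sketch of why randomisation helps. It is a simulation under made-up assumptions, not a recipe for analysing a real trial: `TRUE_EFFECT`, the confounder `z`, and the `outcome` function are all hypothetical. In the observational arm, units with higher `z` are more likely to take the treatment and also have better outcomes; in the randomised arm, treatment is assigned by a coin flip.

```python
import random
from statistics import mean

TRUE_EFFECT = 2.0  # assumed causal effect of treatment on the outcome

def outcome(treated, z, rng):
    """Hypothetical outcome: treatment effect plus confounder plus noise."""
    return TRUE_EFFECT * treated + 3.0 * z + rng.gauss(0, 1)

def diff_in_means(rows):
    """Mean outcome of treated units minus mean outcome of controls."""
    treated = [y for t, y in rows if t == 1]
    control = [y for t, y in rows if t == 0]
    return mean(treated) - mean(control)

rng = random.Random(42)  # fixed seed so the run is repeatable
observational, randomised = [], []
for _ in range(20000):
    z = rng.gauss(0, 1)  # confounder, e.g. baseline health

    # Observational: uptake depends on the confounder.
    t_obs = 1 if z + rng.gauss(0, 1) > 0 else 0
    observational.append((t_obs, outcome(t_obs, z, rng)))

    # Randomised: a coin flip, independent of the confounder.
    t_rct = 1 if rng.random() < 0.5 else 0
    randomised.append((t_rct, outcome(t_rct, z, rng)))

naive_estimate = diff_in_means(observational)
rct_estimate = diff_in_means(randomised)

print(f"true effect:            {TRUE_EFFECT:.2f}")
print(f"observational estimate: {naive_estimate:.2f}")
print(f"randomised estimate:    {rct_estimate:.2f}")
```

With these assumptions the observational difference in means overstates the effect, because treated units differ from controls in `z` as well as in treatment. The randomised estimate lands near the assumed true effect, since the coin flip makes `z` independent of treatment.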