What is a confounding variable, and how do you control for it?
A confounding variable is associated with both the treatment and the outcome, and can create a spurious relationship between them or distort a real one. Controlling for confounders (through randomisation, stratification, regression adjustment, or matching) is essential to recover a valid causal estimate.
How to think about it
A confounder sits on a back-door path between treatment and outcome. Ignoring it biases the estimated effect; over-controlling (including mediators or colliders) can introduce bias of a different kind.
Formal definition
Variable Z is a confounder of the X → Y relationship if:
- Z causes X (or at least predicts it).
- Z causes Y (independently of X).
- Z is not on the causal path from X to Y (that is, Z is not a mediator).
This creates the fork structure: X ← Z → Y, producing correlation between X and Y even when X has no causal effect on Y.
Worked example — shoe size and reading ability
Among primary school children, shoe size correlates positively with reading test scores, yet larger feet do not cause better reading. The confounder is age: older children have both larger feet and better reading skills. Conditioning on age (comparing only children of the same age) removes the spurious correlation, provided age is the only common cause.
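The shoe-size example can be reproduced with a small simulation. This is a minimal sketch, not real data: the age range, coefficients, and noise levels are invented for illustration, and the helper `residualise` is a hypothetical name. Age drives both variables, shoe size has no effect on reading, and the partial correlation is computed by removing the linear effect of age from each variable.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Assumed data-generating process: age causes both variables,
# and shoe size has no effect on reading score.
age = rng.uniform(5, 11, size=n)                    # years
shoe_size = 20 + 1.5 * age + rng.normal(0, 1.0, n)  # arbitrary units
reading = 10 + 8.0 * age + rng.normal(0, 5.0, n)    # test score


def residualise(v, z):
    """Return the part of v that is not linearly explained by z."""
    design = np.column_stack([np.ones_like(z), z])
    coef, *_ = np.linalg.lstsq(design, v, rcond=None)
    return v - design @ coef


# Raw correlation: inflated by the shared dependence on age.
raw_corr = np.corrcoef(shoe_size, reading)[0, 1]

# Partial correlation given age: correlate what is left after removing age.
partial_corr = np.corrcoef(
    residualise(shoe_size, age), residualise(reading, age)
)[0, 1]

print(f"raw correlation:          {raw_corr:.2f}")
print(f"partial correlation|age:  {partial_corr:.2f}")
```

Under these assumptions the raw correlation is strong while the partial correlation is close to zero, which is what the fork structure predicts.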
Methods to control for confounders
1. Randomisation (gold standard). Randomly assigning treatment balances all confounders, measured and unmeasured, in expectation. This is why RCTs are the benchmark.
2. Regression adjustment. Include Z as a covariate in the regression model. The coefficient on X then estimates the effect of X holding Z fixed. Works well when the confounders are observed and the model is correctly specified.
3. Stratification / subgroup analysis. Estimate the association between X and the outcome within each stratum of Z, then pool the stratum-specific estimates as a weighted average (for example, the Mantel-Haenszel estimator). This is the same within-group comparison that resolves Simpson’s paradox.
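4. Propensity score matching / weighting. Estimate P(treatment = 1 | Z) for each unit. Match treated to untreated units with similar scores, or reweight by the inverse propensity. This balances observed confounders without assuming a linear outcome model, but it requires a well-specified propensity model and overlap (every unit has a non-zero chance of either treatment).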
5. Instrumental variables. Use an instrument W that affects X, has no direct effect on Y (only through X), and is itself independent of the unobserved confounders. Useful when confounders are unobserved but such an instrument exists (e.g., lottery-based assignment).
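The five methods above can be compared on simulated data where the true effect is known. This is a rough sketch under stated assumptions, not a recommended analysis pipeline: the effect size of 2.0, the binary confounder, the treatment probabilities, and the instrument are all invented for illustration, and every variable name is hypothetical. The instrumental-variable part uses a separate data-generating process with an unobserved confounder `u`, and the simple ratio-of-covariances (Wald-type) estimator.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000
true_effect = 2.0  # assumed causal effect of X on Y in this simulation

# Observed binary confounder Z raises both treatment uptake and the outcome.
z = rng.binomial(1, 0.5, n)
x = rng.binomial(1, np.where(z == 1, 0.8, 0.2))
y = true_effect * x + 3.0 * z + rng.normal(0, 1, n)

# Naive comparison: ignores Z, so it absorbs part of Z's effect on Y.
naive = y[x == 1].mean() - y[x == 0].mean()

# 1. Randomisation: treatment assigned by coin flip, independent of Z.
x_rct = rng.binomial(1, 0.5, n)
y_rct = true_effect * x_rct + 3.0 * z + rng.normal(0, 1, n)
randomised = y_rct[x_rct == 1].mean() - y_rct[x_rct == 0].mean()

# 2. Regression adjustment: coefficient on X with Z as a covariate.
design = np.column_stack([np.ones(n), x, z])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
adjusted = coef[1]

# 3. Stratification: within-stratum differences, weighted by stratum size.
stratified = 0.0
for level in (0, 1):
    in_stratum = z == level
    diff = y[in_stratum & (x == 1)].mean() - y[in_stratum & (x == 0)].mean()
    stratified += in_stratum.mean() * diff

# 4. Inverse propensity weighting, with the propensity estimated per level of Z.
propensity = np.where(z == 1, x[z == 1].mean(), x[z == 0].mean())
weights = np.where(x == 1, 1 / propensity, 1 / (1 - propensity))
ipw = (np.average(y[x == 1], weights=weights[x == 1])
       - np.average(y[x == 0], weights=weights[x == 0]))

# 5. Instrumental variable: U is unobserved, W shifts X but not Y directly.
u = rng.normal(0, 1, n)
w = rng.binomial(1, 0.5, n)
x_iv = 1.0 * w + u + rng.normal(0, 1, n)
y_iv = true_effect * x_iv + 2.0 * u + rng.normal(0, 1, n)
ols_without_u = np.cov(x_iv, y_iv)[0, 1] / np.var(x_iv, ddof=1)  # biased by U
iv = np.cov(w, y_iv)[0, 1] / np.cov(w, x_iv)[0, 1]

results = {
    "naive": naive, "randomised": randomised, "regression": adjusted,
    "stratified": stratified, "ipw": ipw,
    "OLS (U omitted)": ols_without_u, "IV": iv,
}
for name, estimate in results.items():
    print(f"{name:>16}: {estimate:.2f}")
```

With these invented parameters the naive difference should land well above the true effect (the population value works out to about 3.8), while the randomised, regression, stratified, and weighted estimates should all sit near 2.0. In the second dataset, ordinary least squares without `u` is biased upward and the instrument-based ratio should recover roughly 2.0, though with more sampling noise than the other estimators.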
Mediators vs confounders vs colliders
| Variable type | Structure | Controlling for it… |
|---|---|---|
| Confounder | X ← Z → Y | Removes bias ✓ |
| Mediator | X → M → Y | Blocks part or all of the effect of interest ✗ |
| Collider | X → C ← Y | Opens a spurious path ✗ |
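The last two rows of the table can be illustrated with a short simulation. This is a sketch under assumed linear relationships with invented coefficients; `slope_on_x` is a hypothetical helper that returns the least-squares coefficient on X, optionally with extra control variables.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20_000


def slope_on_x(y, x, *controls):
    """Least-squares coefficient on x when regressing y on x and any controls."""
    design = np.column_stack([np.ones_like(x), x, *controls])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]


# Collider: X and Y are independent, and both cause C.
x = rng.normal(size=n)
y = rng.normal(size=n)
c = x + y + rng.normal(size=n)
collider_unadjusted = slope_on_x(y, x)     # no association, as it should be
collider_adjusted = slope_on_x(y, x, c)    # conditioning on C induces one

# Mediator: X affects Y only through M (assumed total effect 1.5 * 2.0 = 3.0).
x2 = rng.normal(size=n)
m = 1.5 * x2 + rng.normal(size=n)
y2 = 2.0 * m + rng.normal(size=n)
mediator_unadjusted = slope_on_x(y2, x2)   # recovers the total effect
mediator_adjusted = slope_on_x(y2, x2, m)  # controlling for M removes it

print(f"collider, no control:   {collider_unadjusted:.2f}")
print(f"collider, control C:    {collider_adjusted:.2f}")
print(f"mediator, no control:   {mediator_unadjusted:.2f}")
print(f"mediator, control M:    {mediator_adjusted:.2f}")
```

In this setup, adding the collider as a control turns a null association into a clearly negative one, and adding the mediator makes a real effect of about 3.0 look like no effect. Both are cases where adjusting makes the estimate worse rather than better.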