Statistics & Probability Medium Asked at GoogleAsked at MetaAsked at AirbnbAsked at Lyft

What is a confounding variable, and how do you control for it?

For Data Scientist Data Analyst ML Engineer AI / LLM Engineer

The short answer

A confounding variable is associated with both the treatment and the outcome, creating a spurious apparent relationship between them. Controlling for confounders — through randomisation, stratification, regression adjustment, or matching — is essential to recover a valid causal estimate.

How to think about it

A confounder sits on a back-door path between treatment and outcome. Ignoring it biases the estimated effect; over-controlling (including mediators or colliders) can introduce bias of a different kind.

Formal definition

Variable Z is a confounder of the X → Y relationship if:

Z causes X (or at least predicts it).
Z causes Y (independently of X).
Z is not on the causal path from X to Y.

This creates the fork structure: X ← Z → Y, producing correlation between X and Y even when X has no causal effect on Y.

Worked example — shoe size and reading ability

Among primary school children, shoe size correlates positively with reading test scores. Shoe size does not cause reading ability. The confounder is age: older children have larger feet and better reading skills. Conditioning on age removes the spurious correlation entirely.

Methods to control for confounders

1. Randomisation (gold standard) Randomly assigning treatment balances all confounders — measured and unmeasured — in expectation. This is why RCTs are the benchmark.

2. Regression adjustment Include Z as a covariate in the regression model. The coefficient on X then estimates the effect of X holding Z fixed. Works well when the confounders are observed and the model is correctly specified.

3. Stratification / subgroup analysis Compare X vs outcome within strata of Z. Compute a weighted average across strata (Mantel-Haenszel). Direct analog to what prevents Simpson’s paradox from misleading.

4. Propensity score matching / weighting Estimate P(treatment = 1 | Z) for each unit. Match treated to untreated with similar scores, or reweight by inverse propensity. Balances observed confounders without assuming a linear outcome model.

5. Instrumental variables Use an instrument W that affects X but has no direct effect on Y (only through X). Useful when confounders are unobserved but an instrument exists (e.g., lottery-based assignment).

Mediators vs confounders vs colliders

Variable type	Structure	Controlling for it…
Confounder	X ← Z → Y	Removes bias ✓
Mediator	X → M → Y	Blocks the effect of interest ✗
Collider	X → C ← Y	Opens a spurious path ✗

What is a confounding variable, and how do you control for it?

Formal definition

Worked example — shoe size and reading ability

Methods to control for confounders

Mediators vs confounders vs colliders

Keep practising

Explore further