datarekha
Machine Learning · Medium · Asked at McKinsey, Airbnb, Goldman Sachs

What is multicollinearity, how does it harm regression, and how do you detect and fix it?

The short answer

Multicollinearity occurs when two or more predictors are highly linearly correlated, inflating the variance of coefficient estimates and making them numerically unstable and uninterpretable. The Variance Inflation Factor (VIF) quantifies how much each coefficient's variance is inflated relative to an orthogonal design.

How to think about it

When predictors are highly correlated, the columns of X become nearly linearly dependent, making XᵀX nearly singular. The OLS estimate β̂ = (XᵀX)⁻¹Xᵀy then has covariance σ²(XᵀX)⁻¹, whose entries grow very large as XᵀX approaches singularity, so small changes in the data produce wildly different estimates.

What it does NOT break: predictions on in-distribution data (where the predictors keep the same correlation pattern as in training) remain good; only the interpretation of, and inference on, individual coefficients break.
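
A small simulation can make both points concrete. The sketch below is illustrative only: the sample size, the correlation values 0.0 and 0.99, the true coefficients (1, 1), and the helper name `simulate` are all assumptions chosen for the demonstration, not part of the original answer.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_sims = 200, 500  # assumed sample size and number of refits

def simulate(rho):
    """Refit OLS on fresh samples; return coefficient draws and test predictions."""
    cov = np.array([[1.0, rho], [rho, 1.0]])
    # Fixed test points drawn from the same distribution as the training data.
    X_test = rng.multivariate_normal([0, 0], cov, size=50)
    coefs, preds = [], []
    for _ in range(n_sims):
        X = rng.multivariate_normal([0, 0], cov, size=n)
        y = X @ np.array([1.0, 1.0]) + rng.normal(size=n)  # true betas are (1, 1)
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        coefs.append(beta)
        preds.append(X_test @ beta)
    return np.array(coefs), np.array(preds)

coefs_lo, preds_lo = simulate(rho=0.0)
coefs_hi, preds_hi = simulate(rho=0.99)

# Spread of each coefficient across refits.
coef_sd_lo = coefs_lo.std(axis=0)
coef_sd_hi = coefs_hi.std(axis=0)
# Spread of each test prediction across refits, averaged over test points.
pred_sd_lo = preds_lo.std(axis=0).mean()
pred_sd_hi = preds_hi.std(axis=0).mean()

print("coefficient sd, rho=0.00:", coef_sd_lo.round(3))
print("coefficient sd, rho=0.99:", coef_sd_hi.round(3))
print("mean prediction sd:", round(pred_sd_lo, 3), "vs", round(pred_sd_hi, 3))
```

Under these assumptions the coefficient spread should be several times larger in the correlated case, while the prediction spread on in-distribution test points should stay roughly the same.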

Variance Inflation Factor (VIF):

For predictor j, regress xⱼ on all other predictors. Let R²ⱼ be that regression’s R-squared.

VIFⱼ = 1 / (1 − R²ⱼ)
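
To see why this ratio is called an inflation factor, it may help to write out the standard OLS result for the sampling variance of a single slope. This assumes homoskedastic errors with variance σ² and a model that includes an intercept; neither assumption is stated in the answer above.

```latex
\operatorname{Var}(\hat\beta_j)
  = \frac{\sigma^2}{\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2} \cdot \frac{1}{1 - R_j^2}
  = \frac{\sigma^2}{\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2} \cdot \mathrm{VIF}_j
```

The first factor is the variance the coefficient would have if predictor j were uncorrelated with the others. For example, R²ⱼ = 0.9 gives a VIF of 10, so the standard error is about √10 ≈ 3.2 times larger than in the orthogonal case.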

  • VIF = 1: no correlation with other predictors.
  • VIF 1–5: moderate, usually acceptable.
  • VIF 5–10: high; worth a closer look.
  • VIF above 10: serious multicollinearity; investigate.

These cutoffs are rules of thumb, not formal tests.
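
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# X is a DataFrame of numeric predictors with no constant column.
# Add an intercept first: without it, statsmodels computes uncentered
# VIFs, which can be misleading.
X_const = sm.add_constant(X)

# VIF for each predictor, skipping the constant at column 0.
vif_data = pd.DataFrame({
    "feature": X.columns,
    "VIF": [variance_inflation_factor(X_const.values, i)
            for i in range(1, X_const.shape[1])]
})
print(vif_data.sort_values("VIF", ascending=False))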
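
The VIF can also be computed directly from its definition, which can be a useful sanity check on what the library call returns. The following is a minimal NumPy sketch; the function name `vif_from_definition` and the synthetic columns `x1`, `x2`, `x3` are hypothetical and exist only for illustration.

```python
import numpy as np

def vif_from_definition(X):
    """VIF per column of a 2-D array X (predictors only, no intercept column)."""
    n, p = X.shape
    vifs = np.empty(p)
    for j in range(p):
        target = X[:, j]
        # Regress column j on the remaining columns plus an intercept.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        vifs[j] = 1.0 / (1.0 - r2)
    return vifs

rng = np.random.default_rng(42)
x1 = rng.normal(size=500)
x2 = x1 + 0.1 * rng.normal(size=500)  # nearly a copy of x1
x3 = rng.normal(size=500)             # independent of the other two
X_demo = np.column_stack([x1, x2, x3])

vifs = vif_from_definition(X_demo)
print(vifs.round(1))
```

In this synthetic setup the first two columns should show very large VIFs and the third a value close to 1. The same numbers also appear on the diagonal of the inverse of the predictors' correlation matrix, which gives a quick cross-check.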

Remedies:

  1. Drop one of two highly correlated features.
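  2. Apply PCA to replace the correlated predictors with orthogonal components; the trade-off is that coefficients no longer map to the original features.
  3. Use Ridge regression (L2): it adds λI to XᵀX, which makes the matrix invertible for any λ > 0 and shrinks the unstable coefficients at the cost of some bias.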
  4. Collect more data (helps but rarely practical).
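
The Ridge remedy can be sketched with the closed-form solution. This is an illustration under stated assumptions, not a recommended pipeline: the data are synthetic, no intercept is fitted, and the penalty `lam = 1.0` is an arbitrary value (in practice it would typically be chosen by cross-validation).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)  # almost collinear with x1
X = np.column_stack([x1, x2])
y = x1 + x2 + rng.normal(size=n)     # true coefficients are (1, 1)

gram = X.T @ X
lam = 1.0  # assumed penalty strength, for illustration only

# OLS solves (XᵀX) β = Xᵀy; Ridge solves (XᵀX + λI) β = Xᵀy.
beta_ols = np.linalg.solve(gram, X.T @ y)
beta_ridge = np.linalg.solve(gram + lam * np.eye(2), X.T @ y)

# The condition number measures how close each matrix is to singular.
cond_ols = np.linalg.cond(gram)
cond_ridge = np.linalg.cond(gram + lam * np.eye(2))

print("OLS coefficients:  ", beta_ols.round(2))
print("Ridge coefficients:", beta_ridge.round(2))
print("condition numbers: ", round(cond_ols), "vs", round(cond_ridge))
```

Adding λI lifts every eigenvalue of XᵀX by λ, so the smallest one can no longer be near zero. In this setup the Ridge coefficients should sum to roughly 2 and be far less erratic than the OLS ones, though they are biased toward zero.
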