What is multicollinearity, how does it harm regression, and how do you detect and fix it?
Multicollinearity occurs when a predictor is highly correlated with, or nearly a linear combination of, the other predictors. It inflates the variance of the coefficient estimates, making them numerically unstable and hard to interpret. The Variance Inflation Factor (VIF) quantifies how much each coefficient's variance is inflated relative to the case where that predictor is uncorrelated with the others.
How to think about it
When predictors are correlated, the columns of X become nearly linearly dependent, making XᵀX nearly singular. The OLS solution β̂ = (XᵀX)⁻¹Xᵀy has covariance σ²(XᵀX)⁻¹, so a nearly singular XᵀX yields coefficients with enormous variance: small changes in the data produce wildly different estimates.
What it does NOT break: predictions on in-distribution data remain good; only individual coefficient interpretation and inference break.
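The contrast between unstable coefficients and stable predictions can be checked with a small simulation. The sketch below is illustrative only: the synthetic data, the noise level, and the helper name fit_ols are assumptions, not part of the original text. It refits OLS on many fresh noise draws for two nearly identical predictors and compares the spread of the individual coefficients, of their sum, and of the fitted values.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares coefficients via lstsq (avoids forming an explicit inverse)."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)        # x2 is nearly a copy of x1
X = np.column_stack([np.ones(n), x1, x2])  # intercept, x1, x2
signal = 1.0 + 2.0 * x1 + 3.0 * x2         # assumed true model

# Refit on many independent noise draws; record coefficients and fitted values.
coefs, preds = [], []
for _ in range(200):
    y = signal + rng.normal(size=n)
    b = fit_ols(X, y)
    coefs.append(b)
    preds.append(X @ b)
coefs = np.array(coefs)
preds = np.array(preds)

coef_sd = coefs.std(axis=0)                  # spread of each coefficient across refits
sum_sd = (coefs[:, 1] + coefs[:, 2]).std()   # spread of b1 + b2
pred_sd = preds.std(axis=0).mean()           # average spread of the fitted values
cond = np.linalg.cond(X.T @ X)               # large value = nearly singular XᵀX

print("condition number of XᵀX:", cond)
print("sd of (intercept, b1, b2):", coef_sd)
print("sd of b1 + b2:", sum_sd)
print("mean sd of fitted values:", pred_sd)
```

In a run like this, the individual slopes swing widely while their sum and the fitted values barely move: the data pin down the combined effect of x1 and x2, but not how to split it between them.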
Variance Inflation Factor (VIF):
For predictor j, regress xⱼ on all other predictors. Let R²ⱼ be that regression’s R-squared.
VIFⱼ = 1 / (1 − R²ⱼ)
- VIF = 1: no correlation with other predictors.
- VIF 1–5: moderate, usually acceptable.
- VIF 5–10: elevated; worth a closer look.
- VIF above 10: serious multicollinearity; investigate. (These cutoffs are rules of thumb, not hard limits.)
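As a worked number: R²ⱼ = 0.9 gives VIFⱼ = 1 / (1 − 0.9) = 10, so the coefficient's variance is ten times larger, and its standard error about √10 ≈ 3.2 times larger, than it would be with an uncorrelated predictor. The definition can also be implemented directly. The following is a minimal sketch using NumPy only; the function name vif and the synthetic columns a, b, c are assumptions made for illustration.

```python
import numpy as np

def vif(X):
    """VIF for each column of X (n_samples x n_features, no intercept column)."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        # Regress column j on an intercept plus all the other columns.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = a + 0.3 * rng.normal(size=500)   # strongly correlated with a
c = rng.normal(size=500)             # independent of a and b
features = np.column_stack([a, b, c])

vifs = vif(features)
print(dict(zip(["a", "b", "c"], vifs)))
```

Here a and b should show clearly elevated VIFs while c stays close to 1. As a cross-check, the same values appear on the diagonal of the inverse of the predictors' correlation matrix.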
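import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# X is assumed to be a DataFrame of numeric predictors with no constant column.
# variance_inflation_factor does not add an intercept itself, so include one in the design matrix.
X_const = add_constant(X)
vif_data = pd.DataFrame({
    "feature": X.columns,
    # Index 0 is the added "const" column; report VIFs for the real predictors only.
    "VIF": [variance_inflation_factor(X_const.values, i) for i in range(1, X_const.shape[1])],
})
print(vif_data.sort_values("VIF", ascending=False))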
Remedies:
- Drop one of two highly correlated features.
- Apply PCA to decorrelate and reduce to orthogonal components.
- Use Ridge regression (L2): it adds λI to XᵀX, which guarantees invertibility for any λ > 0, at the cost of some bias in the coefficients.
- Collect more data (helps but rarely practical).
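The Ridge remedy can be sketched in closed form to show what adding λI does. This is an illustrative example under stated assumptions: the helper ridge_fit, the synthetic data, and the choice λ = 1 are hypothetical, and the predictors and response are centered so that no intercept is needed. In practice λ would normally be chosen by cross-validation on standardized features.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge: solve (XᵀX + λI) β = Xᵀy. Assumes centered X and y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)   # nearly collinear with x1
X = np.column_stack([x1, x2])
X = X - X.mean(axis=0)                # center the predictors

gram = X.T @ X
cond_ols = np.linalg.cond(gram)                      # nearly singular
cond_ridge = np.linalg.cond(gram + 1.0 * np.eye(2))  # after adding λI with λ = 1

# Refit both estimators on many noise draws and compare coefficient spread.
ols_coefs, ridge_coefs = [], []
for _ in range(200):
    y = 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)
    y = y - y.mean()
    ols_coefs.append(ridge_fit(X, y, lam=0.0))   # λ = 0 reduces to OLS
    ridge_coefs.append(ridge_fit(X, y, lam=1.0))
ols_sd = np.std(ols_coefs, axis=0)
ridge_sd = np.std(ridge_coefs, axis=0)

print("condition number, OLS vs ridge:", cond_ols, cond_ridge)
print("coefficient sd, OLS:", ols_sd)
print("coefficient sd, ridge:", ridge_sd)
print("mean ridge coefficients:", np.mean(ridge_coefs, axis=0))
```

The ridge coefficients vary far less across refits, but they are biased: the penalty pulls the two nearly identical predictors toward similar values rather than recovering the assumed 2 and 3 separately.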