VIF (Variance Inflation Factor)
Definition
Core Statement
Variance Inflation Factor (VIF) quantifies how much the variance of an estimated regression coefficient is inflated due to multicollinearity with other predictors. It diagnoses whether predictors in Multiple Linear Regression are too similar to each other.
Purpose
- Detect multicollinearity (high inter-correlation among predictors).
- Identify which predictors are redundant.
- Justify removing or combining correlated variables.
When to Use
Calculate VIF When...
- Building a Multiple Linear Regression model with many predictors.
- Coefficients have unexpected signs or magnitudes.
- Standard errors are unusually large.
Theoretical Background
Formula
For predictor $X_j$:

$$\mathrm{VIF}_j = \frac{1}{1 - R_j^2}$$

where $R_j^2$ is the coefficient of determination obtained by regressing $X_j$ on all the other predictors.
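The formula can be checked by hand: regress each predictor on the others and plug the resulting $R^2$ in. A minimal sketch using NumPy and scikit-learn on synthetic data (the variables and coefficients below are made up for illustration; `x3` is deliberately constructed as a combination of `x1` and `x2`):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
# x3 is largely a linear combination of x1 and x2 -> collinear by construction
x3 = 0.8 * x1 + 0.6 * x2 + rng.normal(scale=0.3, size=200)
X = np.column_stack([x1, x2, x3])

def vif_manual(X, j):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from
    regressing column j on the remaining columns."""
    others = np.delete(X, j, axis=1)
    r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
    return 1.0 / (1.0 - r2)

for j in range(X.shape[1]):
    print(f"VIF_{j + 1} = {vif_manual(X, j):.2f}")
```

Because `x3` was built from `x1` and `x2`, all three columns share information and every VIF comes out well above 1.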
Interpretation
| VIF | Interpretation |
|---|---|
| 1 | No correlation with other predictors. (Ideal). |
| 1 - 5 | Moderate correlation. (Usually acceptable). |
| 5 - 10 | High correlation. (Concerning). |
| > 10 | Severe multicollinearity. (Action required). |
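These thresholds map directly onto the $R^2$ of the auxiliary regression: VIF = 5 corresponds to $R^2 = 0.8$, and VIF = 10 to $R^2 = 0.9$. A quick illustration:

```python
# Each VIF threshold corresponds to an R² from the auxiliary regression
# of one predictor on the others: VIF = 1 / (1 - R²).
for r2 in (0.0, 0.5, 0.8, 0.9):
    print(f"R² = {r2:.1f}  ->  VIF = {1 / (1 - r2):.1f}")
# R² = 0.0  ->  VIF = 1.0
# R² = 0.5  ->  VIF = 2.0
# R² = 0.8  ->  VIF = 5.0
# R² = 0.9  ->  VIF = 10.0
```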
Why Multicollinearity is Bad
- Unstable Coefficients: Small changes in the data cause large swings in the estimated coefficients $\hat{\beta}_j$.
- Inflated Standard Errors: P-values become unreliable; real effects appear insignificant.
- Interpretation Ambiguity: Which variable is "really" responsible for the effect?
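The instability is easy to demonstrate: with two nearly identical predictors, resampling only the noise term makes the individual coefficient estimates swing wildly, even though the true coefficients never change. A small simulation sketch (synthetic data; the coefficients and noise scale are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)   # nearly identical to x1
X = np.column_stack([np.ones(n), x1, x2])

coefs = []
for _ in range(50):
    # Same predictors every time; only the noise is redrawn
    y = 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    coefs.append(beta[1])                  # estimated coefficient of x1

print(f"slope of x1 ranges over [{min(coefs):.1f}, {max(coefs):.1f}]")
```

The true coefficient of `x1` is 1.0 in every replication, yet the estimates scatter over a wide interval because the design matrix is nearly singular.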
Assumptions
VIF is a diagnostic, not a test. It has no formal assumptions but is only meaningful in the context of OLS regression.
Limitations
Pitfalls
- Threshold is arbitrary: VIF > 5 is a guideline, not a law. Context matters.
- VIF for the intercept is meaningless. Ignore it.
- Structural multicollinearity: Polynomial and interaction terms (e.g., $X$ and $X^2$) naturally have high VIF; center the variables first.
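The centering fix can be sketched directly. For two predictors, VIF reduces to $1/(1 - r^2)$ with $r$ their correlation, so comparing raw versus centered polynomial terms shows the effect (synthetic strictly positive data, chosen so that $X$ and $X^2$ correlate strongly):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(2, 6, size=300)   # strictly positive -> x and x² nearly collinear

def vif_pair(a, b):
    """VIF of predictor a given a single other predictor b: 1 / (1 - r²)."""
    r = np.corrcoef(a, b)[0, 1]
    return 1.0 / (1.0 - r**2)

print(f"raw:      VIF(x, x²)   ≈ {vif_pair(x, x**2):.1f}")
xc = x - x.mean()                 # center before building the polynomial term
print(f"centered: VIF(xc, xc²) ≈ {vif_pair(xc, xc**2):.1f}")
```

Centering removes the spurious (structural) correlation between the linear and quadratic terms without changing the model's fit.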
Solutions for High VIF
- Remove one variable: If two predictors measure essentially the same thing, drop one.
- Combine variables: Create an index or use Principal Component Analysis (PCA).
- Regularization: Use Ridge Regression or Lasso Regression.
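As a sketch of the regularization route, Ridge keeps the coefficients of two near-duplicate predictors in a sensible range where OLS lets them swing (synthetic data; `alpha=1.0` is an arbitrary choice, normally tuned by cross-validation):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(3)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)   # near-duplicate predictor
y = x1 + x2 + rng.normal(size=n)           # true coefficients: 1 and 1
X = np.column_stack([x1, x2])

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)         # L2 penalty shrinks the estimates
print("OLS coefficients:  ", np.round(ols.coef_, 2))
print("Ridge coefficients:", np.round(ridge.coef_, 2))
```

Ridge trades a little bias for a large drop in variance: the penalized estimates stay near the truth, and their sum still recovers the combined effect of the two predictors.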
Python Implementation
```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Prepare data (must include a constant for correct calculation)
X = add_constant(df[['Age', 'Income', 'Education']])

# Calculate VIF for every column (including the constant)
vif_data = pd.DataFrame()
vif_data['Variable'] = X.columns
vif_data['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

# The intercept's VIF is meaningless, so drop it from the report
print(vif_data[vif_data['Variable'] != 'const'])
```
R Implementation
```r
library(car)

model <- lm(Y ~ Age + Income + Education, data = df)

# Calculate VIF
vif(model)

# Generalized VIF (for categorical variables with >2 levels):
# GVIF^(1/(2*Df)) > 2 is concerning
```
Interpretation Guide
| Output | Interpretation |
|---|---|
| VIF(Income) = 1.2 | No multicollinearity issue with Income. |
| VIF(Education) = 8.5 | Education is highly correlated with other predictors. Investigate. |
| All VIF < 5 | Model is reasonably free from multicollinearity. |
Related Concepts
- Multiple Linear Regression
- Ridge Regression - Handles multicollinearity.
- Principal Component Analysis (PCA) - Dimension reduction.
- Correlation Matrix