Multiple Linear Regression
Definition
Multiple Linear Regression (MLR) extends simple linear regression to model the relationship between a single continuous dependent variable (Y) and two or more predictor variables (X₁, X₂, ..., Xₖ).
Purpose
- Isolate Effects: Understand the effect of each predictor Xⱼ on Y, holding the other predictors constant (ceteris paribus); see the sketch after this list.
- Control for Confounders: Reduce bias by including variables that might otherwise distort the relationship.
- Build Predictive Models: Create more accurate predictions than SLR.
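The "holding constant" idea can be made concrete with a small simulation. The sketch below uses made-up variable names and simulated data (nothing from this article): the outcome depends on two correlated predictors, so the simple-regression slope for x1 absorbs part of x2's effect, while the multiple-regression slope recovers the ceteris-paribus effect.
import numpy as np
import statsmodels.api as sm

# Simulated illustration (hypothetical data): x2 is a confounder correlated with x1.
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = 0.7 * x1 + rng.normal(scale=0.5, size=n)    # correlated with x1
y = 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)     # true effects: 2 and 3

slr = sm.OLS(y, sm.add_constant(x1)).fit()                          # y ~ x1 only
mlr = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()   # y ~ x1 + x2

print("SLR slope for x1:", round(slr.params[1], 2))  # biased upward (roughly 2 + 3*0.7 = 4.1)
print("MLR slope for x1:", round(mlr.params[1], 2))  # close to the true value 2.0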
When to Use
- You have multiple continuous predictors for one continuous outcome.
- You want to control for confounders.
- You believe the relationship between each predictor and the outcome is linear.
When Not to Use
- The outcome is categorical (use Logistic Regression instead).
- Predictors are highly correlated (multicollinearity problem; see Limitations below).
Theoretical Background
The Model Equation
Y = β₀ + β₁X₁ + β₂X₂ + ... + βₖXₖ + ε
Interpretation of βⱼ
Each coefficient βⱼ is the expected change in Y for a one-unit increase in Xⱼ, holding all other predictors constant.
Adjusted R²
Standard R² never decreases when predictors are added, even irrelevant ones. Adjusted R² penalizes for the number of predictors:
Adjusted R² = 1 - (1 - R²)(n - 1) / (n - k - 1)
where n is the number of observations and k is the number of predictors. Use Adjusted R² when comparing models with different numbers of predictors.
The F-Test (Overall Model Significance)
Tests whether the model as a whole explains significantly more variance than a model with just the intercept.
- H₀: β₁ = β₂ = ... = βₖ = 0 (the model has no explanatory power).
- If p < 0.05: Reject H₀. The model is useful.
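A quick numeric sketch of the two formulas above, with illustrative values for n, k, and R² (assumed numbers, not taken from any example in this article); the p-value for the F statistic uses SciPy:
from scipy import stats

# Illustrative (assumed) inputs: n observations, k predictors, R² from some fitted model.
n, k, r2 = 50, 5, 0.60

adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)   # Adjusted R²
f_stat = (r2 / k) / ((1 - r2) / (n - k - 1))    # overall F statistic
p_value = stats.f.sf(f_stat, k, n - k - 1)      # P(F >= f_stat) under H₀

print(round(adj_r2, 3), round(f_stat, 2), p_value)  # 0.555, 13.2, and a very small p-value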
Assumptions (LINE + No Multicollinearity)
All assumptions from Simple Linear Regression apply (the LINE assumptions: Linearity, Independence, Normality of residuals, Equal variance), plus one critical addition:
- No Multicollinearity: predictors should not be highly correlated with one another; otherwise individual coefficients cannot be estimated reliably.
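A quick screen for the added assumption is a pairwise correlation matrix of the predictors (the formal VIF check appears in the Python implementation below). The tiny DataFrame here is made up purely for illustration:
import pandas as pd

# Minimal sketch with a made-up DataFrame; in practice use your own data.
df_example = pd.DataFrame({
    'Age':        [25, 32, 41, 50, 29, 45],
    'Experience': [ 2,  8, 15, 25,  5, 20],
    'Education':  [ 4,  3,  5,  4,  2,  5],
})
print(df_example.corr().round(2))
# |r| above roughly 0.8 between two predictors hints at multicollinearity.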
Limitations
- Multicollinearity: If X₁ and X₂ are highly correlated (e.g., r = 0.9), the model cannot determine which one is responsible for the effect. Standard errors inflate and p-values become unreliable.
- Overfitting: Adding too many predictors can make the model fit noise rather than signal. Use Ridge Regression or Lasso Regression for regularization (see the sketch after this list).
- Specification Errors: Including irrelevant variables or omitting relevant ones biases estimates.
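A minimal sketch of the regularized alternatives mentioned above, on made-up data with two nearly collinear predictors. It uses scikit-learn (an assumption; the rest of this article fits plain OLS with statsmodels):
import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Made-up data: two highly correlated predictors with true effects 2.0 and 1.5.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
X[:, 1] = 0.95 * X[:, 0] + 0.05 * X[:, 1]        # make the predictors nearly collinear
y = X @ np.array([2.0, 1.5]) + rng.normal(size=200)

ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X, y)  # shrinks correlated coefficients
lasso = make_pipeline(StandardScaler(), Lasso(alpha=0.1)).fit(X, y)  # can zero out weak predictors

print("Ridge coefficients:", ridge.named_steps['ridge'].coef_)
print("Lasso coefficients:", lasso.named_steps['lasso'].coef_)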
Python Implementation
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
# 1. Prepare Data
X = df[['Age', 'Experience', 'Education']]
y = df['Salary']
X = sm.add_constant(X)
# 2. Check VIF Before Fitting
vif = pd.DataFrame()
vif['Variable'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print("VIF:\n", vif[vif['Variable'] != 'const'])
# Rule: VIF > 5 is concerning; VIF > 10 is severe.
# 3. Fit Model
model = sm.OLS(y, X).fit()
print(model.summary())
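A possible follow-up step, not part of the original listing: use the fitted statsmodels results to predict for a new observation, with interval estimates. The new-employee values below are illustrative and mirror the worked example further down.
# 4. Predict for a New Observation (illustrative values), with intervals
new_employee = pd.DataFrame({'const': [1.0], 'Age': [30],
                             'Experience': [5], 'Education': [4]})
pred = model.get_prediction(new_employee)
print(pred.summary_frame(alpha=0.05))  # fitted mean, 95% CI for the mean, 95% prediction interval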
R Implementation
# 1. Fit Model
model <- lm(Salary ~ Age + Experience + Education, data = df)
# 2. Results
summary(model)
# 3. Check VIF
library(car)
vif(model)
# GVIF^(1/(2*Df)) > 2 is concerning
# 4. Confidence Intervals
confint(model)
# 5. Compare Models (ANOVA)
model_reduced <- lm(Salary ~ Age, data = df)
anova(model_reduced, model)
# Tests if additional variables significantly improve fit
Worked Numerical Example
Data: 100 employees
Model: Salary = β₀ + β₁(Age) + β₂(Experience) + β₃(Education)
Results:
- β₀ (Intercept) = 20,000
- β₁ (Age) = 500, p < 0.05
- β₂ (Experience) = 1,200, p < 0.001
- β₃ (Education) = 3,000, p < 0.01
- Adjusted R² = 0.78
- F-statistic: p < 0.001
Interpretation:
- For a 30-year-old with 5 years of experience and a bachelor's degree (Education = 4):
Predicted Salary = 20,000 + 500(30) + 1,200(5) + 3,000(4) = $53,000
- If experience increases to 6 years (holding age and education constant):
New Salary = $53,000 + $1,200 = $54,200
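The same arithmetic as a tiny sketch (coefficients and inputs copied from the example above):
# Reproduce the worked example's prediction by hand.
coef = {'Intercept': 20_000, 'Age': 500, 'Experience': 1_200, 'Education': 3_000}
person = {'Age': 30, 'Experience': 5, 'Education': 4}

salary = coef['Intercept'] + sum(coef[k] * v for k, v in person.items())
print(salary)                        # 53000
print(salary + coef['Experience'])   # 54200 after one extra year of experience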
Interpretation Guide
| Output | Example Value | Interpretation | Edge Case Notes |
|---|---|---|---|
| β₁ (Age) | 500 | Each additional year of Age increases Salary by $500, holding Experience and Education constant. | If Age and Experience are correlated (r = 0.9), this estimate is unstable. Check VIF. |
| β₁ (Age) | -200 | Counterintuitive negative sign suggests Simpson's Paradox or multicollinearity. | Investigate: Age may proxy for obsolete skills when controlling for Education. |
| VIF for Experience | 8.5 | High multicollinearity. Standard errors inflated. | Consider: remove Experience, or combine Age + Experience into "Career Length". |
| VIF for Experience | 1.2 | No multicollinearity concern. | Coefficient estimate is reliable. |
| Adjusted R² | 0.78 | 78% of variance explained (penalized for # of predictors). | Compare to R² (0.82): penalty is small, so the model complexity is justified. |
| Adjusted R² | 0.15 | Model explains little variance even after the penalty. | Predictors may be irrelevant, or the relationship is non-linear. |
| F-statistic | p < 0.001 | Model is significant overall: at least one predictor has a non-zero effect. | Individual p-values may still be > 0.05 due to multicollinearity. |
| F-statistic | p = 0.30 | Model has no explanatory power. | All predictors together don't predict Y better than the intercept-only model. |
Common Pitfall Example
Scenario: Regressing Income on Years_of_Education and Years_of_Experience.
Problem: Education and Experience are highly correlated (r = 0.85) because:
- High education → late career start → less experience
- Low education → early career start → more experience
Result:
- VIF(Education) = 7.2, VIF(Experience) = 7.2
- β_Education = $2,000, p = 0.08 (not significant!)
- β_Experience = $1,500, p = 0.12 (not significant!)
But: When you remove one variable:
- Model with only Education: β = $4,000, p < 0.001
- Model with only Experience: β = $3,500, p < 0.001
Lesson: Both variables ARE important, but multicollinearity makes them appear non-significant when included together. Use Ridge regression or create a composite variable (e.g., a "Career Investment Score"), as sketched below.
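The pitfall is easy to reproduce with simulated data. The sketch below is a hypothetical illustration (made-up coefficients and noise, not the salary numbers above): two strongly correlated predictors both drive the outcome, the joint model's standard errors inflate, and the single-predictor model is precise but absorbs part of the other variable's effect.
import numpy as np
import statsmodels.api as sm

# Hypothetical data: educ and exper are strongly correlated, and BOTH affect income.
rng = np.random.default_rng(42)
n = 100
educ = rng.normal(size=n)
exper = 0.9 * educ + rng.normal(scale=0.3, size=n)          # corr(educ, exper) is roughly 0.95
income = 2.0 * educ + 1.5 * exper + rng.normal(scale=5.0, size=n)

joint = sm.OLS(income, sm.add_constant(np.column_stack([educ, exper]))).fit()
alone = sm.OLS(income, sm.add_constant(educ)).fit()

# Multicollinearity inflates the joint model's standard errors, so its individual
# p-values can look non-significant even though both effects are real; the
# single-predictor slope is precise but absorbs part of exper's effect.
print("joint SEs:", joint.bse[1:], " joint p-values:", joint.pvalues[1:])
print("alone SE :", alone.bse[1], " alone p-value :", alone.pvalues[1])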
Related Concepts
- Simple Linear Regression - The single-predictor case.
- VIF (Variance Inflation Factor) - Multicollinearity diagnostic.
- Ridge Regression - Regularization to handle multicollinearity.
- Lasso Regression - Feature selection via L1 penalty.
- Adjusted R-squared - Model fit measure penalized for the number of predictors.