Ridge Regression
Definition
Ridge Regression is a regularized form of Multiple Linear Regression that adds an L2 penalty (the sum of squared coefficients) to the loss function. This penalty shrinks coefficients towards zero (but not exactly to zero), making the model more robust to multicollinearity and reducing overfitting.
Purpose
- Handle Multicollinearity: When predictors are highly correlated, OLS estimates become unstable. Ridge stabilizes them.
- Reduce Overfitting: By penalizing large coefficients, Ridge prevents the model from fitting noise.
- Improve Prediction: Often yields better out-of-sample predictions than OLS in high-dimensional settings.
When to Use
- You have many predictors (potentially more than observations).
- Predictors are highly correlated (Variance Inflation Factor, VIF, greater than 5); see the VIF check sketched after this list.
- You want to prevent overfitting without necessarily selecting a subset of features.
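As a quick multicollinearity check for the VIF criterion above, here is a minimal sketch using statsmodels; the data and variable names are purely illustrative, not from this article.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Illustrative data: "rooms" is deliberately constructed to correlate with "size"
rng = np.random.default_rng(0)
size = rng.normal(150, 30, 200)
rooms = 0.03 * size + rng.normal(0, 0.2, 200)
age = rng.normal(20, 5, 200)
X = pd.DataFrame({"size": size, "rooms": rooms, "age": age})

# Compute VIF for each predictor against an intercept-augmented design matrix
X_const = add_constant(X)
vif = pd.Series(
    [variance_inflation_factor(X_const.values, i) for i in range(1, X_const.shape[1])],
    index=X.columns,
)
print(vif)  # values above ~5 suggest Ridge (or Lasso/Elastic Net) may help
```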
When Not to Use
- You need feature selection (Ridge coefficients shrink toward zero but never become exactly 0); use Lasso Regression instead.
- Interpretability of individual coefficients is paramount (shrunken coefficients are biased and harder to interpret).
Theoretical Background
The Objective Function
Ridge minimizes the residual sum of squares (RSS) plus a penalty on coefficient magnitude:

$$
\hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \left\{ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right\}
$$

| Term | Meaning |
|---|---|
| $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ | Residual Sum of Squares (RSS). Fits the data. |
| $\lambda \sum_{j=1}^{p} \beta_j^2$ | L2 penalty. Shrinks coefficients toward zero. |
| $\lambda \ge 0$ | Tuning parameter. Larger $\lambda$ means stronger shrinkage. |
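When $X$ is standardized and the intercept is handled separately, this objective has a closed-form minimizer, $\hat{\beta}^{\text{ridge}} = (X^TX + \lambda I)^{-1} X^T y$. Below is a minimal NumPy sketch of that formula; the data and names are illustrative, not part of the article.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Closed-form ridge solution for centered/standardized X and y."""
    n_features = X.shape[1]
    # (X'X + lambda * I) is always invertible for lambda > 0
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Tiny illustration with simulated data
rng = np.random.default_rng(42)
X = rng.standard_normal((50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.standard_normal(50)
print(ridge_closed_form(X, y, lam=0.0))   # equals the OLS solution
print(ridge_closed_form(X, y, lam=10.0))  # coefficients shrink toward zero
```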
Bias-Variance Trade-off
- $\lambda = 0$: Equivalent to OLS. No bias, high variance.
- $\lambda \to \infty$: All $\beta_j \to 0$. High bias, low variance (the null model).
- Optimal $\lambda$: Found via cross-validation. Minimizes test error.
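A small sketch of this trade-off in code: fitting scikit-learn's Ridge over an increasing grid of penalties (scikit-learn calls $\lambda$ "alpha") shows the coefficient vector shrinking toward zero as the penalty grows. The data below is simulated purely for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
true_beta = np.array([3.0, -2.0, 1.5, 0.0, 0.5])
y = X @ true_beta + rng.standard_normal(100)

for alpha in [0.01, 1.0, 10.0, 100.0, 1000.0]:  # 0.01 is effectively OLS
    coefs = Ridge(alpha=alpha).fit(X, y).coef_
    print(f"alpha={alpha:>7.2f}  ||beta||^2={np.sum(coefs**2):7.3f}  coefs={np.round(coefs, 2)}")
```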
Geometric View
The constraint region for Ridge ($\sum_{j=1}^{p} \beta_j^2 \le t$) is a circle in two dimensions (a hypersphere in general). Because this region has no corners, the solution is shrunk toward zero but almost never lands exactly on an axis, which is why Ridge does not zero out coefficients the way Lasso's diamond-shaped L1 constraint region does.
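Formally, the picture corresponds to the constrained form of the problem, where the budget $t$ maps one-to-one onto $\lambda$ in the penalized form above:

$$
\hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 \le t
$$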
Worked Numerical Example
Scenario: Predicting House Price ($y$) from two highly correlated predictors, Size ($x_1$) and Rooms ($x_2$).
Data: Features are standardized.
OLS Result (No Penalty):
- (Note: Size and Rooms are highly correlated, so the OLS coefficient estimates have inflated variances and are unstable.)
Ridge Result (moderate $\lambda$):
- The penalty term $\lambda \sum_j \beta_j^2$ forces the coefficients down.
- Bias is introduced (estimates are lower), but variance is significantly reduced.
Ridge Result (very large $\lambda$):
- Coefficients are shrunk close to zero; the model becomes too simple (underfitting).
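A sketch of the same pattern with simulated data (all numbers and penalty values are illustrative): two nearly collinear predictors, then OLS compared with a moderate and a very large Ridge penalty.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(1)
n = 200
size = rng.standard_normal(n)
rooms = size + 0.1 * rng.standard_normal(n)      # nearly collinear with size
X = np.column_stack([size, rooms])
price = 3.0 * size + 2.0 * rooms + rng.standard_normal(n)

print("OLS:          ", np.round(LinearRegression().fit(X, price).coef_, 2))
print("Ridge (a=10): ", np.round(Ridge(alpha=10.0).fit(X, price).coef_, 2))
print("Ridge (a=1e4):", np.round(Ridge(alpha=1e4).fit(X, price).coef_, 2))
# Moderate alpha: coefficients are pulled toward each other and toward zero.
# Huge alpha: both coefficients collapse near zero (underfitting).
```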
Assumptions
All standard Multiple Linear Regression assumptions apply, but Ridge is more robust to violations of the no-multicollinearity assumption: because the penalty term adds $\lambda$ to the diagonal of $X^TX$, the matrix $X^TX + \lambda I$ is invertible even when predictors are highly correlated (or when $p > n$), so the estimates remain numerically stable where OLS breaks down.
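A quick numerical illustration of this stabilization (simulated, nearly collinear predictors; the $\lambda$ value is arbitrary): adding $\lambda I$ to $X^TX$ sharply reduces its condition number, which is what makes the Ridge solution well behaved.

```python
import numpy as np

rng = np.random.default_rng(2)
x1 = rng.standard_normal(100)
x2 = x1 + 1e-3 * rng.standard_normal(100)   # nearly identical predictor
X = np.column_stack([x1, x2])

XtX = X.T @ X
lam = 1.0
print("cond(X'X):        ", np.linalg.cond(XtX))                      # huge -> unstable OLS
print("cond(X'X + lam*I):", np.linalg.cond(XtX + lam * np.eye(2)))    # modest -> stable Ridge
```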
Limitations
- No Feature Selection: All predictors remain in the model. For sparse models, use Lasso Regression.
- Interpretation: Shrunk coefficients are biased; direct interpretation is less intuitive.
- Hyperparameter Tuning Required: Must use cross-validation to find the optimal $\lambda$.
Python Implementation
```python
from sklearn.linear_model import RidgeCV
from sklearn.preprocessing import StandardScaler
import numpy as np

# X: feature matrix, y: target vector (assumed already loaded)

# 1. Scale data (CRITICAL: the L2 penalty is not scale-invariant)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 2. Fit RidgeCV (cross-validation selects lambda; scikit-learn calls it "alpha")
alphas = np.logspace(-3, 3, 100)  # range of candidate lambdas
ridge = RidgeCV(alphas=alphas, cv=5).fit(X_scaled, y)

print(f"Optimal Lambda (alpha): {ridge.alpha_:.4f}")
print(f"Coefficients: {ridge.coef_}")
print(f"R-squared: {ridge.score(X_scaled, y):.4f}")
```
R Implementation
```r
library(glmnet)

# Assumes df holds the data, with the response in a column named "target" at index target_col
# 1. Prepare matrix (glmnet requires a matrix, not a data frame)
X <- as.matrix(df[, -target_col])
y <- df$target

# 2. Fit Ridge with cross-validation (alpha = 0 selects the Ridge penalty)
cv_fit <- cv.glmnet(X, y, alpha = 0)

# 3. Plot cross-validated error vs lambda
plot(cv_fit)

# 4. Best lambda
cat("Optimal Lambda:", cv_fit$lambda.min, "\n")

# 5. Coefficients at the best lambda
coef(cv_fit, s = "lambda.min")
```
Interpretation Guide
| Scenario | Interpretation |
|---|---|
| Large optimal $\lambda$ selected by cross-validation | Strong regularization needed; multicollinearity or overfitting was likely. |
| Coefficients shrink but none are 0 | All features contribute, but their impact is moderated. |
| OLS coefficient: 50, Ridge coefficient: 25 | Ridge has shrunk the effect by 50% to prevent overfitting. |
Related Concepts
- Lasso Regression - L1 penalty; performs feature selection.
- Elastic Net - Combines L1 and L2 penalties.
- Multiple Linear Regression - The unregularized baseline.
- Cross-Validation - Required for selecting $\lambda$.
- Bias-Variance Trade-off - The principle that explains why shrinkage can improve prediction.