Ridge Regression

Definition

Core Statement

Ridge Regression is a regularized form of Multiple Linear Regression that adds an L2 penalty (the sum of squared coefficients) to the loss function. This penalty shrinks coefficients towards zero (but not exactly to zero), making the model more robust to multicollinearity and reducing overfitting.


Purpose

  1. Handle Multicollinearity: When predictors are highly correlated, OLS estimates become unstable (small changes in the data can swing the coefficients wildly). Ridge stabilizes them; see the sketch after this list.
  2. Reduce Overfitting: By penalizing large coefficients, Ridge prevents the model from fitting noise.
  3. Improve Prediction: Often yields better out-of-sample predictions than OLS in high-dimensional settings.
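
A minimal sketch of points 1 and 2 on synthetic data (all names, seeds, and the penalty value are illustrative, not from the text): with two nearly identical predictors, the OLS coefficients tend to blow up and offset each other, while Ridge keeps both small and stable.

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)    # nearly perfectly collinear with x1
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.5, size=n)  # true signal comes through x1 only

print(LinearRegression().fit(X, y).coef_)   # typically large, offsetting values
print(Ridge(alpha=1.0).fit(X, y).coef_)     # both close to 1.5, summing to about 3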

When to Use

Use Ridge When...

  • You have many predictors (potentially more than observations).
  • Predictors are highly correlated (e.g., Variance Inflation Factor (VIF) above 5).
  • You want to prevent overfitting without necessarily selecting a subset of features.

Ridge is NOT Ideal When...

  • You need feature selection (coefficients don't become exactly 0). Use Lasso Regression instead.
  • Interpretability is paramount (shrunk coefficients are harder to interpret).


Theoretical Background

The Objective Function

Ridge minimizes RSS plus a penalty on coefficient magnitude:

\hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \left\{ \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right\}

Term              Meaning
Σᵢ (yᵢ − ŷᵢ)²     Residual Sum of Squares (RSS); fits the data.
λ Σⱼ βⱼ²          L2 penalty; shrinks the coefficients.
λ                 Tuning parameter; a larger λ means more shrinkage.
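
Because the objective is quadratic, Ridge has a closed-form solution, β̂ = (XᵀX + λI)⁻¹ Xᵀy. A minimal numpy sketch checking it against sklearn (synthetic, roughly standardized data; the names, seed, and λ value are illustrative):

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))                 # columns have mean ~0, variance ~1
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=100)
lam = 10.0

# beta = (X'X + lambda * I)^(-1) X'y
beta_closed = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# sklearn solves the same penalized problem (no intercept, so the formulas match)
beta_sklearn = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_
print(beta_closed)
print(beta_sklearn)   # the two estimates agree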

Bias-Variance Trade-off

Ridge deliberately trades a little bias for a large reduction in variance: shrinking the coefficients pulls the estimates away from their (unbiased) OLS values, but it also makes them fluctuate far less from sample to sample. For a well-chosen λ, the variance reduction outweighs the added bias and the expected prediction error goes down.
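
A small simulation sketch makes the trade-off concrete (synthetic data; all names, seeds, and λ values are illustrative): refitting on many resampled datasets, larger λ pulls the average estimate further from the true coefficients (more bias) but makes the estimates fluctuate less (less variance).

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(2)
true_beta = np.array([3.0, 3.0])

def coef_draws(lam, reps=500, n=40):
    draws = []
    for _ in range(reps):
        x1 = rng.normal(size=n)
        x2 = 0.95 * x1 + 0.05 * rng.normal(size=n)   # highly correlated predictors
        X = np.column_stack([x1, x2])
        y = X @ true_beta + rng.normal(size=n)
        draws.append(Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_)
    return np.array(draws)

for lam in (0.01, 10.0, 1000.0):
    d = coef_draws(lam)
    print(f"lambda={lam:>7}: bias ~ {(d.mean(axis=0) - true_beta).round(2)}, "
          f"std ~ {d.std(axis=0).round(2)}")
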
Geometric View

The constraint region for Ridge (Σⱼ βⱼ² ≤ t) is an L2 ball (a disk in two dimensions, a sphere in three). The ridge estimate is the point where the elliptical RSS contours of the OLS objective first touch this ball; shrinking the budget t pulls the solution from the OLS estimate toward the origin.
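
The penalized and constrained formulations are equivalent: for every λ ≥ 0 there is a budget t ≥ 0 giving the same solution,

\hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 \le t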


Worked Numerical Example

Ridge vs OLS Coefficients

Scenario: Predicting House Price (y) based on Size (x₁) and Number of Rooms (x₂).
Data: Features are standardized.
OLS Result (No Penalty):

  • β₁ = 100
  • β₂ = 100
  • (Note: Size and Rooms are highly correlated, inflating the variance of the estimates.)

Ridge Result (λ = 10):

  • The penalty term λ(β₁² + β₂²) forces the coefficients down.
  • β₁ = 75
  • β₂ = 75
  • Bias is introduced (the estimates are lower), but Variance is significantly reduced.

Ridge Result (λ = 1000):

  • β₁ ≈ 10
  • β₂ ≈ 10
  • Model becomes too simple (Underfitting); a quick code check of this shrinkage pattern follows below.
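
The exact numbers above are made up for illustration, but the qualitative pattern is easy to reproduce: on standardized, correlated predictors, the coefficients shrink toward zero as λ grows. A minimal sketch (synthetic data; the names, seed, and coefficient values are illustrative):

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(3)
n = 40
size = rng.normal(size=n)                                       # standardized "Size"
rooms = 0.9 * size + np.sqrt(1 - 0.9**2) * rng.normal(size=n)   # standardized "Rooms", corr ~0.9
X = np.column_stack([size, rooms])
y = 100 * size + 100 * rooms + rng.normal(scale=10, size=n)

print("OLS:         ", LinearRegression().fit(X, y).coef_.round(1))
for lam in (10, 1000):
    print(f"lambda = {lam:>4}:", Ridge(alpha=lam).fit(X, y).coef_.round(1))
# Coefficients are pulled toward 0 as lambda grows; lambda = 1000 underfits badly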

Assumptions

All standard Multiple Linear Regression assumptions (linearity, independence of errors, homoscedasticity, normality of errors) still apply in principle, but Ridge is considerably more robust to violations of the no-multicollinearity assumption.

Scaling is Mandatory

Because the penalty term Σⱼ βⱼ² treats all coefficients equally, variables must be standardized (mean 0, variance 1) before fitting. Otherwise the amount of shrinkage each variable receives depends on its units: a variable measured on a small numeric scale needs a large coefficient, which the penalty shrinks hard, while a variable on a large numeric scale gets a small coefficient and is barely penalized.
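
A quick way to see this (synthetic data; the feature names, seed, and α value are illustrative): the same amount of regularization treats two independent features very differently when they live on different numeric scales, and comparably once they are standardized.

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
length_m = rng.normal(size=100)               # feature on a small numeric scale
weight_g = rng.normal(scale=1000, size=100)   # feature on a much larger numeric scale
X = np.column_stack([length_m, weight_g])
y = 5 * length_m + 0.005 * weight_g + rng.normal(scale=0.5, size=100)

# Unscaled: the small-scale feature's large coefficient is shrunk hard (~2.5 vs 5),
# while the large-scale feature's tiny coefficient is barely touched.
print(Ridge(alpha=100).fit(X, y).coef_)

# Scaled: both features have the same standardized effect and are shrunk comparably.
print(Ridge(alpha=100).fit(StandardScaler().fit_transform(X), y).coef_)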


Limitations

Pitfalls

  1. No Feature Selection: All predictors remain in the model. For sparse models, use Lasso Regression.
  2. Interpretation: Shrunk coefficients are biased; direct interpretation is less intuitive.
  3. Hyperparameter Tuning Required: Must use cross-validation to find optimal λ.


Python Implementation

from sklearn.linear_model import RidgeCV
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_regression
import numpy as np

# 0. Example data so the snippet runs end-to-end (replace with your own X and y)
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# 1. Scale Data (CRITICAL)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 2. Fit RidgeCV (cross-validation over a grid of lambdas)
alphas = np.logspace(-3, 3, 100)  # candidate lambda (alpha) values
ridge = RidgeCV(alphas=alphas, cv=5).fit(X_scaled, y)

print(f"Optimal Lambda (alpha): {ridge.alpha_:.4f}")
print(f"Coefficients: {ridge.coef_}")
print(f"R-squared: {ridge.score(X_scaled, y):.4f}")

R Implementation

library(glmnet)

# 1. Prepare matrix inputs (glmnet requires a matrix, not a data frame).
#    Assumes df is your data frame, target_col is the index of the target column,
#    and that column is named "target".
X <- as.matrix(df[, -target_col])
y <- df$target

# 2. Fit Ridge with Cross-Validation (alpha = 0 for Ridge)
cv_fit <- cv.glmnet(X, y, alpha = 0)

# 3. Plot Error vs Lambda
plot(cv_fit)

# 4. Best Lambda
cat("Optimal Lambda:", cv_fit$lambda.min, "\n")

# 5. Coefficients at Best Lambda
coef(cv_fit, s = "lambda.min")
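
Note: cv.glmnet also reports lambda.1se, the largest λ whose cross-validated error is within one standard error of the minimum; it is a common, slightly more conservative alternative to lambda.min.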

Interpretation Guide

Scenario                                     Interpretation
Large λ selected                             Strong regularization was needed; multicollinearity or overfitting was likely.
Coefficients shrink but none are 0           All features contribute, but their impact is moderated.
OLS coefficient 50, Ridge coefficient 25     Ridge has shrunk the effect by 50% to prevent overfitting.