Ridge Regression

Definition

Core Statement

Ridge Regression is a regularized form of Multiple Linear Regression that adds an L2 penalty (the sum of squared coefficients) to the loss function. This penalty shrinks coefficients towards zero (but not exactly to zero), making the model more robust to multicollinearity and reducing overfitting.


Purpose

  1. Handle Multicollinearity: When predictors are highly correlated, OLS estimates become unstable (small changes in the data can swing the coefficients wildly). Ridge stabilizes them; see the sketch after this list.
  2. Reduce Overfitting: By penalizing large coefficients, Ridge prevents the model from fitting noise.
  3. Improve Prediction: Often yields better out-of-sample predictions than OLS in high-dimensional settings.
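
A minimal sketch of points 1 and 2 on synthetic data (all names, seeds, and the penalty value are illustrative, not from the text): with two nearly identical predictors, the OLS coefficients tend to blow up and offset each other, while Ridge keeps both small and stable.

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)    # nearly perfectly collinear with x1
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.5, size=n)  # true signal comes through x1 only

print(LinearRegression().fit(X, y).coef_)   # typically large, offsetting values
print(Ridge(alpha=1.0).fit(X, y).coef_)     # both close to 1.5, summing to about 3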

When to Use

Use Ridge When...

  • You have many predictors (potentially more than observations).
  • Predictors are highly correlated (e.g., Variance Inflation Factor (VIF) above 5).
  • You want to prevent overfitting without necessarily selecting a subset of features.

Ridge is NOT Ideal When...

  • You need feature selection (coefficients don't become exactly 0). Use Lasso Regression instead.
  • Interpretability is paramount (shrunk coefficients are harder to interpret).


Theoretical Background

The Objective Function

Ridge minimizes RSS plus a penalty on coefficient magnitude:

\hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \left\{ \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right\}

Term              Meaning
Σᵢ (yᵢ − ŷᵢ)²     Residual Sum of Squares (RSS); fits the data.
λ Σⱼ βⱼ²          L2 penalty; shrinks the coefficients.
λ                 Tuning parameter; a larger λ means more shrinkage.
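
Because the objective is quadratic, Ridge has a closed-form solution, β̂ = (XᵀX + λI)⁻¹ Xᵀy. A minimal numpy sketch checking it against sklearn (synthetic, roughly standardized data; the names, seed, and λ value are illustrative):

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))                 # columns have mean ~0, variance ~1
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=100)
lam = 10.0

# beta = (X'X + lambda * I)^(-1) X'y
beta_closed = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# sklearn solves the same penalized problem (no intercept, so the formulas match)
beta_sklearn = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_
print(beta_closed)
print(beta_sklearn)   # the two estimates agree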

Bias-Variance Trade-off

Ridge deliberately trades a little bias for a large reduction in variance: shrinking the coefficients pulls the estimates away from their (unbiased) OLS values, but it also makes them fluctuate far less from sample to sample. For a well-chosen λ, the variance reduction outweighs the added bias and the expected prediction error goes down.
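
A small simulation sketch makes the trade-off concrete (synthetic data; all names, seeds, and λ values are illustrative): refitting on many resampled datasets, larger λ pulls the average estimate further from the true coefficients (more bias) but makes the estimates fluctuate less (less variance).

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(2)
true_beta = np.array([3.0, 3.0])

def coef_draws(lam, reps=500, n=40):
    draws = []
    for _ in range(reps):
        x1 = rng.normal(size=n)
        x2 = 0.95 * x1 + 0.05 * rng.normal(size=n)   # highly correlated predictors
        X = np.column_stack([x1, x2])
        y = X @ true_beta + rng.normal(size=n)
        draws.append(Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_)
    return np.array(draws)

for lam in (0.01, 10.0, 1000.0):
    d = coef_draws(lam)
    print(f"lambda={lam:>7}: bias ~ {(d.mean(axis=0) - true_beta).round(2)}, "
          f"std ~ {d.std(axis=0).round(2)}")
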
Geometric View

The constraint region for Ridge (Σⱼ βⱼ² ≤ t) is an L2 ball (a disk in two dimensions, a sphere in three). The ridge estimate is the point where the elliptical RSS contours of the OLS objective first touch this ball; shrinking the budget t pulls the solution from the OLS estimate toward the origin.
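
The penalized and constrained formulations are equivalent: for every λ ≥ 0 there is a budget t ≥ 0 giving the same solution,

\hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 \le t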


Worked Numerical Example

Ridge vs OLS Coefficients

Scenario: Predicting House Price (y) based on Size (x₁) and Number of Rooms (x₂).
Data: Features are standardized.
OLS Result (No Penalty):

  • β₁ = 100
  • β₂ = 100
  • (Note: Size and Rooms are highly correlated, inflating the variance of the estimates.)

Ridge Result (λ = 10):

  • The penalty term λ(β₁² + β₂²) forces the coefficients down.
  • β₁ = 75
  • β₂ = 75
  • Bias is introduced (the estimates are lower), but Variance is significantly reduced.

Ridge Result (λ = 1000):

  • β₁ ≈ 10
  • β₂ ≈ 10
  • Model becomes too simple (Underfitting); a quick code check of this shrinkage pattern follows below.
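
The exact numbers above are made up for illustration, but the qualitative pattern is easy to reproduce: on standardized, correlated predictors, the coefficients shrink toward zero as λ grows. A minimal sketch (synthetic data; the names, seed, and coefficient values are illustrative):

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(3)
n = 40
size = rng.normal(size=n)                                       # standardized "Size"
rooms = 0.9 * size + np.sqrt(1 - 0.9**2) * rng.normal(size=n)   # standardized "Rooms", corr ~0.9
X = np.column_stack([size, rooms])
y = 100 * size + 100 * rooms + rng.normal(scale=10, size=n)

print("OLS:         ", LinearRegression().fit(X, y).coef_.round(1))
for lam in (10, 1000):
    print(f"lambda = {lam:>4}:", Ridge(alpha=lam).fit(X, y).coef_.round(1))
# Coefficients are pulled toward 0 as lambda grows; lambda = 1000 underfits badly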

Assumptions

All standard Multiple Linear Regression assumptions (linearity, independence of errors, homoscedasticity, normality of errors) still apply in principle, but Ridge is considerably more robust to violations of the no-multicollinearity assumption.

Scaling is Mandatory

Because the penalty term Σⱼ βⱼ² treats all coefficients equally, variables must be standardized (mean 0, variance 1) before fitting. Otherwise the amount of shrinkage each variable receives depends on its units: a variable measured on a small numeric scale needs a large coefficient, which the penalty shrinks hard, while a variable on a large numeric scale gets a small coefficient and is barely penalized.
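
A quick way to see this (synthetic data; the feature names, seed, and α value are illustrative): the same amount of regularization treats two independent features very differently when they live on different numeric scales, and comparably once they are standardized.

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
length_m = rng.normal(size=100)               # feature on a small numeric scale
weight_g = rng.normal(scale=1000, size=100)   # feature on a much larger numeric scale
X = np.column_stack([length_m, weight_g])
y = 5 * length_m + 0.005 * weight_g + rng.normal(scale=0.5, size=100)

# Unscaled: the small-scale feature's large coefficient is shrunk hard (~2.5 vs 5),
# while the large-scale feature's tiny coefficient is barely touched.
print(Ridge(alpha=100).fit(X, y).coef_)

# Scaled: both features have the same standardized effect and are shrunk comparably.
print(Ridge(alpha=100).fit(StandardScaler().fit_transform(X), y).coef_)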


Limitations

Pitfalls

  1. No Feature Selection: All predictors remain in the model. For sparse models, use Lasso Regression.
  2. Interpretation: Shrunk coefficients are biased; direct interpretation is less intuitive.
  3. Hyperparameter Tuning Required: Must use cross-validation to find optimal λ.


Python Implementation

from sklearn.linear_model import RidgeCV
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_regression
import numpy as np

# 0. Example data so the snippet runs end-to-end (replace with your own X and y)
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# 1. Scale Data (CRITICAL)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 2. Fit RidgeCV (cross-validation over a grid of lambdas)
alphas = np.logspace(-3, 3, 100)  # candidate lambda (alpha) values
ridge = RidgeCV(alphas=alphas, cv=5).fit(X_scaled, y)

print(f"Optimal Lambda (alpha): {ridge.alpha_:.4f}")
print(f"Coefficients: {ridge.coef_}")
print(f"R-squared: {ridge.score(X_scaled, y):.4f}")

R Implementation

library(glmnet)

# 1. Prepare matrix inputs (glmnet requires a matrix, not a data frame).
#    Assumes df is your data frame, target_col is the index of the target column,
#    and that column is named "target".
X <- as.matrix(df[, -target_col])
y <- df$target

# 2. Fit Ridge with Cross-Validation (alpha = 0 for Ridge)
cv_fit <- cv.glmnet(X, y, alpha = 0)

# 3. Plot Error vs Lambda
plot(cv_fit)

# 4. Best Lambda
cat("Optimal Lambda:", cv_fit$lambda.min, "\n")

# 5. Coefficients at Best Lambda
coef(cv_fit, s = "lambda.min")
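
Note: cv.glmnet also reports lambda.1se, the largest λ whose cross-validated error is within one standard error of the minimum; it is a common, slightly more conservative alternative to lambda.min.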

Interpretation Guide

Scenario                                     Interpretation
Large λ selected                             Strong regularization was needed; multicollinearity or overfitting was likely.
Coefficients shrink but none are 0           All features contribute, but their impact is moderated.
OLS coefficient 50, Ridge coefficient 25     Ridge has shrunk the effect by 50% to prevent overfitting.