Lasso Regression
Definition
Lasso (Least Absolute Shrinkage and Selection Operator) is a regularized regression method that adds an L1 penalty (the sum of absolute values of coefficients) to the loss function. Unlike Ridge, Lasso can shrink coefficients to exactly zero, effectively performing automatic feature selection.
Purpose
- Feature Selection: Identify the most important predictors by eliminating irrelevant ones.
- Reduce Overfitting: Penalize model complexity.
- Build Sparse Models: Create interpretable models with fewer variables.
When to Use
- You have many features and suspect only a few are important.
- You want automatic feature selection.
- You need an interpretable sparse model.
When Not to Use
- Your predictors are highly correlated: Lasso arbitrarily keeps one and drops the others. Use Elastic Net instead.
- All features are genuinely important: Ridge Regression may perform better because it shrinks coefficients without dropping any.
Theoretical Background
The Objective Function
Lasso minimizes RSS plus an L1 penalty:

$$\hat{\beta}^{\text{lasso}} = \arg\min_{\beta} \left\{ \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right\}$$

where $\lambda \ge 0$ controls the penalty strength: $\lambda = 0$ gives ordinary least squares, and larger $\lambda$ forces more coefficients to exactly zero.
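To make the objective concrete, here is a minimal NumPy sketch of the quantity being minimized; the function `lasso_objective` and the toy data are illustrative, not part of any library.

import numpy as np

def lasso_objective(X, y, beta, intercept, lam):
    # Residual sum of squares plus the L1 penalty; the intercept is not penalized.
    residuals = y - (intercept + X @ beta)
    return np.sum(residuals ** 2) + lam * np.sum(np.abs(beta))

# Toy data purely for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, 0.0, -1.0]) + rng.normal(scale=0.5, size=50)

print(lasso_objective(X, y, beta=np.array([2.0, 0.0, -1.0]), intercept=0.0, lam=1.0))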
Why Does Lasso Give Zero Coefficients?
Geometric Interpretation:
- The L1 constraint region ($\sum_{j} |\beta_j| \le t$) is a diamond (polytope) with corners on the axes.
- When the elliptical contours of the RSS function hit a corner of the diamond, the corresponding $\beta_j$ is exactly 0.
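This can also be seen numerically: as the penalty grows, coefficients do not merely shrink, they become exactly zero. A short sketch using scikit-learn's lasso_path on simulated data (the data and seed are illustrative):

import numpy as np
from sklearn.linear_model import lasso_path
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))
# Only the first two features carry signal
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=100)

X_scaled = StandardScaler().fit_transform(X)
alphas, coefs, _ = lasso_path(X_scaled, y)  # alphas returned from largest to smallest

# Count coefficients that are exactly zero at the strongest and weakest penalty
print("Zeros at largest alpha:", int(np.sum(coefs[:, 0] == 0)))   # typically all 5
print("Zeros at smallest alpha:", int(np.sum(coefs[:, -1] == 0)))  # typically 0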
Lasso vs Ridge
| Feature | Ridge (L2) | Lasso (L1) |
|---|---|---|
| Penalty | $\lambda \sum_j \beta_j^2$ | $\lambda \sum_j \lvert\beta_j\rvert$ |
| Coefficients | Shrunk towards zero | Some shrunk to exactly zero |
| Feature Selection | No | Yes |
| Correlated Predictors | Keeps all; shrinks equally | Keeps one; drops others |
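The "Correlated Predictors" row can be checked directly. A sketch on simulated data with two nearly identical predictors (the data, seed, and penalty grid are illustrative):

import numpy as np
from sklearn.linear_model import RidgeCV, LassoCV

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)   # nearly a copy of x1 (r close to 1)
X = np.column_stack([x1, x2])
y = 2 * x1 + 2 * x2 + rng.normal(size=200)

ridge = RidgeCV(alphas=np.logspace(-3, 3, 13)).fit(X, y)
lasso = LassoCV(cv=5, random_state=0).fit(X, y)

print("Ridge coefficients:", ridge.coef_)   # both non-zero, similar magnitude
print("Lasso coefficients:", lasso.coef_)   # often one is near or exactly zero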
Assumptions
Same as Multiple Linear Regression, with the note that:
Predictors must be standardized before fitting Lasso. The L1 penalty is applied to all coefficients on the same scale, so a feature measured in larger numeric units gets a smaller raw coefficient, is penalized less, and can distort which variables are selected.
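Because the penalty is scale-sensitive, standardization should also happen inside each cross-validation fold. A minimal sketch (assuming `X` and `y` are already defined, as in the Python implementation below) that bundles scaling and the Lasso fit in a Pipeline so the scaler is refit on each training fold:

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([("scaler", StandardScaler()), ("lasso", Lasso(max_iter=10000))])
param_grid = {"lasso__alpha": np.logspace(-3, 1, 30)}  # illustrative grid
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)
print("Best alpha:", search.best_params_["lasso__alpha"])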
Limitations
- Grouping Effect: If $X_1$ and $X_2$ are correlated, Lasso will pick one arbitrarily. Use Elastic Net to keep both.
- At most $n$ features: In high-dimensional settings ($p > n$), Lasso selects at most $n$ features.
- Biased Estimates: Selected coefficients are shrunk towards zero, so their values are biased.
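The $p > n$ case can be illustrated on simulated data; the sketch below is only a demonstration under arbitrary dimensions and seed.

import numpy as np
from sklearn.linear_model import LassoCV

# n = 20 observations, p = 100 features; Lasso still fits and returns a sparse model,
# and it can select at most n features before the solution saturates.
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 100))
y = X[:, :5] @ np.array([3.0, -2.0, 1.5, 1.0, -1.0]) + rng.normal(scale=0.1, size=20)

lasso = LassoCV(cv=5, random_state=1, max_iter=50000).fit(X, y)
print("Non-zero coefficients:", int(np.sum(lasso.coef_ != 0)), "out of 100")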
Python Implementation
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np
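# Assumes X is a pandas DataFrame of predictors and y is the target vector,
# both already loaded (an assumption; loading is not shown in this snippet).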
# 1. Scale Data (CRITICAL)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# 2. Fit LassoCV
lasso = LassoCV(cv=5, random_state=42).fit(X_scaled, y)
print(f"Optimal Lambda (alpha): {lasso.alpha_:.4f}")
# 3. Feature Selection Result
coef_df = pd.DataFrame({'Feature': X.columns, 'Coefficient': lasso.coef_})
selected = coef_df[coef_df['Coefficient'] != 0]
print(f"\n--- Selected Features ({len(selected)}/{len(X.columns)}) ---")
print(selected)
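Because the model above was fit on standardized features, lasso.coef_ is expressed per standard deviation of each predictor. If coefficients in the original units are needed, one option is the short follow-up sketch below, reusing scaler, lasso, and coef_df from above.

# Divide each standardized coefficient by that feature's standard deviation
# to express it per original unit of the feature.
coef_df["Per_Original_Unit"] = lasso.coef_ / scaler.scale_
print(coef_df[coef_df["Coefficient"] != 0])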
R Implementation
library(glmnet)
# 1. Prepare Matrix
X <- as.matrix(df[, -target_col])
y <- df$target
# 2. Fit Lasso with Cross-Validation (alpha = 1 for Lasso)
cv_fit <- cv.glmnet(X, y, alpha = 1)
# 3. Plot Error vs Lambda
plot(cv_fit)
# 4. Best Lambda
cat("Optimal Lambda:", cv_fit$lambda.min, "\n")
# 5. Selected Features (Non-Zero Coefficients)
coefs <- coef(cv_fit, s = "lambda.min")
selected <- coefs[coefs[, 1] != 0, , drop = FALSE]
print(selected)
Worked Numerical Example
Start: 20 features (Square_Feet, Bedrooms, Bathrooms, Age, Distance_to_School, etc.)
Lasso Results (λ = 0.05):
| Feature | OLS Coefficient | Lasso Coefficient | Selected? |
|---|---|---|---|
| Square_Feet | 150 | 142 | ✓ Yes |
| Bedrooms | 5,000 | 4,200 | ✓ Yes |
| Bathrooms | 8,000 | 6,800 | ✓ Yes |
| Age | -500 | -420 | ✓ Yes |
| Distance_to_School | -200 | 0 | ✗ Dropped |
| Garage_Size | 3,000 | 0 | ✗ Dropped |
| ... (14 more features) | ... | 0 | ✗ Dropped |
Result: Lasso selected 4 out of 20 features.
Prediction: A house with 2000 sq ft, 3 bed, 2 bath, 10 years old:
- Lasso: Price = 142(2000) + 4200(3) + 6800(2) - 420(10) = $306,000 (intercept omitted for simplicity)
- OLS (all 20 features): Price = $312,000 (but overfit to training noise)
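A quick check of the Lasso arithmetic (illustrative coefficient values only; as above, no intercept is included):

# Sum of coefficient times feature value for the four selected features
coefs = {"Square_Feet": 142, "Bedrooms": 4200, "Bathrooms": 6800, "Age": -420}
house = {"Square_Feet": 2000, "Bedrooms": 3, "Bathrooms": 2, "Age": 10}
price = sum(coefs[k] * house[k] for k in coefs)
print(price)  # 306000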
Interpretation Guide
| Scenario | Interpretation | Edge Case Notes |
|---|---|---|
| 5 of 50 coefficients non-zero | Lasso identified 5 key predictors | If λ decreased, more would be selected. Try λ path plot. |
| Important variable dropped | Likely correlated with included variable | Check: If X₁ and X₂ have r=0.95, Lasso picks stronger one arbitrarily. |
| All coefficients = 0 | λ too large OR no predictive signal | Try: Reduce λ by 10×. If still all zero, data may lack signal. |
| Many correlated vars, Lasso picks 1 | Expected behavior (grouping effect) | Solution: Use Elastic Net (α=0.5) to keep correlated groups. |
| Selected feature has unexpected sign | Possible confounding or collinearity | Compare to Ridge: If sign flips, investigate correlations. |
Common Pitfall Example
Scenario: Predicting customer churn using:
- Calls_to_Support (range: 0-20)
- Support_Minutes (range: 0-300)
- (These are obviously correlated: r = 0.92)
Lasso Result:
- Calls_to_Support: β = 0.15 (Selected)
- Support_Minutes: β = 0 (Dropped!)
The Trap: You conclude "Number of calls matters, but call duration doesn't!"
Reality: Both matter, but Lasso arbitrarily picked one because they're redundant.
Test: Remove Calls_to_Support, rerun Lasso:
- Support_Minutes: β = 0.008 (Now selected!)
Solution:
- Use Elastic Net (α = 0.5) to include both
- Or create composite: "Support_Engagement_Score" = weighted average
- Or interpret: "Customer support interaction (however measured) predicts churn"
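A sketch of the Elastic Net fix in the spirit of this scenario; the variable names, coefficients, and simulated "churn score" are hypothetical, not taken from a real churn dataset:

import numpy as np
from sklearn.linear_model import LassoCV, ElasticNetCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
calls = rng.integers(0, 21, size=500).astype(float)
minutes = calls * 15 + rng.normal(scale=10, size=500)   # strongly correlated with calls
churn_score = 0.1 * calls + 0.005 * minutes + rng.normal(scale=0.5, size=500)

X = StandardScaler().fit_transform(np.column_stack([calls, minutes]))

lasso = LassoCV(cv=5, random_state=7).fit(X, churn_score)
enet = ElasticNetCV(l1_ratio=0.5, cv=5, random_state=7).fit(X, churn_score)

print("Lasso coefficients:      ", lasso.coef_)   # may drop one of the redundant pair
print("Elastic Net coefficients:", enet.coef_)    # tends to keep both with shared weight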
Related Concepts
- Ridge Regression - L2 penalty; no feature selection.
- Elastic Net - Combines L1 and L2; better for correlated features.
- Cross-Validation - Required for selecting $\lambda$.
- Feature Selection