Quantile Regression
Definition
Core Statement
Quantile Regression extends linear regression to model the conditional quantiles (e.g., the median or the 90th percentile) of the response variable, rather than only the conditional mean. It gives a more complete picture of how predictors relate to the entire distribution of the response, not just its center.
Purpose
- Robustness (Median Regression): The 50th percentile (median) is robust to extreme outliers, unlike the mean targeted by OLS (see the sketch after this list).
- Heteroscedasticity Analysis: Understand how factors affect the spread or tails of the distribution.
- Risk Analysis: Model the "Worst Case" (e.g., 99th percentile of Loss) rather than the average case.
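To make the robustness point concrete, here is a minimal sketch (purely synthetic, illustrative numbers) showing how one extreme value drags the mean while barely moving the median:

import numpy as np

# Ten typical incomes (in $1,000s) plus one extreme outlier (synthetic values)
incomes = np.array([40, 42, 45, 47, 50, 52, 55, 58, 60, 63], dtype=float)
with_outlier = np.append(incomes, 5_000.0)  # a single extreme observation

print(f"Mean   without / with outlier: {incomes.mean():.1f} / {with_outlier.mean():.1f}")
print(f"Median without / with outlier: {np.median(incomes):.1f} / {np.median(with_outlier):.1f}")
# The mean jumps roughly tenfold (51.2 -> 501.1); the median moves only from 51.0 to 52.0.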
When to Use
Use Quantile Regression When...
- Data has Outliers: OLS results are distorted by extreme values.
- Heteroscedasticity: Variance is not constant (OLS assumption violated).
- Interest in Tails: You care about the high performers or the at-risk group, not the average (e.g., "What affects birth weight among low-weight infants?").
Worked Example: Income Inequality
Problem
Does Education increase Income equally for everyone?
Run OLS vs Quantile Regression (e.g., at the 10th, 50th, and 90th percentiles).
Results:
- OLS (Mean): Each year of education adds $5,000.
- QR (10th percentile, Low Earners): Adds $1,000.
- QR (50th percentile, Median): Adds $4,000.
- QR (90th percentile, High Earners): Adds $15,000.
Interpretation:
Education has a much higher payoff for high-earners (perhaps due to elite schools/networks). OLS missed this nuance by averaging everything into a single number.
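A small simulation can reproduce this pattern. The sketch below uses assumed synthetic data (column names Education and Income chosen to match the example) in which the spread of income widens with education, then compares the OLS slope with quantile-regression slopes at the 10th, 50th, and 90th percentiles:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data: income noise grows with education, so the return to education
# differs across quantiles even though OLS reports a single average slope.
rng = np.random.default_rng(0)
n = 5_000
education = rng.uniform(8, 20, n)                        # years of schooling
noise = rng.standard_normal(n) * (1 + 0.4 * education)   # heteroscedastic errors
income = 10 + 5 * education + noise                      # in $1,000s
df = pd.DataFrame({"Education": education, "Income": income})

ols_slope = smf.ols("Income ~ Education", df).fit().params["Education"]
print(f"OLS slope: {ols_slope:.2f}")

for q in (0.1, 0.5, 0.9):
    slope = smf.quantreg("Income ~ Education", df).fit(q=q).params["Education"]
    print(f"QR slope at tau={q}: {slope:.2f}")
# Expect the tau=0.9 slope to exceed the tau=0.1 slope, mirroring the
# "higher payoff for high earners" pattern described above.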
Theoretical Background
Loss Function
- OLS minimizes the sum of squared residuals: $\min_\beta \sum_i (y_i - x_i^\top \beta)^2$.
- Quantile Regression minimizes a sum of asymmetrically weighted absolute residuals: $\min_\beta \sum_i \rho_\tau(y_i - x_i^\top \beta)$, where $\rho_\tau(u) = u\,(\tau - \mathbb{1}\{u < 0\})$ is the "check function" (a tilted absolute value). For the median ($\tau = 0.5$), this reduces to $\tfrac{1}{2}\lvert u\rvert$, so minimizing it is equivalent to minimizing the Mean Absolute Error (MAE).
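The check function is easy to compute directly. In the sketch below, check_loss is a hypothetical helper (not a library function); it shows that at tau = 0.5 the loss is just half the absolute error, while at tau = 0.9 under-predictions are penalized nine times as heavily as over-predictions:

import numpy as np

def check_loss(residual, tau):
    # Pinball / check loss: rho_tau(u) = u * (tau - 1{u < 0})
    u = np.asarray(residual, dtype=float)
    return u * (tau - (u < 0))

u = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])
print(check_loss(u, 0.5))  # equals 0.5 * |u|, so minimizing it minimizes the MAE
print(check_loss(u, 0.9))  # positive residuals (under-predictions) weighted 0.9, negative ones 0.1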
Assumptions
- The conditional quantile of interest is modeled as linear in the predictors, and observations are independent. Unlike OLS, no normality or constant-variance assumption on the errors is required.
Limitations and Pitfalls
- Crossing Quantiles: Because each quantile is fit separately, estimated quantile lines can cross (e.g., the predicted 90th percentile falls below the predicted 50th). True quantiles cannot cross, so crossing signals model misspecification or too little data in that region; see the diagnostic sketch after this list.
- Computation: Slower than OLS; the estimate comes from solving a linear program rather than a closed-form matrix solution.
- Sample Size: Extreme quantiles (deep in the tails) require very large samples to yield stable coefficient estimates.
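A quick way to diagnose crossing is to fit several quantiles and check that, for every observation, the predictions are non-decreasing in tau. The sketch below uses a small synthetic dataset (kept small on purpose, since crossings are most likely when data are sparse):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Small, noisy synthetic sample (assumed for illustration)
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 60)
y = 2 * x + rng.standard_normal(60) * (1 + x)
df = pd.DataFrame({"x": x, "y": y})

taus = [0.1, 0.5, 0.9]
preds = np.column_stack([
    smf.quantreg("y ~ x", df).fit(q=t).predict(df) for t in taus
])
# Fitted quantiles should be non-decreasing in tau for every row; count violations.
n_crossed = int((np.diff(preds, axis=1) < 0).any(axis=1).sum())
print(f"Observations with crossed quantile predictions: {n_crossed} of {len(df)}")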
Python Implementation
import statsmodels.formula.api as smf
import pandas as pd
# Load Data
df = pd.read_csv('salary_data.csv')
# 1. OLS (Mean)
model_ols = smf.ols('Income ~ Education', df).fit()
print(f"OLS Coeff: {model_ols.params['Education']:.2f}")
# 2. Median Regression (tau=0.5)
model_med = smf.quantreg('Income ~ Education', df).fit(q=0.5)
print(f"Median Coeff: {model_med.params['Education']:.2f}")
# 3. 90th Percentile
model_90 = smf.quantreg('Income ~ Education', df).fit(q=0.9)
print(f"90th %ile Coeff: {model_90.params['Education']:.2f}")
# Comparison gives insight into heterogeneity.
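As an optional extension (a sketch that reuses the df and imports from the block above), sweeping a grid of quantiles shows how the Education coefficient evolves across the income distribution, with confidence intervals from statsmodels:

import numpy as np

# Sweep tau = 0.1 ... 0.9 and report the Education coefficient with its 95% CI
for q in np.arange(0.1, 1.0, 0.1):
    res = smf.quantreg('Income ~ Education', df).fit(q=q)
    lo, hi = res.conf_int().loc['Education']
    print(f"tau={q:.1f}: coeff={res.params['Education']:.2f} (95% CI {lo:.2f} to {hi:.2f})")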
Related Concepts
- Simple Linear Regression - The baseline (Mean).
- Descriptive Statistics - Median vs Mean.
- Heteroscedasticity - A violation of OLS assumptions that QR handles naturally.
- Loss Function - L1 (Absolute) vs L2 (Squared).