Quantile Regression

Definition

Core Statement

Quantile Regression extends linear regression to model conditional quantiles of the response variable (e.g., the median or the 90th percentile) rather than only the conditional mean. Fitting several quantiles gives a more complete view of how the predictors relate to the whole distribution of the response, not just its center.


Purpose

  1. Robustness (Median Regression): The conditional median (τ=0.5) is robust to extreme outliers in the response, whereas the conditional mean estimated by OLS is not.
  2. Heteroscedasticity Analysis: Understand how factors affect the spread or tails of the distribution.
  3. Risk Analysis: Model the "Worst Case" (e.g., 99th percentile of Loss) rather than the average case.
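
The robustness claim in point 1 can be illustrated without any regression at all. The following is a minimal sketch using only the standard library; the income figures are invented purely for illustration and do not come from any dataset.

```python
from statistics import mean, median

# Hypothetical incomes (in thousands); values are invented for illustration.
incomes = [30, 32, 35, 38, 40]
incomes_with_outlier = incomes + [5000]  # add a single extreme value

# How far does each summary move when the outlier is added?
mean_shift = mean(incomes_with_outlier) - mean(incomes)
median_shift = median(incomes_with_outlier) - median(incomes)

print(f"Mean shift:   {mean_shift:.1f}")    # 827.5
print(f"Median shift: {median_shift:.1f}")  # 1.5
```

One extreme value drags the mean by hundreds of units while the median barely moves. Median regression inherits this behavior for outliers in the response.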

When to Use

Use Quantile Regression When...

  • Data has Outliers: OLS results are distorted by extreme values.
  • Heteroscedasticity: Variance is not constant (OLS assumption violated).
  • Interest in Tails: You care about the high-performers or the at-risk group, not the average (e.g., "What affects birth weight in low-weight infants?").


Worked Example: Income Inequality

Problem

Does Education increase Income equally for everyone?
Fit OLS and Quantile Regression at τ = 0.1, 0.5, and 0.9, then compare the Education coefficients.

Results:

Interpretation:
Education has a much higher payoff for high-earners (perhaps due to elite schools/networks). OLS missed this nuance by averaging everything into a single number.


Theoretical Background

Loss Function
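
Quantile regression is standardly defined as minimizing the check (or "pinball") loss, ρ_τ(u) = u · (τ − 1[u < 0]) with residual u = y − ŷ. For τ = 0.5 this is half the absolute error. For other τ it penalizes under- and over-prediction asymmetrically. The sketch below is a plain-Python illustration, not the routine any particular library uses. The function names `pinball_loss` and `best_constant` and the toy data are assumptions made for this example.

```python
def pinball_loss(y_true, y_pred, tau):
    """Average check (pinball) loss at quantile level tau in (0, 1)."""
    total = 0.0
    for y, y_hat in zip(y_true, y_pred):
        u = y - y_hat  # residual
        # Under-prediction (u >= 0) costs tau * u;
        # over-prediction (u < 0) costs (1 - tau) * |u|.
        total += tau * u if u >= 0 else (tau - 1) * u
    return total / len(y_true)


# Toy data (invented). Search a grid for the constant prediction
# that minimizes the loss at a given tau.
y = [1, 2, 3, 4, 100]
candidates = [c / 10 for c in range(0, 1001)]  # 0.0, 0.1, ..., 100.0


def best_constant(tau):
    return min(candidates, key=lambda c: pinball_loss(y, [c] * len(y), tau))


print(best_constant(0.5))  # 3.0, the sample median
print(best_constant(0.9))  # 100.0, the largest observation
```

With no predictors, the minimizer of the check loss is a sample quantile. Quantile regression applies the same loss with a linear function of the predictors in place of the constant.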


Assumptions


Limitations

Pitfalls

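  1. Crossing Quantiles: Because each quantile is fitted separately with no ordering constraint, the fitted lines can cross (e.g., a predicted 90th percentile below the predicted 50th percentile). True conditional quantiles cannot cross, so a crossing is an estimation artifact; it often signals sparse data in that region or a misspecified model.
  2. Computation: Slower than OLS. There is no closed-form solution, so the fit requires Linear Programming or an iterative method rather than simple Matrix Algebra.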
  3. Sample Size: Tails (τ=0.99) require very large samples to estimate stable coefficients.
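
Pitfall 1 can be checked mechanically once predictions at several quantile levels are available. The sketch below flags observations whose predicted quantiles are out of order. It also shows one common post-hoc remedy, sorting each observation's predictions across τ (often called rearrangement). The remedy is offered here as an option, not as something the text above prescribes. The helper names `find_crossings` and `rearrange` and the prediction values are hypothetical.

```python
def find_crossings(preds_by_tau):
    """Return indices of observations whose predictions decrease as tau grows.

    preds_by_tau maps each quantile level tau to a list of predictions,
    one per observation.
    """
    taus = sorted(preds_by_tau)
    n = len(preds_by_tau[taus[0]])
    crossed = []
    for i in range(n):
        row = [preds_by_tau[t][i] for t in taus]
        if any(lo > hi for lo, hi in zip(row, row[1:])):
            crossed.append(i)
    return crossed


def rearrange(preds_by_tau):
    """Sort each observation's predictions so they are non-decreasing in tau."""
    taus = sorted(preds_by_tau)
    n = len(preds_by_tau[taus[0]])
    fixed = {t: [] for t in taus}
    for i in range(n):
        row = sorted(preds_by_tau[t][i] for t in taus)
        for t, value in zip(taus, row):
            fixed[t].append(value)
    return fixed


# Invented predictions for three observations at three quantile levels.
preds = {
    0.1: [20.0, 25.0, 30.0],
    0.5: [35.0, 40.0, 52.0],
    0.9: [60.0, 55.0, 50.0],  # third observation: 90th below 50th
}

crossed = find_crossings(preds)
fixed = rearrange(preds)
print(crossed)                  # [2]
print(find_crossings(fixed))    # []
```

Sorting removes the inconsistency in the predictions but does not address its cause, so frequent crossings are still worth investigating.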


Python Implementation

import statsmodels.formula.api as smf
import pandas as pd

# Load data; the formulas below expect 'Income' and 'Education' columns
df = pd.read_csv('salary_data.csv')

# 1. OLS (conditional mean)
model_ols = smf.ols('Income ~ Education', data=df).fit()
print(f"OLS Coeff: {model_ols.params['Education']:.2f}")

# 2. Median Regression (tau=0.5)
model_med = smf.quantreg('Income ~ Education', data=df).fit(q=0.5)
print(f"Median Coeff: {model_med.params['Education']:.2f}")

# 3. 90th Percentile (tau=0.9)
model_90 = smf.quantreg('Income ~ Education', data=df).fit(q=0.9)
print(f"90th %ile Coeff: {model_90.params['Education']:.2f}")

# If the Education coefficient differs markedly across quantiles, its effect
# varies across the income distribution (heterogeneity that OLS averages out).