Instrumental Variables (IV)
Definition
Core Statement
Instrumental Variables (IV) is a method to estimate causal effects when there is endogeneity (correlation between a predictor and the error term) due to omitted variables, measurement error, or simultaneity. It uses a third variable (instrument) that affects the outcome only through the endogenous predictor.
Purpose
- Estimate causal relationships in observational data when experiments are impossible.
- Address omitted variable bias and reverse causality.
- Isolate the "clean" variation in the endogenous variable.
When to Use
Use IV When...
- You suspect endogeneity (predictor is correlated with error).
- You have a valid instrument.
- OLS would give biased and inconsistent estimates.
Limitations
- Finding a valid instrument is extremely difficult.
- Weak instruments produce IV estimates that are biased in finite samples and can be less reliable than OLS.
Theoretical Background
The Problem
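In the model $Y = \beta_0 + \beta_1 X + \epsilon$, OLS is biased and inconsistent when $Cov(X, \epsilon) \neq 0$.
Example: Effect of Education on Wages.
- Endogeneity: Ability affects both Education and Wages but is unobserved, so it sits in the error term and $Cov(X, \epsilon) \neq 0$.
- OLS Bias: The coefficient on Education captures part of Ability's effect, not just Education's effect.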
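The bias can be seen in a small simulation. This is a minimal sketch on synthetic data, not real wage data: the coefficients (a true return of 0.08, the ability loadings, the noise levels) and the helper `slope` are assumptions chosen for illustration. Under these assumed values the OLS slope converges to 0.08 + 0.10 × 2 / 5 = 0.12 rather than 0.08.

```python
import random

def slope(x, y):
    """OLS slope from a simple regression of y on x (with intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

random.seed(0)
n = 20_000
true_beta = 0.08  # assumed true return to education (hypothetical)

# Ability is unobserved by the analyst.
ability = [random.gauss(0, 1) for _ in range(n)]
# Education depends on ability, so it is endogenous.
education = [12 + 2.0 * a + random.gauss(0, 1) for a in ability]
# Wages depend on education AND ability; ability ends up in the error term.
log_wage = [1.0 + true_beta * e + 0.10 * a + random.gauss(0, 0.2)
            for e, a in zip(education, ability)]

ols_beta = slope(education, log_wage)
print(f"true beta = {true_beta:.3f}, OLS estimate = {ols_beta:.3f}")
```

The OLS estimate overstates the return to education because it attributes part of ability's effect to schooling.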
The Solution: Instrument ($Z$)
An instrument $Z$ must satisfy two conditions:
- Relevance: $Z$ is correlated with $X$. ($Cov(Z, X) \neq 0$).
- Exclusion Restriction: $Z$ affects $Y$ only through $X$. ($Cov(Z, \epsilon) = 0$).
Example Instrument: Distance to nearest college.
- Relevant: Distance affects Education (closer = more likely to attend).
- Exclusion: Distance does not directly affect Wages (only through Education).
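Relevance and exclusion are exactly what is needed to recover the causal coefficient. A short derivation, assuming the linear model $Y = \beta_0 + \beta_1 X + \epsilon$ with a single instrument $Z$ and no controls:

$$
\begin{aligned}
Cov(Z, Y) &= Cov(Z,\ \beta_0 + \beta_1 X + \epsilon) \\
          &= \beta_1\, Cov(Z, X) + Cov(Z, \epsilon) \\
          &= \beta_1\, Cov(Z, X) && \text{(exclusion: } Cov(Z, \epsilon) = 0\text{)} \\
\Rightarrow\quad \beta_1 &= \frac{Cov(Z, Y)}{Cov(Z, X)} && \text{(relevance: } Cov(Z, X) \neq 0\text{)}
\end{aligned}
$$

If relevance fails, the denominator is zero and the ratio is undefined. If it is merely close to zero, small violations of exclusion in the numerator are blown up, which is the weak-instrument problem.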
Two-Stage Least Squares (2SLS)
- Stage 1: Regress $X$ on $Z$ (and controls). Obtain the fitted values $\hat{X}$.
- Stage 2: Regress $Y$ on $\hat{X}$ (and the same controls). The coefficient on $\hat{X}$ is the IV estimate.
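The two stages can be sketched by hand for the simplest case of one instrument and no controls. This is an illustrative sketch on synthetic data: the data-generating values and the helper `ols` are assumptions, not part of any library.

```python
import random

def ols(x, y):
    """Return (intercept, slope) of a simple OLS regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (c - my) for a, c in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    b = sxy / sxx
    return my - b * mx, b

random.seed(1)
n = 20_000
true_beta = 0.08  # assumed true effect (hypothetical)

ability = [random.gauss(0, 1) for _ in range(n)]    # unobserved confounder
distance = [random.gauss(0, 1) for _ in range(n)]   # instrument Z, independent of ability
education = [12 - 1.0 * z + 2.0 * a + random.gauss(0, 1)   # endogenous X
             for z, a in zip(distance, ability)]
log_wage = [1.0 + true_beta * x + 0.10 * a + random.gauss(0, 0.2)  # outcome Y
            for x, a in zip(education, ability)]

# Stage 1: regress X on Z and form the fitted values X_hat.
a1, b1 = ols(distance, education)
education_hat = [a1 + b1 * z for z in distance]

# Stage 2: regress Y on X_hat; the slope is the 2SLS estimate.
_, beta_2sls = ols(education_hat, log_wage)
_, beta_ols = ols(education, log_wage)

print(f"OLS: {beta_ols:.3f}  2SLS: {beta_2sls:.3f}  true: {true_beta:.3f}")
```

Running the second stage by hand gives the right point estimate but the wrong standard errors, because the residuals are computed from $\hat{X}$ instead of $X$. In practice a dedicated 2SLS routine should be used for inference.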
Assumptions
Limitations
Pitfalls
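- Weak Instruments: If the first-stage F-statistic is below about 10 (a common rule of thumb), the IV estimate is biased toward the OLS estimate and its standard errors are unreliable. It can be worse than biased OLS.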
- Exclusion Restriction is Untestable: You must justify it theoretically.
- Local Average Treatment Effect (LATE): IV estimates the effect for compliers (those whose X is affected by Z), not the entire population.
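The weak-instrument check can be illustrated with a hand-computed first-stage F-statistic. This is a sketch for one instrument and no controls, where F reduces to $(n - 2) R^2 / (1 - R^2)$. The function name `first_stage_f` and the synthetic coefficients are assumptions for illustration.

```python
import random

def first_stage_f(z, x):
    """F-statistic for H0: slope = 0 in a simple regression of x on z.

    With one instrument and no controls, F = (n - 2) * R^2 / (1 - R^2).
    """
    n = len(z)
    mz, mx = sum(z) / n, sum(x) / n
    szz = sum((a - mz) ** 2 for a in z)
    sxx = sum((b - mx) ** 2 for b in x)
    szx = sum((a - mz) * (b - mx) for a, b in zip(z, x))
    r2 = szx ** 2 / (szz * sxx)
    return (n - 2) * r2 / (1 - r2)

random.seed(2)
n = 500
z = [random.gauss(0, 1) for _ in range(n)]
noise = [random.gauss(0, 1) for _ in range(n)]

x_strong = [1.0 * zi + e for zi, e in zip(z, noise)]   # instrument moves X a lot
x_weak = [0.02 * zi + e for zi, e in zip(z, noise)]    # instrument barely moves X

f_strong = first_stage_f(z, x_strong)
f_weak = first_stage_f(z, x_weak)
print(f"strong first-stage F = {f_strong:.1f}")
print(f"weak   first-stage F = {f_weak:.1f}")
```

The first design gives an F far above the threshold of 10. The second will typically fall below it, which signals that the 2SLS estimate built on that instrument should not be trusted.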
Python Implementation
from linearmodels.iv import IV2SLS
import pandas as pd
# df is assumed to be a pandas DataFrame that is already loaded and contains the columns below.
# Data: Y = Wage, X = Education (Endog), Z = Distance (Instrument), W = Experience (Control)
# Formula: dependent ~ exogenous + [endogenous ~ instruments]
model = IV2SLS.from_formula('Wage ~ 1 + Experience + [Education ~ Distance]', data=df)
results = model.fit()
print(results.summary)
# First-Stage Diagnostics
print(results.first_stage.diagnostics)
# Check F-statistic > 10
R Implementation
library(AER)
# Formula: Y ~ Exog + Endog | Exog + Instruments
model <- ivreg(Wage ~ Experience + Education | Experience + Distance, data = df)
summary(model, diagnostics = TRUE)
# Diagnostics:
# - Weak Instruments: F-stat should be > 10.
# - Wu-Hausman: Tests if OLS is consistent. If significant, IV is needed.
Interpretation Guide
| Output | Interpretation |
|---|---|
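| IV Coef (Education) = 0.08 | If Wage is measured in logs, each additional year of education raises wages by about 8% among compliers, with endogeneity addressed by the instrument. |
| First-Stage F = 45 | Strong instrument by the F > 10 rule of thumb. Weak-instrument bias is unlikely to be a concern. |
| First-Stage F = 3 | Weak instrument. IV estimate is biased and its inference is unreliable. Do not trust. |
| Wu-Hausman p < 0.05 | OLS and IV estimates differ significantly, which indicates endogeneity if the instrument is valid. IV is preferred. |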
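The Wu-Hausman row can be made concrete with the regression-based form of the test: add the first-stage residuals to the outcome regression and test whether their coefficient is zero. The sketch below assumes one endogenous regressor, one instrument, no controls, and synthetic data with hypothetical coefficients. Packaged routines may report an F-statistic, which in this single-regressor case is the square of the t-statistic computed here.

```python
import random

def demean(v):
    """Subtract the sample mean so intercepts drop out of the regressions."""
    m = sum(v) / len(v)
    return [a - m for a in v]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

random.seed(3)
n = 5_000
ability = [random.gauss(0, 1) for _ in range(n)]   # unobserved confounder
z = [random.gauss(0, 1) for _ in range(n)]         # instrument
x = [1.0 * zi + 1.0 * a + random.gauss(0, 1) for zi, a in zip(z, ability)]
y = [0.08 * xi + 0.5 * a + random.gauss(0, 1) for xi, a in zip(x, ability)]

zd, xd, yd = demean(z), demean(x), demean(y)

# First stage: residuals v_hat = X - pi_hat * Z.
pi_hat = dot(zd, xd) / dot(zd, zd)
v = [xi - pi_hat * zi for xi, zi in zip(xd, zd)]

# Augmented regression of Y on X and v_hat, solved via the 2x2 normal equations.
sxx, svv, sxv = dot(xd, xd), dot(v, v), dot(xd, v)
sxy, svy = dot(xd, yd), dot(v, yd)
det = sxx * svv - sxv ** 2
b_x = (svv * sxy - sxv * svy) / det   # coefficient on X (matches the 2SLS estimate)
b_v = (sxx * svy - sxv * sxy) / det   # coefficient on the first-stage residual

# t-statistic on v_hat: a large |t| is evidence that X is endogenous.
resid = [yi - b_x * xi - b_v * vi for yi, xi, vi in zip(yd, xd, v)]
s2 = dot(resid, resid) / (n - 3)
t_v = b_v / (s2 * sxx / det) ** 0.5
print(f"coef on X = {b_x:.3f}, t-stat on first-stage residual = {t_v:.1f}")
```

Because ability drives both `x` and `y` in this design, the residual term is strongly significant, which corresponds to the "Wu-Hausman p < 0.05" row. The conclusion is only as good as the instrument: the test takes validity of $Z$ as given.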
Related Concepts
- Multiple Linear Regression - The biased OLS baseline.
- Propensity Score Matching (PSM) - Alternative for selection bias.
- Difference-in-Differences (DiD)
- Regression Discontinuity Design (RDD)