Leverage (Hat Matrix)

Overview

Definition

Leverage ($h_{ii}$) measures how far an observation's independent variable values are from the mean of those values. High-leverage points are outliers in X-space and have the potential to exert significant influence on the regression coefficients.

Leverage and Influence


1. Mathematical Derivation

In OLS regression ($Y = X\beta + \varepsilon$), the fitted values $\hat{Y}$ are obtained by projecting $Y$ onto the space spanned by the columns of $X$:

$$\hat{Y} = X(X^T X)^{-1} X^T Y = HY$$

The $n \times n$ matrix $H = X(X^T X)^{-1} X^T$ is called the Hat Matrix because it puts the "hat" on $Y$.
The diagonal elements $h_{ii}$ are the leverage values.
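
The projection above can be checked numerically. The following is a minimal sketch, assuming a small hypothetical design matrix (an intercept column plus one predictor); the function name hat_matrix and the data are illustrative, not part of any library.

```python
import numpy as np

def hat_matrix(X):
    """Return H = X (X^T X)^{-1} X^T for a full-column-rank design matrix X."""
    X = np.asarray(X, dtype=float)
    # Solve (X^T X) B = X^T instead of forming the inverse explicitly.
    B = np.linalg.solve(X.T @ X, X.T)
    return X @ B

# Hypothetical design matrix: intercept column plus one predictor.
x = np.array([1.0, 2.0, 3.0, 4.0, 10.0])
X = np.column_stack([np.ones_like(x), x])

H = hat_matrix(X)
leverage = np.diag(H)  # the h_ii values

print(leverage)        # the last point (x = 10) has the largest leverage
print(leverage.sum())  # the trace of H equals the number of columns of X
```

Because H is a projection matrix, applying it twice changes nothing (H @ H equals H up to rounding), and its diagonal sums to the number of fitted parameters.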


2. Properties


3. Identification Thresholds

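A common rule of thumb identifies a point as high leverage if:

$$h_{ii} > 2\bar{h} = \frac{2p}{n}$$

Here $p$ is the number of estimated coefficients (including the intercept) and $n$ is the number of observations. The average leverage is $\bar{h} = p/n$ because the leverage values sum to $p$.

Points exceeding $3p/n$ are considered extremely high leverage.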
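
As a rough sketch of how the two cutoffs might be applied, the helper below labels each leverage value. The function name classify_leverage, the label strings, and the leverage values are all hypothetical and chosen only for illustration.

```python
def classify_leverage(h, p):
    """Label each leverage value against the 2p/n and 3p/n rules of thumb."""
    n = len(h)
    high_cutoff = 2 * p / n
    extreme_cutoff = 3 * p / n
    labels = []
    for value in h:
        if value > extreme_cutoff:
            labels.append("extreme")
        elif value > high_cutoff:
            labels.append("high")
        else:
            labels.append("ok")
    return labels

# Hypothetical leverage values for n = 10 observations and p = 2 parameters,
# so the cutoffs are 2p/n = 0.4 and 3p/n = 0.6.
h = [0.10, 0.10, 0.11, 0.45, 0.11, 0.12, 0.65, 0.12, 0.12, 0.12]
labels = classify_leverage(h, p=2)
print(labels)
```

These cutoffs are conventions rather than formal tests, so flagged points are candidates for inspection, not automatic removals.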


4. Worked Example: The Outlier CEO

Problem

You are modeling Income vs Age for a small town.

  • Most people (n=20) are aged 20-60, earning $30k-$100k.
  • Person X (CEO): Age = 95, Income = $50,000.

Question: Does Person X have high leverage? Is it influential?

Analysis:

  1. Check X-Space (Age):

    • Mean Age ≈ 40.
    • Person X Age = 95. This is far from the centroid.
    • Result: High Leverage ($h_{ii}$ will be large).
  2. Check Y-Space (Residual):

    • Suppose the model predicts an income of roughly $40k-$60k for a 95-year-old (retirement), and the actual income is $50k.
    • Then the residual is small.
  3. Conclusion:

    • Person X is a High Leverage point (extreme Age).
    • However, because the income fits the trend, it is Low Influence. It merely anchors the regression line, reducing standard errors (Good Leverage).
    • Contrast: If the CEO earned $10M, they would be High Leverage AND High Influence (pulling the slope up).
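
The reasoning above can be reproduced with a small simulation. This is a sketch under stated assumptions: the 20 residents' ages and incomes are made up (evenly spaced ages, a weak trend with ±$20k scatter), and only Person X's age and the two incomes ($50k and $10M) come from the example.

```python
import numpy as np

# Hypothetical town: 20 residents aged 20..58, plus Person X aged 95.
ages = np.append(np.arange(20, 60, 2), 95).astype(float)
n, p = len(ages), 2  # 21 observations; intercept + slope

# In simple regression, leverage depends only on the ages:
# h_ii = 1/n + (x_i - mean)^2 / sum_j (x_j - mean)^2
centered = ages - ages.mean()
leverage = 1 / n + centered**2 / np.sum(centered**2)
ceo_leverage = leverage[-1]

# Illustrative incomes: weak downward trend with +/- $20k scatter.
scatter = 20_000 * np.tile([1.0, -1.0, -1.0, 1.0], 5)
town_income = 56_000 - 100 * (ages[:-1] - 20) + scatter

def slope(x, income):
    """OLS slope of income on age (model with intercept)."""
    return np.polyfit(x, income, deg=1)[0]

slope_town = slope(ages[:-1], town_income)
slope_ceo_50k = slope(ages, np.append(town_income, 50_000))
slope_ceo_10m = slope(ages, np.append(town_income, 10_000_000))

print(f"CEO leverage: {ceo_leverage:.3f} (cutoff 3p/n = {3 * p / n:.3f})")
print(f"slope without CEO:      {slope_town:12.1f}")
print(f"slope with CEO at $50k: {slope_ceo_50k:12.1f}")
print(f"slope with CEO at $10M: {slope_ceo_10m:12.1f}")
```

In this setup Person X's leverage exceeds the 3p/n cutoff in both scenarios, since leverage ignores income. The slope barely moves when the income is $50k and changes by several orders of magnitude when it is $10M.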

5. Assumptions


6. Limitations

Pitfalls

  1. Good vs Bad Leverage: Don't delete points just because they have high leverage! If they follow the trend, they are valuable data points that increase precision. Only remove them if they are errors or come from a fundamentally different population.
  2. The "Masking" Effect: Two high-leverage points close to each other can mask each other's influence. Deleting either one alone barely changes the fit because the other still pulls the line, so single-point diagnostics can miss both.
  3. Data Entry Errors: High leverage often flags typos (e.g., Age=950 instead of 95). Always check the source data.


7. Leverage vs. Influence

High leverage gives a point the potential for high influence but does not guarantee it: a point becomes influential when its leverage is combined with a sizable residual.

See Cook's Distance for the combined measure of Influence.
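
For reference, the standard form of Cook's distance shows how leverage and residual combine. The symbols $e_i$ (the residual of observation $i$) and $s^2$ (the residual mean square) are introduced here for illustration and are not defined elsewhere in this article; $p$ is the number of fitted parameters.

```latex
D_i = \frac{e_i^{2}}{p\, s^{2}} \cdot \frac{h_{ii}}{(1 - h_{ii})^{2}}
```

The second factor grows quickly as $h_{ii}$ approaches 1, but the product stays small when $e_i$ is close to zero, which is the "good leverage" case.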

Interpretation Guide

Leverage value      | Interpretation              | Action / Meaning
$h_{ii} > 2p/n$     | High leverage               | Investigate. Check for data entry errors.
$h_{ii} > 3p/n$     | Extremely high leverage     | Danger zone. Check Cook's Distance to see if the point is influential.
$h_{ii} = 1/n$      | Minimum possible leverage   | Observation sits exactly at the mean of the predictors (model with intercept).
$h_{ii} = 1.0$      | Maximum possible leverage   | A parameter is determined solely by this point (one degree of freedom used up).

8. Python Implementation Example

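import numpy as np
import statsmodels.api as sm

# Illustrative data (an assumption, not a real dataset); replace with your own.
rng = np.random.default_rng(0)
age = rng.uniform(20, 60, size=20)
y = 30_000 + 1_000 * age + rng.normal(0, 5_000, size=20)
X = sm.add_constant(age)  # sm.OLS does not add an intercept automatically

# Fit the model
model = sm.OLS(y, X).fit()

# Leverage values: the diagonal of the hat matrix
influence = model.get_influence()
leverage = influence.hat_matrix_diag

# Rule-of-thumb threshold 2p/n, where p counts the intercept
p = len(model.params)
n = int(model.nobs)
threshold = 2 * p / n

# Indices of observations whose leverage exceeds the threshold
high_leverage_points = np.where(leverage > threshold)[0]
print(f"High Leverage Indices: {high_leverage_points}")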