Cook's Distance

Definition

Core Statement

Cook's Distance (Cook's D) is a measure of the influence of a single data point on a regression model. It estimates how much the model's coefficients would change if that specific observation were removed.


Purpose

  1. Identify Influential Points: Distinguish between harmless outliers and data points that drastically distort the model.
  2. Clean Data: Decide whether to keep, remove, or investigate specific observations.

When to Use

Use Cook's D When...

  • You have fitted a Linear Regression or GLM.
  • You suspect outliers might be dominating the results.
  • Diagnostic plots show points that sit far apart from the rest of the data.

Alternatives

  • Leverage: measures how extreme a point's X values are, but not whether the point actually changes the fit.
  • Studentized Residuals: measure how far Y falls from its prediction, but not the point's influence on the coefficients.
  • Cook's D combines both into a single influence measure.


Theoretical Background

The Formula

D_i = [ Σ_{j=1}^{n} ( Ŷ_j − Ŷ_{j(i)} )² ] / ( p · MSE )

where Ŷ_j is the j-th fitted value from the full model, Ŷ_{j(i)} is the j-th fitted value after deleting observation i, p is the number of model parameters, and MSE is the mean squared error of the full fit.

Intuition: D_i is large if removing point i causes the fitted values (Ŷ) to move a lot.

Thresholds

Common rules of thumb (neither is a formal test):

  • D_i > 4/n: flag the point for inspection.
  • D_i > 1: the point is almost certainly influential.

Worked Numerical Example

The Billionaire in the Neighborhood

Scenario: Predicting Home Value from Sq Ft.
Data: 20 homes.

The Outlier: An average-sized historic home with a price fit for a king.

  • Sq Ft: 2,000 (Average).
  • Price: $10,000,000 (100x average).

Influence:

  • OLS with Outlier: Slope = $2,000/sqft. (Distorted upwards).
  • OLS without Outlier: Slope = $200/sqft.

Cook's D Result:

  • D_mansion = 8.5 (huge!).
  • Action: this single point is driving the entire fit. Remove it or fit a robust model.

Assumptions

  • Cook's D is defined for models fit by least squares (linear regression); analogous deletion diagnostics exist for GLMs.
  • The MSE in its denominator comes from the full fit, so a single gross outlier inflates it and can shrink every D_i.

Limitations

Pitfalls

  1. Masking Effect: Two outliers near each other can "hide" each other's influence.
  2. Swamping: A cluster of good points can make a valid extreme point look like an outlier.
  3. Automatic Removal: A high Cook's D does not mean "Delete this data". It means "Investigate this data". It might be a data entry error, or it might be the most interesting discovery (e.g., a new phenomenon).


Python Implementation

import statsmodels.api as sm
import numpy as np
import matplotlib.pyplot as plt

# Fit model (X must already contain a constant column, e.g. via sm.add_constant)
model = sm.OLS(y, X).fit()

# Calculate Cook's Distance
influence = model.get_influence()
cooks_d = influence.cooks_distance[0]  # [0] is the distances, [1] is p-values

# Plot
plt.stem(np.arange(len(cooks_d)), cooks_d, markerfmt=",")
plt.title("Cook's Distance Plot")
plt.xlabel("Observation Index")
plt.ylabel("Cook's Distance")
plt.show()

# Identify influential points (rule-of-thumb cutoff: 4/n)
n = len(X)
threshold = 4 / n
influential_points = np.where(cooks_d > threshold)[0]
print(f"Influential Indices: {influential_points}")

R Implementation

# Fit Model
model <- lm(Price ~ SqFt, data = houses)

# Cook's Distance
cooks <- cooks.distance(model)

# Plot
plot(model, which = 4)  # plot 4 is Cook's Distance
abline(h = 4 / nrow(houses), col = "red")

# Extract Indices
which(cooks > 4/nrow(houses))

Interpretation Guide

| Output | Interpretation | Action |
| --- | --- | --- |
| D_i = 0.01 | Negligible influence. | Keep. |
| D_i = 0.8 | Moderate influence. | Inspect: is it a valid data point? |
| D_i = 5.2 | Massive influence. | Critical: the model is unstable. Re-run without this point to see the impact. |