Cook's Distance
Definition
Core Statement
Cook's Distance (Cook's D) is a measure of the influence of a single data point on a regression model. It estimates how much the model's coefficients would change if that specific observation were removed.
Purpose
- Identify Influential Points: Distinguish between harmless outliers and data points that drastically distort the model.
- Clean Data: Decide whether to keep, remove, or investigate specific observations.
When to Use
Use Cook's D When...
- You have fitted a Linear Regression or GLM.
- You suspect outliers might be dominating the results.
- Diagnostic plots show points far removed from the rest of the data.
Alternatives
- Leverage: Measures how extreme $x_i$ is, but not whether it actually affects the fit.
- Studentized Residuals: Measure how far $y_i$ falls from its prediction, but not its influence on the coefficients.
- Cook's D combines both.
Theoretical Background
The Formula
$$D_i = \frac{\sum_{j=1}^{n} \left(\hat{y}_j - \hat{y}_{j(i)}\right)^2}{p \cdot s^2}$$

- $\hat{y}_j$: Prediction for observation $j$ using all data.
- $\hat{y}_{j(i)}$: Prediction for observation $j$ using all data except observation $i$.
- $p$: Number of parameters.
- $s^2$: Mean Squared Error.

Intuition: $D_i$ measures how much all fitted values shift when observation $i$ is deleted, scaled by the model's error variance. Equivalently, $D_i = \frac{r_i^2}{p} \cdot \frac{h_{ii}}{1 - h_{ii}}$, which combines the studentized residual $r_i$ with the leverage $h_{ii}$.
Thresholds
- Conservative: $D_i > 1$ indicates a highly influential point.
- Sensitive: $D_i > 4/n$ is a common cutoff for "worthy of investigation."
Worked Numerical Example
The Billionaire in the Neighborhood
Scenario: Predicting Home Value from Sq Ft.
Data: 20 homes.
The Outlier: A historic mansion fit for a king — average in size, extreme in price.
- Sq Ft: 2,000 (Average).
- Price: $10,000,000 (100x average).
Influence:
- OLS with Outlier: Slope = $2,000/sqft (distorted upwards).
- OLS without Outlier: Slope = $200/sqft.
Cook's D Result:
- $D_i \gg 4/n$ (huge!).
- Action: This single point is completely reshaping the model. Remove it or fit a robust model.
Assumptions & Limitations
Pitfalls
- Masking Effect: Two outliers near each other can "hide" each other's influence.
- Swamping: Genuine outliers can distort the fit enough that valid points get flagged as outliers.
- Automatic Removal: A high Cook's D does not mean "delete this data"; it means "investigate this data". It might be a data entry error, or it might be the most interesting discovery (e.g., a new phenomenon).
Python Implementation
```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Fit model (X should already include an intercept column, e.g. via sm.add_constant)
model = sm.OLS(y, X).fit()

# Calculate Cook's Distance
influence = model.get_influence()
cooks_d = influence.cooks_distance[0]  # [0] = distances, [1] = p-values

# Plot
plt.stem(np.arange(len(cooks_d)), cooks_d, markerfmt=",")
plt.title("Cook's Distance Plot")
plt.xlabel("Observation Index")
plt.ylabel("Cook's Distance")
plt.show()

# Identify influential points using the sensitive 4/n cutoff
n = len(X)
threshold = 4 / n
influential_points = np.where(cooks_d > threshold)[0]
print(f"Influential Indices: {influential_points}")
```
R Implementation
```r
# Fit model
model <- lm(Price ~ SqFt, data = houses)

# Cook's Distance
cooks <- cooks.distance(model)

# Plot (diagnostic plot 4 is Cook's Distance)
plot(model, which = 4)
abline(h = 4 / nrow(houses), col = "red")

# Extract indices above the 4/n cutoff
which(cooks > 4 / nrow(houses))
```
Interpretation Guide
| Output | Interpretation | Action |
|---|---|---|
| $D_i < 4/n$ | Negligible influence. | Keep. |
| $4/n < D_i < 1$ | Moderate influence. | Inspect: is it a valid data point? |
| $D_i > 1$ | Massive influence. | Critical: the model is unstable. Re-run the model without this point to see the impact. |
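The cutoffs in the table can be wrapped in a small helper (a hypothetical `triage` function, using the $4/n$ and $1$ thresholds above):

```python
import numpy as np

def triage(cooks_d, n):
    """Bucket observations by the keep / inspect / critical thresholds."""
    labels = np.full(len(cooks_d), "keep", dtype=object)
    labels[cooks_d > 4 / n] = "inspect"   # moderate influence
    labels[cooks_d > 1] = "critical"      # massive influence
    return labels

d = np.array([0.01, 0.08, 1.7])
print(triage(d, n=100))  # n=100 -> 4/n = 0.04
```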
Related Concepts
- Residual Analysis
- Leverage (Hat Matrix) - Potential for influence.
- Robust Regression - Alternative that downweights outliers (e.g., RANSAC, Huber).
- VIF (Variance Inflation Factor)
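To illustrate the robust-regression alternative mentioned above, a hedged sketch using scikit-learn's HuberRegressor (assumed available), which downweights the outlier that would drag an OLS fit:

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

x = np.linspace(0, 5, 50).reshape(-1, 1)
rng = np.random.default_rng(7)
y = 2 * x.ravel() + rng.normal(0, 0.3, size=50)  # true slope = 2
y[-1] = 100.0  # one gross outlier at the highest-leverage point

ols = LinearRegression().fit(x, y)
huber = HuberRegressor().fit(x, y)
print(f"OLS slope:   {ols.coef_[0]:.2f}")   # pulled far above the true slope
print(f"Huber slope: {huber.coef_[0]:.2f}")  # stays close to 2
```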