Cook's Distance

Definition

Core Statement

Cook's Distance (Cook's D) is a measure of the influence of a single data point on a regression model. It estimates how much the model's coefficients would change if that specific observation were removed.


Purpose

  1. Identify Influential Points: Distinguish between harmless outliers and data points that drastically distort the model.
  2. Clean Data: Decide whether to keep, remove, or investigate specific observations.

When to Use

Use Cook's D When...

  • You have fitted a Linear Regression or GLM.
  • You suspect outliers might be dominating the results.
  • Diagnostic plots show points that sit far apart from the rest of the data.

Alternatives

  • Leverage: measures how extreme a point's X values are, but not whether the point actually changes the fit.
  • Studentized Residuals: measure how far Y falls from its prediction, but not the point's influence on the coefficients.
  • Cook's D combines both into a single influence measure.


Theoretical Background

The Formula

D_i = [ Σ_{j=1}^{n} ( Ŷ_j − Ŷ_{j(i)} )² ] / ( p · MSE )

where Ŷ_j is the j-th fitted value from the full model, Ŷ_{j(i)} is the j-th fitted value after deleting observation i, p is the number of model parameters, and MSE is the mean squared error of the full fit.

Intuition: D_i is large if removing point i causes the fitted values (Ŷ) to move a lot.

Thresholds

Common rules of thumb (neither is a formal test):

  • D_i > 4/n: flag the point for inspection.
  • D_i > 1: the point is almost certainly influential.

Worked Numerical Example

The Billionaire in the Neighborhood

Scenario: Predicting Home Value from Sq Ft.
Data: 20 homes.

The Outlier: An average-sized historic home with a price fit for a king.

  • Sq Ft: 2,000 (Average).
  • Price: $10,000,000 (100x average).

Influence:

  • OLS with Outlier: Slope = $2,000/sqft. (Distorted upwards).
  • OLS without Outlier: Slope = $200/sqft.

Cook's D Result:

  • D_mansion = 8.5 (huge!).
  • Action: this single point is driving the entire fit. Remove it or fit a robust model.

Assumptions

  • Cook's D is defined for models fit by least squares (linear regression); analogous deletion diagnostics exist for GLMs.
  • The MSE in its denominator comes from the full fit, so a single gross outlier inflates it and can shrink every D_i.

Limitations

Pitfalls

  1. Masking Effect: Two outliers near each other can "hide" each other's influence.
  2. Swamping: A cluster of good points can make a valid extreme point look like an outlier.
  3. Automatic Removal: A high Cook's D does not mean "Delete this data". It means "Investigate this data". It might be a data entry error, or it might be the most interesting discovery (e.g., a new phenomenon).


Python Implementation

import statsmodels.api as sm
import numpy as np
import matplotlib.pyplot as plt

# Fit model (X must already contain a constant column, e.g. via sm.add_constant)
model = sm.OLS(y, X).fit()

# Calculate Cook's Distance
influence = model.get_influence()
cooks_d = influence.cooks_distance[0]  # [0] is the distances, [1] is p-values

# Plot
plt.stem(np.arange(len(cooks_d)), cooks_d, markerfmt=",")
plt.title("Cook's Distance Plot")
plt.xlabel("Observation Index")
plt.ylabel("Cook's Distance")
plt.show()

# Identify influential points (rule-of-thumb cutoff: 4/n)
n = len(X)
threshold = 4 / n
influential_points = np.where(cooks_d > threshold)[0]
print(f"Influential Indices: {influential_points}")

R Implementation

# Fit Model
model <- lm(Price ~ SqFt, data = houses)

# Cook's Distance
cooks <- cooks.distance(model)

# Plot
plot(model, which = 4)  # plot 4 is Cook's Distance
abline(h = 4 / nrow(houses), col = "red")

# Extract Indices
which(cooks > 4/nrow(houses))

Interpretation Guide

| Output | Interpretation | Action |
| --- | --- | --- |
| D_i = 0.01 | Negligible influence. | Keep. |
| D_i = 0.8 | Moderate influence. | Inspect: is it a valid data point? |
| D_i = 5.2 | Massive influence. | Critical: the model is unstable. Re-run without this point to see the impact. |