Pearson Correlation

Definition

Core Statement

The Pearson Correlation Coefficient (r) measures the strength and direction of the linear relationship between two continuous variables. It ranges from -1 (a perfect negative linear relationship) to +1 (a perfect positive linear relationship), with 0 indicating no linear relationship.


Purpose

  1. Quantify the degree to which two variables move together linearly.
  2. Serve as a preliminary step before Simple Linear Regression.
  3. Identify potential multicollinearity issues in Multiple Linear Regression (see the sketch below).
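
As a minimal sketch of that multicollinearity check (the predictor columns here are invented for illustration; `np.corrcoef` returns the pairwise Pearson matrix):

```python
import numpy as np

# Hypothetical predictor columns (illustration only)
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.1, 3.9, 6.2, 8.1, 9.8])  # roughly 2 * x1
x3 = np.array([5.0, 3.0, 8.0, 1.0, 4.0])

# Off-diagonal |r| near 1 flags potential multicollinearity
print(np.corrcoef([x1, x2, x3]).round(2))
```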

When to Use

Use Pearson When...

  • Both variables are continuous.
  • The relationship is linear.
  • Data is approximately bivariate normal.
  • There are no significant outliers.

Alternatives

  • Spearman's rank correlation: monotonic but non-linear relationships, ordinal data, or data with outliers.
  • Kendall's tau: small samples or data with many tied ranks.
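
A minimal sketch contrasting the two on monotonic but non-linear toy data (y = x³): Spearman reports a perfect monotonic association, while Pearson does not:

```python
from scipy import stats
import numpy as np

x = np.arange(1, 11)
y = x ** 3  # monotonic, but strongly non-linear

r_pearson, _ = stats.pearsonr(x, y)   # < 1: penalized for non-linearity
rho, _ = stats.spearmanr(x, y)        # = 1: the ranks move in lockstep

print(f"Pearson r:    {r_pearson:.3f}")
print(f"Spearman rho: {rho:.3f}")
```
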
Theoretical Background

Formula

$$r = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_i (X_i - \bar{X})^2}\,\sqrt{\sum_i (Y_i - \bar{Y})^2}} = \frac{\operatorname{Cov}(X, Y)}{\sigma_X \, \sigma_Y}$$
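
A direct NumPy translation of this definition (a verification sketch; in practice use `scipy.stats.pearsonr` as shown later):

```python
import numpy as np

def pearson_r(x, y):
    """Pearson r computed straight from the definitional formula."""
    dx = x - x.mean()
    dy = y - y.mean()
    return np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))

x = np.array([1, 2, 3, 4, 5, 6, 7])
y = np.array([2, 4, 5, 4, 5, 7, 8])
print(pearson_r(x, y))  # matches scipy.stats.pearsonr(x, y)[0]
```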

Interpretation

| r Value     | Interpretation          |
|-------------|-------------------------|
| 1.00        | Perfect positive linear |
| 0.70 - 0.99 | Strong positive         |
| 0.40 - 0.69 | Moderate positive       |
| 0.10 - 0.39 | Weak positive           |
| 0.00        | No linear relationship  |
| Negative    | Mirror of the above     |

Coefficient of Determination (r²)

r² represents the proportion of variance in Y explained by X.
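
A quick numerical check (a sketch on the same toy arrays used in the Python section below): squaring Pearson r reproduces the R² of a one-predictor least-squares fit.

```python
from scipy import stats
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7])
y = np.array([2, 4, 5, 4, 5, 7, 8])

r, _ = stats.pearsonr(x, y)

# R^2 of a simple linear regression on the same data
slope, intercept = np.polyfit(x, y, 1)
y_hat = slope * x + intercept
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
print(f"r^2 = {r**2:.3f}, regression R^2 = {1 - ss_res / ss_tot:.3f}")
```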


Assumptions

  1. Both variables are continuous (interval or ratio scale).
  2. Linearity: the relationship between X and Y is linear.
  3. Bivariate normality (required for valid p-values and confidence intervals).
  4. Homoscedasticity: the spread of Y is roughly constant across the range of X.
  5. Independence: observations are independent (X, Y) pairs.
  6. No significant outliers.
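
A minimal diagnostic sketch (assuming the toy arrays from the Python section; Shapiro-Wilk per variable is a rough stand-in for bivariate normality, and linearity/outliers are best judged from a scatter plot):

```python
from scipy import stats
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7])
y = np.array([2, 4, 5, 4, 5, 7, 8])

# Rough normality check for each variable (low power at small n)
for name, v in (("x", x), ("y", y)):
    w, p = stats.shapiro(v)
    print(f"Shapiro-Wilk {name}: W = {w:.3f}, p = {p:.3f}")
```
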
Limitations

Pitfalls

  1. Correlation ≠ Causation: A strong r does not imply X causes Y. Confounders may exist.
  2. Only measures LINEAR relationships: A perfect U-shaped curve gives r ≈ 0 (see the sketch after this list).
  3. Sensitive to outliers: One extreme point can dramatically change r (also demonstrated below).
  4. Range Restriction: Calculating r on a truncated range (e.g., only high performers) underestimates the true relationship.
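
A toy-data sketch of pitfalls 2 and 3:

```python
from scipy import stats
import numpy as np

# Pitfall 2: a perfect U-shape, yet r is (essentially) 0
x = np.linspace(-5, 5, 11)
print(f"U-shape r: {stats.pearsonr(x, x**2)[0]:.3f}")

# Pitfall 3: one extreme point manufactures a correlation from noise
rng = np.random.default_rng(42)
x_noise = rng.normal(size=20)
y_noise = rng.normal(size=20)
print(f"noise r:        {stats.pearsonr(x_noise, y_noise)[0]:.3f}")

x_out = np.append(x_noise, 10.0)
y_out = np.append(y_noise, 10.0)
print(f"with outlier r: {stats.pearsonr(x_out, y_out)[0]:.3f}")  # jumps toward 1
```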


Python Implementation

from scipy import stats
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7])
y = np.array([2, 4, 5, 4, 5, 7, 8])

# Pearson Correlation
r, p_val = stats.pearsonr(x, y)

print(f"Pearson r: {r:.3f}")
print(f"p-value: {p_val:.4f}")
print(f"R-squared: {r**2:.3f}")

# Confidence interval via Fisher z-transformation:
# arctanh(r) is approximately normal with standard error 1/sqrt(n - 3)
n = len(x)
z = np.arctanh(r)
se = 1 / np.sqrt(n - 3)
z_crit = stats.norm.ppf(0.975)
ci_low = np.tanh(z - z_crit * se)
ci_high = np.tanh(z + z_crit * se)
print(f"95% CI: [{ci_low:.3f}, {ci_high:.3f}]")

R Implementation

# Data (same toy values as the Python example)
x <- c(1, 2, 3, 4, 5, 6, 7)
y <- c(2, 4, 5, 4, 5, 7, 8)

# Simple Correlation
cor(x, y)

# Correlation Test (with p-value and CI)
cor.test(x, y, method = "pearson")

# Correlation Matrix
cor(df[, c("var1", "var2", "var3")])

Worked Numerical Example

Study Time vs Grades

Data: 10 students (first and last shown below).

  • Student A: 1 hour, Grade 60
  • Student J: 10 hours, Grade 95

Result: r = 0.88, p < 0.001.
Interpretation: Strong positive relationship. More study time is strongly associated with higher grades.
R²: 0.88² ≈ 0.77. Study time explains 77% of the variation in grades.
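
The r = 0.88 above comes from the full (unshown) dataset; as a hedged sketch of the same workflow, with hypothetical hours and grades standing in for the ten students:

```python
from scipy import stats
import numpy as np

hours  = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])           # hypothetical
grades = np.array([60, 65, 62, 70, 68, 75, 72, 80, 85, 95])  # hypothetical

r, p = stats.pearsonr(hours, grades)
print(f"r = {r:.2f}, p = {p:.4g}, r^2 = {r**2:.2f}")
```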


Interpretation Guide

| Output | Interpretation | Edge Case Notes |
|--------|----------------|-----------------|
| r = 0.85, p < 0.001 | Strong positive linear relationship. | |
| r = 0.02, p = 0.8 | No linear relationship. | Check the plot: could be U-shaped (non-linear)! |
| r = 0.9, but an outlier exists | The outlier may be driving the correlation. | Remove it and check whether r drops (e.g., to 0.2). |
| r = -0.7 | Strong negative relationship: as X goes up, Y goes down. | |

Common Pitfall Example

The Correlation ≠ Causation Classic

Scenario: Ice cream sales vs. shark attacks (r = 0.95).

Wrong Conclusion: "Eating ice cream causes shark attacks." (Or sharks cause ice cream cravings).

Reality: Confounding Variable: Temperature / Summer.

  • Hot weather → More ice cream.
  • Hot weather → More people swimming → More shark attacks.

Lesson: r only measures association. Causality requires experiment or causal inference methods.
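
A simulation sketch of the same lesson (all numbers invented): a partial correlation that controls for temperature makes the spurious association vanish.

```python
from scipy import stats
import numpy as np

rng = np.random.default_rng(0)
temp = rng.normal(25, 5, size=200)                    # the confounder
ice_cream = 10 * temp + rng.normal(0, 20, size=200)   # driven by temp
sharks = 0.5 * temp + rng.normal(0, 2, size=200)      # also driven by temp

print(f"raw r: {stats.pearsonr(ice_cream, sharks)[0]:.2f}")  # spuriously large

# Partial correlation: correlate the residuals left after
# regressing each variable on the confounder
def residuals(v, z):
    slope, intercept = np.polyfit(z, v, 1)
    return v - (slope * z + intercept)

r_partial = stats.pearsonr(residuals(ice_cream, temp),
                           residuals(sharks, temp))[0]
print(f"partial r (controlling for temp): {r_partial:.2f}")  # ~ 0
```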