Pearson Correlation
Definition
Pearson Correlation Coefficient (r): measures the strength and direction of the linear relationship between two continuous variables, ranging from -1 (perfect negative) to +1 (perfect positive).
Purpose
- Quantify the degree to which two variables move together linearly.
- Serve as a preliminary step before Simple Linear Regression.
- Identify potential multicollinearity issues in Multiple Linear Regression.
When to Use
- Both variables are continuous.
- The relationship is linear.
- Data is approximately bivariate normal.
- There are no significant outliers.
When Not to Use (Alternatives)
- Non-linear relationship: Consider Spearman's Rank Correlation.
- Ordinal data: Use Spearman or Kendall's Tau.
- Outliers present: Use Spearman (rank-based, more robust). See the comparison sketch below.
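As a quick comparison, the sketch below (not from the source; the data values are made up) runs Pearson, Spearman, and Kendall on a monotonic but non-linear dataset with one extreme value:

```python
import numpy as np
from scipy import stats

# Hypothetical data: a monotonic but non-linear relationship with one extreme point.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([1, 2, 4, 8, 16, 32, 64, 128, 256, 5000], dtype=float)

pearson_r, _ = stats.pearsonr(x, y)      # sensitive to the non-linearity / extreme value
spearman_rho, _ = stats.spearmanr(x, y)  # rank-based, captures the monotonic trend
kendall_tau, _ = stats.kendalltau(x, y)  # rank-based, often preferred for small samples

print(f"Pearson r:    {pearson_r:.3f}")
print(f"Spearman rho: {spearman_rho:.3f}")  # 1.000: the ranks agree perfectly
print(f"Kendall tau:  {kendall_tau:.3f}")   # 1.000: the ranks agree perfectly
```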
Theoretical Background
Formula

$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}$$

Equivalently, $r = \dfrac{\operatorname{cov}(X, Y)}{s_X s_Y}$: the covariance of X and Y divided by the product of their standard deviations.
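To make the formula concrete, here is a minimal sketch (not from the source) that computes r directly from the definition and checks it against scipy.stats.pearsonr:

```python
import numpy as np
from scipy import stats

def pearson_manual(x, y):
    """Compute Pearson's r directly from the definitional formula."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    dx, dy = x - x.mean(), y - y.mean()   # deviations from the means
    return np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))

x = [1, 2, 3, 4, 5, 6, 7]
y = [2, 4, 5, 4, 5, 7, 8]

print(pearson_manual(x, y))      # manual calculation
print(stats.pearsonr(x, y)[0])   # same value from SciPy
```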
Interpretation
| r value | Interpretation |
|---|---|
| 1.00 | Perfect Positive Linear |
| 0.70 - 0.99 | Strong Positive |
| 0.40 - 0.69 | Moderate Positive |
| 0.10 - 0.39 | Weak Positive |
| 0.00 | No Linear Relationship |
| Negative | Mirror of above |
Coefficient of Determination (r²)
Squaring r gives the proportion of variance shared by the two variables. Example: r = 0.70 → r² = 0.49: 49% of variance is shared.
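A one-line check of that arithmetic (illustrative only):

```python
r = 0.70
print(f"r = {r:.2f}, r^2 = {r**2:.2f}")  # 0.49 -> 49% of variance is shared
```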
Assumptions
- Both variables are measured on an interval or ratio (continuous) scale.
- The relationship between the variables is linear.
- The data are approximately bivariate normal (needed for valid p-values and confidence intervals).
- There are no extreme outliers.
- Observations are independent of each other.
Limitations
- Correlation ≠ Causation: A strong r does not imply that X causes Y. Confounders may exist.
- Only measures LINEAR relationships: A perfect U-shaped curve gives r ≈ 0.
- Sensitive to outliers: One extreme point can dramatically change r.
- Range Restriction: Calculating r on a truncated range (e.g., only high performers) underestimates the true relationship. (The sketch below demonstrates the U-shape and outlier issues.)
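The sketch below (illustrative, with made-up data) demonstrates the U-shape and outlier limitations numerically:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# 1) A perfect U-shape: the variables are strongly related, yet Pearson r is ~0.
x = np.linspace(-3, 3, 50)
y_u = x**2
print(f"U-shape:          r = {stats.pearsonr(x, y_u)[0]:.3f}")   # ~0.000

# 2) Outlier sensitivity: pure noise plus one extreme point.
x_noise = rng.normal(size=30)
y_noise = rng.normal(size=30)
print(f"Noise only:       r = {stats.pearsonr(x_noise, y_noise)[0]:.3f}")  # near 0
x_out = np.append(x_noise, 10.0)   # one extreme point added to both variables
y_out = np.append(y_noise, 10.0)
print(f"Noise + outlier:  r = {stats.pearsonr(x_out, y_out)[0]:.3f}")      # jumps toward 1
```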
Python Implementation
```python
from scipy import stats
import numpy as np
x = np.array([1, 2, 3, 4, 5, 6, 7])
y = np.array([2, 4, 5, 4, 5, 7, 8])
# Pearson Correlation
r, p_val = stats.pearsonr(x, y)
print(f"Pearson r: {r:.3f}")
print(f"p-value: {p_val:.4f}")
print(f"R-squared: {r**2:.3f}")
# Confidence Interval (Fisher z-transformation)
n = len(x)
z = np.arctanh(r)
se = 1 / np.sqrt(n - 3)
z_crit = stats.norm.ppf(0.975)
ci_low = np.tanh(z - z_crit * se)
ci_high = np.tanh(z + z_crit * se)
print(f"95% CI: [{ci_low:.3f}, {ci_high:.3f}]")
R Implementation
```r
# Simple Correlation
cor(x, y)
# Correlation Test (with p-value and CI)
cor.test(x, y, method = "pearson")
# Correlation Matrix
cor(df[, c("var1", "var2", "var3")])
```
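For a Python-side equivalent of the correlation-matrix call, a pandas sketch (var1, var2, var3 are placeholder column names as in the R snippet, and the DataFrame here is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame with three numeric columns.
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["var1", "var2", "var3"])

# Pairwise Pearson correlation matrix (method="pearson" is the default).
print(df[["var1", "var2", "var3"]].corr(method="pearson"))
```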
Worked Numerical Example
Data: 10 students, hours studied vs. exam grade (only the first and last students are listed).
- Student A: 1 hour, Grade 60
- Student J: 10 hours, Grade 95
Result: a strong positive r (the 0.70-0.99 band in the table above).
Interpretation: Strong positive relationship. More study time is strongly associated with higher grades.
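A runnable version of this example; only Students A and J are given above, so the eight intermediate pairs below are hypothetical placeholders chosen for illustration:

```python
import numpy as np
from scipy import stats

# Hours studied and exam grades for 10 students.
# Student A (1 h, 60) and Student J (10 h, 95) come from the example above;
# the values in between are hypothetical.
hours  = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
grades = np.array([60, 65, 62, 70, 75, 78, 82, 85, 90, 95])

r, p = stats.pearsonr(hours, grades)
print(f"r = {r:.3f}, p = {p:.4f}, r^2 = {r**2:.3f}")
```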
Interpretation Guide
| Output | Interpretation | Edge Case Notes |
|---|---|---|
| r close to +1 | Strong positive linear relationship. | |
| r ≈ 0 | No linear relationship. | Check the plot: could be U-shaped (non-linear)! |
| High r driven by one extreme point | Outlier driving the correlation? | Remove the outlier and check whether r drops. |
| r close to -1 | Strong negative relationship. | As X goes up, Y goes down. |
Common Pitfall Example
Scenario: Ice cream sales vs. shark attacks show a strong positive correlation.
Wrong Conclusion: "Eating ice cream causes shark attacks." (Or sharks cause ice cream cravings).
Reality: Confounding Variable: Temperature / Summer.
- Hot weather → more ice cream sales.
- Hot weather → more people swimming → more shark attacks.
Lesson: Correlation does not imply causation; check for confounding variables before drawing causal conclusions.
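To make the confounding story concrete, here is a small simulation sketch (all numbers made up): temperature drives both series, producing a strong spurious correlation that largely disappears once temperature is controlled for via residuals (a simple partial correlation):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 365

# Temperature drives both ice cream sales and shark attacks (made-up units).
temperature = rng.normal(25, 5, n)
ice_cream   = 10 * temperature + rng.normal(0, 20, n)
attacks     = 0.5 * temperature + rng.normal(0, 2, n)

print(f"Ice cream vs attacks:        r = {stats.pearsonr(ice_cream, attacks)[0]:.2f}")

# Control for temperature: correlate the residuals after regressing each
# variable on temperature.
resid_ice = ice_cream - np.polyval(np.polyfit(temperature, ice_cream, 1), temperature)
resid_att = attacks   - np.polyval(np.polyfit(temperature, attacks, 1), temperature)
print(f"Controlling for temperature: r = {stats.pearsonr(resid_ice, resid_att)[0]:.2f}")
```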
Related Concepts
- Spearman's Rank Correlation - Non-parametric alternative.
- Kendall's Tau - For small samples, ordinal data.
- Simple Linear Regression - Models the relationship.
- Correlation vs Causation