Correlation vs Causation

Definition

Core Statement

Correlation means two variables are statistically associated: they tend to change together. Causation means that changing one variable produces a change in another, whether directly or through intermediate steps. The critical insight: correlation does NOT imply causation.


Purpose

  1. Prevent spurious conclusions from observational data.
  2. Understand when causal claims are justified vs when they are not.
  3. Guide the design of experiments to establish causality.

When to Use

Correlation is Appropriate When...

  • Describing patterns in data.
  • Exploring potential relationships for further investigation.
  • Prediction (even without causation).
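
A minimal sketch of the last point, assuming a hypothetical model in which temperature drives both ice cream sales and drowning deaths. The coefficients, the `simulate` helper, and the `intervene` flag are illustrative assumptions, not estimates from real data. A line fitted on observational data predicts new observational data reasonably well, but the same line fails once ice cream sales are set independently of temperature.

```python
import numpy as np

rng = np.random.default_rng(7)


def simulate(n, rng, intervene=False):
    """Hypothetical world: temperature drives both variables."""
    temperature = rng.uniform(60, 100, n)
    if intervene:
        # Sales are set by fiat, independent of temperature.
        ice_cream = rng.uniform(130, 210, n)
    else:
        ice_cream = 10 + 2 * temperature + rng.normal(0, 10, n)
    drowning = -5 + 0.5 * temperature + rng.normal(0, 5, n)
    return ice_cream, drowning


def r_squared(y, y_hat):
    """Share of variance in y explained by the predictions y_hat."""
    return 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)


# Fit a straight line on observational data.
x_train, y_train = simulate(2000, rng)
slope, intercept = np.polyfit(x_train, y_train, 1)

# Fresh observational data: the correlation still carries predictive value.
x_test, y_test = simulate(2000, rng)
r2_observational = r_squared(y_test, slope * x_test + intercept)

# After intervening on ice cream sales, the same line no longer predicts.
x_int, y_int = simulate(2000, rng, intervene=True)
r2_intervention = r_squared(y_int, slope * x_int + intercept)

print(f"R^2 on observational data:   {r2_observational:.2f}")
print(f"R^2 after an intervention:   {r2_intervention:.2f}")
```

The contrast is the practical difference between the two uses: a correlation can support prediction as long as the data-generating process stays the same, but it says nothing reliable about what happens when one variable is manipulated.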

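Causation Requires...

  • Temporal precedence: the cause occurs before the effect.
  • Ruling out confounders (third variables that drive both).
  • Ruling out reverse causation.
  • Ideally, experimental or quasi-experimental evidence.
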

Theoretical Background

Bradford Hill Criteria for Causation

Criterion | Description
Strength | Strong associations are more likely to be causal.
Consistency | The association is observed repeatedly in different settings.
Specificity | The effect is specific to the exposure.
Temporality | The cause must precede the effect (essential).
Dose-Response | Larger exposure leads to a larger effect.
Biological Plausibility | A plausible mechanism links exposure to effect.
Experimental Evidence | Randomized trials support causation.

Common Confounders

Third Variable Problem: X and Y are correlated because both are caused by Z.

Ice Cream and Drowning

  • Correlation: Ice cream sales and drowning deaths are positively correlated.
  • Confounder: Temperature (hot weather causes both).
  • Conclusion: Ice cream does NOT cause drowning.

Reverse Causation

X may correlate with Y because Y causes X, not the other way around.

Exercise and Health

  • Naive Claim: Exercise causes good health.
  • Reverse: Healthy people are more able to exercise.
  • Reality: Likely bidirectional.
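
A short sketch of why correlation alone cannot settle direction: the coefficient is symmetric in its two arguments. The data-generating process here (health driving exercise, with made-up coefficients) is purely an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000

# Assumed process: health drives exercise, not the reverse.
health = rng.normal(0.0, 1.0, n)
exercise = 0.8 * health + rng.normal(0.0, 0.6, n)

# The correlation coefficient is identical whichever variable comes first.
r_exercise_health = np.corrcoef(exercise, health)[0, 1]
r_health_exercise = np.corrcoef(health, exercise)[0, 1]

print(f"corr(exercise, health) = {r_exercise_health:.3f}")
print(f"corr(health, exercise) = {r_health_exercise:.3f}")
```

An analyst who saw only these two columns would get the same number under either causal story, so direction has to come from temporal ordering or design, not from r.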


Establishing Causation

Method | Strength | Limitation
Randomized Controlled Trial (RCT) | Gold standard. Random assignment breaks confounding. | Expensive, ethically limited.
Instrumental Variables (IV) | Identifies a causal effect from a natural experiment. | Finding valid instruments is hard.
Propensity Score Matching (PSM) | Balances observed confounders. | Cannot control unobserved confounders.
Difference-in-Differences (DiD) | Removes time-invariant confounders. | Assumes parallel trends.
Longitudinal Studies | Temporal ordering clarifies direction. | Still vulnerable to confounding.
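
To illustrate the first row, here is a hedged simulation sketch of how random assignment breaks confounding. The true effect of 2.0, the confounder `z`, and the helper names are all assumptions chosen for the example; nothing here comes from a real study.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000
true_effect = 2.0  # assumed causal effect of treatment on the outcome

# Unobserved confounder that raises both treatment uptake and the outcome.
z = rng.normal(0.0, 1.0, n)


def outcome(treated, z, rng):
    """Outcome depends on treatment, the confounder, and noise."""
    return true_effect * treated + 3.0 * z + rng.normal(0.0, 1.0, len(z))


def diff_in_means(y, treated):
    """Naive effect estimate: treated mean minus control mean."""
    return y[treated == 1].mean() - y[treated == 0].mean()


# Observational setting: units with high z are more likely to be treated.
p_treat = 1.0 / (1.0 + np.exp(-2.0 * z))
t_obs = rng.binomial(1, p_treat)
y_obs = outcome(t_obs, z, rng)

# Randomized setting: a coin flip, independent of z.
t_rct = rng.binomial(1, 0.5, n)
y_rct = outcome(t_rct, z, rng)

naive_estimate = diff_in_means(y_obs, t_obs)
rct_estimate = diff_in_means(y_rct, t_rct)

print(f"True effect:            {true_effect:.2f}")
print(f"Observational estimate: {naive_estimate:.2f}")
print(f"Randomized estimate:    {rct_estimate:.2f}")
```

In the observational arm the difference in means absorbs the effect of `z` and overstates the treatment effect; under randomization `z` is balanced across groups on average, so the same estimator lands close to the assumed true effect.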

Limitations

Pitfalls

  1. "Correlation = 0" does NOT mean "No Relationship." Non-linear relationships (e.g., U-shaped) can have zero Pearson correlation.
  2. Strong correlation can still be spurious. A high r without a plausible mechanism is not, by itself, evidence of causation.
  3. Causal language is often misused. Media reports often say "causes" when the evidence only shows "is associated with."
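
The first pitfall can be checked with a small sketch. The grid of x values and the choice of y = x² are illustrative assumptions; any relationship that is symmetric around the mean of x behaves the same way.

```python
import numpy as np

# x is symmetric around zero; y is completely determined by x.
x = np.linspace(-3.0, 3.0, 201)
y = x ** 2

r = np.corrcoef(x, y)[0, 1]
print(f"Pearson r for y = x^2: {r:.3f}")
```

Pearson's r measures only linear association, so the rising and falling halves of the U cancel out even though y depends entirely on x. Plotting the data, or using a measure sensitive to non-linear dependence, would reveal the relationship.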


Python Implementation

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import pearsonr

# Simulate a spurious correlation driven by a shared cause
np.random.seed(42)
n = 100

# Confounder: temperature
temperature = np.random.uniform(60, 100, n)

# Ice cream sales (caused by temperature)
ice_cream = 10 + 2 * temperature + np.random.normal(0, 10, n)

# Drowning deaths (caused by temperature)
drowning = -5 + 0.5 * temperature + np.random.normal(0, 5, n)

# Correlation between ice cream sales and drowning deaths
r, p = pearsonr(ice_cream, drowning)
print(f"Correlation (Ice Cream, Drowning): r = {r:.2f}, p = {p:.4f}")

# Visualization
plt.scatter(ice_cream, drowning, alpha=0.6)
plt.xlabel("Ice Cream Sales")
plt.ylabel("Drowning Deaths")
plt.title(f"Spurious Correlation (r = {r:.2f})\nConfounder: Temperature")
plt.show()
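
The simulation above shows only the raw correlation. As a follow-up sketch, one way to check that temperature accounts for it is a partial correlation: regress each variable on temperature and correlate the residuals. The `residualize` helper is a hypothetical name introduced here, and the data are regenerated with NumPy's `default_rng`, so the exact numbers will differ from the run above.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100

# Same assumed model as the simulation above.
temperature = rng.uniform(60, 100, n)
ice_cream = 10 + 2 * temperature + rng.normal(0, 10, n)
drowning = -5 + 0.5 * temperature + rng.normal(0, 5, n)


def residualize(y, x):
    """Return what is left of y after removing its least-squares linear fit on x."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)


raw_r = np.corrcoef(ice_cream, drowning)[0, 1]

# Partial correlation: correlate the parts not explained by temperature.
partial_r = np.corrcoef(
    residualize(ice_cream, temperature),
    residualize(drowning, temperature),
)[0, 1]

print(f"Raw correlation:               r = {raw_r:.2f}")
print(f"Controlling for temperature:   r = {partial_r:.2f}")
```

Because the simulated link runs entirely through temperature, the partial correlation should sit near zero up to sampling noise. This adjustment only works for confounders that are measured; an unobserved confounder cannot be removed this way.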

R Implementation

set.seed(42)
n <- 100

# Confounder
temperature <- runif(n, 60, 100)

# Ice cream ~ temperature
ice_cream <- 10 + 2 * temperature + rnorm(n, 0, 10)

# Drowning ~ temperature
drowning <- -5 + 0.5 * temperature + rnorm(n, 0, 5)

# Correlation
cor.test(ice_cream, drowning)

# Plot
plot(ice_cream, drowning, main = "Spurious Correlation",
     xlab = "Ice Cream Sales", ylab = "Drowning Deaths")

Interpretation Guide

Statement | Valid?
"Smoking is correlated with lung cancer." | ✅ Valid (descriptive).
"Smoking causes lung cancer." | ✅ Valid (supported by consistent cohort studies, dose-response evidence, a biological mechanism, and the Bradford Hill criteria; randomizing people to smoke would be unethical).
"Ice cream causes drowning." | ❌ Invalid (spurious correlation, confounded by temperature).
"Higher education correlates with higher income." | ✅ Valid as a correlation (a causal reading must address confounders such as ability and family background).