P-Hacking

Definition

Core Statement

P-Hacking (also known as Data Dredging or Significance Chasing) is the misuse of data analysis to find patterns that can be presented as statistically significant (p < 0.05) even though no real underlying effect exists. It dramatically inflates the rate of false positives (Type I errors).


Purpose

  1. Awareness: Understand why many published scientific results are false (a key driver of the "Reproducibility Crisis").
  2. Prevention: Learn ethical data science practices (e.g., Pre-registration).
  3. Critical Reading: Spot suspicious results in papers (e.g., p-values clustered just below 0.05).

Common Tactics

  1. Peeking (Optional Stopping): Re-testing the data as it comes in and stopping collection the moment p < 0.05. Depending on how often you peek, this can inflate the Type I error rate roughly 4-5x (simulated in the sketch after this list).
  2. Sub-Group Analysis: "The drug didn't work overall, but let's test Men vs Women, Old vs Young, Left-handed vs Right-handed..." until something pops up.
  3. Variable Fishing: Collecting 100 metrics and reporting the 1 that "worked" (ignoring the 99 that failed).
  4. Metric Flipping: "We planned to measure Accuracy, but Recall looks better, so let's report that."
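
To see how much tactic 1 alone distorts results, here is a minimal simulation sketch (the batch size, maximum sample size, and number of runs are arbitrary illustrative choices): a true null hypothesis is re-tested after every new batch of data, and collection stops as soon as p < 0.05.

import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(42)

def peeking_experiment(max_n=100, batch=10, alpha=0.05):
    # Add `batch` samples per group at a time, re-test after every batch,
    # and stop the moment p < alpha (tactic 1). H0 is true throughout.
    group_a, group_b = [], []
    while len(group_a) < max_n:
        group_a.extend(rng.normal(0, 1, batch))
        group_b.extend(rng.normal(0, 1, batch))
        _, p = stats.ttest_ind(group_a, group_b)
        if p < alpha:
            return True   # declared "significant" -- a false positive
    return False

n_runs = 2000
false_positives = sum(peeking_experiment() for _ in range(n_runs))
print(f"Type I error with peeking: {false_positives / n_runs:.1%}")
# Far above the nominal 5%, even though there is never a real effect.

Testing only once, at the full sample size, would keep the error rate at the nominal 5%; the inflation comes entirely from the repeated looks.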

Conceptual Example: The Jelly Bean Study

The xkcd "Significant" Experiment (xkcd #882)

H0: Jelly beans do not cause acne.

  1. Test overall: p>0.05. (Fail to reject).
  2. Hack: Test every color:
    • Purple? No. (p=0.12)
    • Brown? No. (p=0.45)
    • ...
    • Green? Yes! (p<0.05).

Report: "Green Jelly Beans Linked to Acne (95% Confidence)!"
Reality: You ran 20 tests. With α = 0.05 per test, the probability of at least one false positive purely by chance is 1 - 0.95^20 ≈ 64%.
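
Assuming the 20 colour tests are independent (a simplification), the standard back-of-the-envelope calculation behind that figure is:

alpha, m = 0.05, 20
print(f"P(at least one false positive) = {1 - (1 - alpha) ** m:.3f}")   # ~0.642, i.e. about 64%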


Prevention

Best Practices

  1. Pre-registration: Decide your hypothesis, sample size, and metrics before looking at the data.
  2. Bonferroni Correction: If testing 20 hypotheses, divide the significance threshold by 20 (α = 0.05/20 = 0.0025), so each individual comparison must clear a much stricter bar (see the sketch after this list).
  3. Hold-out Set: Find patterns in the training set, then verify them on a held-out test set. If the pattern was hacked out of noise, it won't generalize (a replication sketch follows the Python Simulation below).
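
A minimal sketch of practice 2, re-using the jelly bean setup (the group sizes, the seed, and the assumption of exactly zero effect are illustrative): 20 colour tests under a true null, counted once with the naive threshold and once with the Bonferroni-corrected one.

import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)
alpha, m = 0.05, 20   # 20 jelly bean colours, none of which really matters

p_values = []
for _ in range(m):
    acne_with = rng.normal(0, 1, 30)      # fake acne scores for eaters of this colour
    acne_without = rng.normal(0, 1, 30)   # fake acne scores for non-eaters
    _, p = stats.ttest_ind(acne_with, acne_without)
    p_values.append(p)

naive = sum(p < alpha for p in p_values)
bonferroni = sum(p < alpha / m for p in p_values)
print(f"Naive threshold (p < 0.05):        {naive} 'significant' colours")
print(f"Bonferroni threshold (p < 0.0025): {bonferroni} 'significant' colours")

Across repeated runs the naive threshold flags a spurious colour fairly often; the corrected threshold almost never does.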


Python Simulation

import numpy as np
import scipy.stats as stats

# Simulating 1000 "useless" experiments (True Null)
# We expect ~50 to be significant by chance (5%)

p_values = []
for _ in range(1000):
    # Two random groups, no real difference
    group_a = np.random.normal(0, 1, 30)
    group_b = np.random.normal(0, 1, 30)
    
    _, p = stats.ttest_ind(group_a, group_b)
    p_values.append(p)

significant_count = sum(p < 0.05 for p in p_values)
print(f"False Positives: {significant_count} / 1000 ({significant_count / 1000:.1%})")
# Usually around 5% -- exactly the alpha we set, even though no experiment had a real effect.

# Now imagine reporting ONLY those 50... that is P-Hacking.
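
Finally, a sketch of practice 3 (the hold-out set) applied to tactic 3 (variable fishing); the 100 noise variables and the split sizes are arbitrary assumptions. The "best" variable found on the training split almost never survives a single, pre-specified re-test on the held-out split.

import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(1)

# 100 candidate variables, all pure noise, measured on a training and a hold-out split
outcome_train = rng.normal(0, 1, 200)
outcome_test = rng.normal(0, 1, 200)
features_train = rng.normal(0, 1, (200, 100))
features_test = rng.normal(0, 1, (200, 100))

# "Discovery": fish for the variable most correlated with the outcome in the training data
train_p = [stats.pearsonr(features_train[:, j], outcome_train)[1] for j in range(100)]
best = int(np.argmin(train_p))
print(f"Best variable on the training split: #{best}, p = {train_p[best]:.4f}")   # often < 0.05

# Verification: the same single variable, tested once on fresh data
_, p_holdout = stats.pearsonr(features_test[:, best], outcome_test)
print(f"Same variable on the hold-out split: p = {p_holdout:.4f}")   # usually > 0.05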