Hypothesis Testing (P-Value & CI)

Definition

Core Statement

Hypothesis Testing is a statistical method used to make inferences about a population based on sample data. It involves formulating a Null Hypothesis (H0), collecting data, computing a test statistic, and deciding whether the evidence is strong enough to reject H0 in favor of an Alternative Hypothesis (H1).


Purpose

  1. To determine whether observed data is consistent with a specific claim (e.g., "The drug has no effect").
  2. To provide a framework for making decisions under uncertainty.
  3. To quantify the strength of evidence against H0 via the p-value.

When to Use

Use Hypothesis Testing When...

  • You have a specific claim to test (e.g., "The mean is 50").
  • You want to decide between two competing hypotheses.
  • You need a standardized framework for scientific inquiry.

Limitations of NHST

  • It does not tell you the probability that H0 is true.
  • A significant p-value does not imply a large or important effect.
  • Over-reliance can lead to "p-hacking" and reproducibility issues.


Theoretical Background

The Hypotheses

| Hypothesis | Symbol | Description |
|---|---|---|
| Null Hypothesis | H0 | The default assumption; typically "no effect" or "no difference." |
| Alternative Hypothesis | H1 (or Ha) | The claim we are testing for; "there is an effect." |

The P-Value

Critical Definition

The p-value is the probability of obtaining a test statistic at least as extreme as the one observed, assuming H0 is true.

P-value = P(data at least as extreme as observed | H0 is true)
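
To make this conditional probability concrete, here is a minimal simulation sketch for the coin example used later in this note (60 heads in 100 tosses, H0: p = 0.5). The number of simulations and variable names are illustrative.

import numpy as np

rng = np.random.default_rng(42)

n_tosses, observed_heads, p_null = 100, 60, 0.5
expected_heads = n_tosses * p_null            # 50 heads expected under H0

# Simulate many experiments in a world where H0 is true
sims = rng.binomial(n=n_tosses, p=p_null, size=100_000)

# Two-sided p-value: fraction of simulated results at least as far
# from the expectation as the observed result
observed_dev = abs(observed_heads - expected_heads)
p_value = np.mean(np.abs(sims - expected_heads) >= observed_dev)

print(f"Simulated two-sided p-value: {p_value:.4f}")  # close to the exact binomial test's ~0.057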

Interpretation: a small p-value means the observed data would be unlikely if H0 were true, which is evidence against H0. A large p-value means the data are compatible with H0; it is not evidence that H0 is true.

Significance Level (α)

The pre-defined threshold for rejecting H0. Commonly α=0.05 (5%).

Confidence Intervals (CI)

A Confidence Interval is a range of plausible values for the population parameter, computed from the sample. The confidence level describes the procedure, not a single interval: if we repeated the sampling many times, about 95% of the 95% CIs constructed this way would contain the true parameter.
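
A minimal simulation sketch of this coverage property, assuming a normal-approximation CI for a sample mean; the true mean, standard deviation, and sample size are illustrative.

import numpy as np

rng = np.random.default_rng(0)
true_mean, sigma, n = 50.0, 10.0, 40
z = 1.96                                   # approx. 97.5th percentile of the standard normal
n_experiments = 10_000

covered = 0
for _ in range(n_experiments):
    sample = rng.normal(true_mean, sigma, size=n)
    se = sample.std(ddof=1) / np.sqrt(n)
    lo, hi = sample.mean() - z * se, sample.mean() + z * se
    covered += (lo <= true_mean <= hi)     # does this interval contain the true mean?

print(f"Fraction of 95% CIs containing the true mean: {covered / n_experiments:.3f}")  # close to 0.95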

CI and P-Value Relationship

  • If the 95% CI for a difference contains 0, the result is not significant at α=0.05.
  • If the 95% CI for an Odds Ratio contains 1, the result is not significant (see the sketch below).
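
This correspondence holds when the interval and the test are built from the same statistic. A minimal sketch using a one-sample t-test on illustrative data; the same logic applies to a two-sample difference in means.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(0.4, 1.0, size=30)              # hypothetical sample

res = stats.ttest_1samp(x, popmean=0.0)        # H0: population mean is 0

# 95% CI for the mean, built from the same t distribution as the test
t_crit = stats.t.ppf(0.975, df=len(x) - 1)
se = x.std(ddof=1) / np.sqrt(len(x))
ci_low, ci_high = x.mean() - t_crit * se, x.mean() + t_crit * se

print(f"p-value: {res.pvalue:.4f}, 95% CI: [{ci_low:.3f}, {ci_high:.3f}]")
# The decisions always agree: p < 0.05 exactly when the CI excludes 0
print((res.pvalue < 0.05) == (not (ci_low <= 0.0 <= ci_high)))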


Decision Errors

| | H0 is True | H0 is False |
|---|---|---|
| Reject H0 | Type I Error (α): False Positive | Correct Decision: True Positive (Power = 1 − β) |
| Fail to Reject H0 | Correct Decision: True Negative | Type II Error (β): False Negative |
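
A simulation sketch that estimates both quantities empirically with one-sample t-tests; the sample size, effect size, and number of simulations are illustrative. Under a true H0 the rejection rate approaches α, and under a specific alternative it approaches the power.

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n, alpha, n_sims = 30, 0.05, 5_000

# Type I error: H0 (true mean = 0) really is true, so every rejection is a false positive
type1 = sum(
    stats.ttest_1samp(rng.normal(0.0, 1.0, n), 0.0).pvalue < alpha
    for _ in range(n_sims)
) / n_sims

# Power: H0 is false (true mean is 0.5 SD away from the null value)
power = sum(
    stats.ttest_1samp(rng.normal(0.5, 1.0, n), 0.0).pvalue < alpha
    for _ in range(n_sims)
) / n_sims

print(f"Estimated Type I error rate: {type1:.3f}")  # close to alpha = 0.05
print(f"Estimated power (1 - beta):  {power:.3f}")  # roughly 0.75 in this scenario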

Assumptions

The exact assumptions depend on the test used, but classical tests typically require a random (representative) sample, independent observations, and a correctly specified sampling distribution for the test statistic (e.g., approximate normality for t-tests).

Limitations

Common Pitfalls

  1. The p-value is NOT P(H0|Data). It is P(Data|H0). Confusing the two is the Prosecutor's Fallacy.
  2. Statistical Significance ≠ Practical Significance. A p-value of 0.001 for a tiny effect (e.g., 0.001 kg of weight loss) is practically meaningless. Always report Effect Size Measures.
  3. Multiple Comparisons Problem: running 20 independent tests at α=0.05 on true null hypotheses produces about 1 false positive on average. Use the Bonferroni Correction or FDR control (see the sketch after this list).
  4. Dichotomization: treating p=0.049 as "significant" and p=0.051 as "not significant" ignores the inherent uncertainty; the two results carry nearly identical evidence.
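
To illustrate pitfall 3, the sketch below runs 20 t-tests on pure noise (H0 is true in every test, so any rejection is a false positive) and then applies a Bonferroni correction. The sample size and seed are illustrative.

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
alpha, n_tests = 0.05, 20

# 20 tests in which H0 is true every time
pvals = np.array([
    stats.ttest_1samp(rng.normal(0.0, 1.0, 30), 0.0).pvalue
    for _ in range(n_tests)
])

# Uncorrected: expect about n_tests * alpha = 1 false positive per family of 20 tests
print("Uncorrected rejections:", int(np.sum(pvals < alpha)))

# Bonferroni correction: compare each p-value to alpha / n_tests
print("Bonferroni rejections: ", int(np.sum(pvals < alpha / n_tests)))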


Python Implementation

from scipy import stats

# Scenario: We test if a coin is biased (H0: p = 0.5)
# Observed: 60 heads in 100 tosses.

# Binomial Test
result = stats.binomtest(k=60, n=100, p=0.5, alternative='two-sided')

print(f"P-value: {result.pvalue:.4f}")
print(f"95% CI for p: {result.proportion_ci(confidence_level=0.95)}")

if result.pvalue < 0.05:
    print("Reject H0: The coin is likely biased.")
else:
    print("Fail to Reject H0: Could be fair.")

R Implementation

# Scenario: 60 Heads, 100 Tosses, H0: p = 0.5
test_res <- binom.test(x = 60, n = 100, p = 0.5)

print(test_res)

# Output:
# - p-value
# - 95% Confidence Interval
# - Sample estimate of p

# Interpretation
if(test_res$p.value < 0.05) {
  cat("Reject H0: Coin is biased.\n")
} else {
  cat("Fail to Reject H0: Coin may be fair.\n")
}

Interpretation Guide

| Scenario | Interpretation |
|---|---|
| p = 0.03 | Evidence against H0 at the 5% level. Reject H0. |
| p = 0.15 | Not enough evidence against H0. Fail to reject. |
| 95% CI = [1.2, 3.5] for an OR | Significant (does not contain 1); the OR is plausibly between 1.2 and 3.5. |
| 95% CI = [-0.5, 0.8] for a difference | Not significant (contains 0). |