Hypothesis Testing (P-Value & CI)

Definition

Core Statement

Hypothesis Testing is a statistical method used to make inferences about a population based on sample data. It involves formulating a Null Hypothesis (H0), collecting data, computing a test statistic, and deciding whether the evidence is strong enough to reject H0 in favor of an Alternative Hypothesis (H1).


Purpose

  1. To determine whether observed data is consistent with a specific claim (e.g., "The drug has no effect").
  2. To provide a framework for making decisions under uncertainty.
  3. To quantify the strength of evidence against H0 via the p-value.

When to Use

Use Hypothesis Testing When...

  • You have a specific claim to test (e.g., "The mean is 50").
  • You want to decide between two competing hypotheses.
  • You need a standardized framework for scientific inquiry.

Limitations of NHST

  • It does not tell you the probability that H0 is true.
  • A significant p-value does not imply a large or important effect.
  • Over-reliance can lead to "p-hacking" and reproducibility issues.


Theoretical Background

The Hypotheses

| Hypothesis | Symbol | Description |
|---|---|---|
| Null Hypothesis | H0 | The default assumption; typically "no effect" or "no difference." |
| Alternative Hypothesis | H1 (or Ha) | The claim we are testing for; "there is an effect." |

The P-Value

Critical Definition

The p-value is the probability of obtaining a test statistic at least as extreme as the one observed, assuming H0 is true.

P-value = P(data at least as extreme as observed | H0 is true)
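
To make this conditional probability concrete, here is a minimal simulation sketch for the coin example used later in this note (60 heads in 100 tosses, H0: p = 0.5). The number of simulations and variable names are illustrative.

import numpy as np

rng = np.random.default_rng(42)

n_tosses, observed_heads, p_null = 100, 60, 0.5
expected_heads = n_tosses * p_null            # 50 heads expected under H0

# Simulate many experiments in a world where H0 is true
sims = rng.binomial(n=n_tosses, p=p_null, size=100_000)

# Two-sided p-value: fraction of simulated results at least as far
# from the expectation as the observed result
observed_dev = abs(observed_heads - expected_heads)
p_value = np.mean(np.abs(sims - expected_heads) >= observed_dev)

print(f"Simulated two-sided p-value: {p_value:.4f}")  # close to the exact binomial test's ~0.057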

Interpretation: a small p-value means the observed data would be unlikely if H0 were true, which is evidence against H0. A large p-value means the data are compatible with H0; it is not evidence that H0 is true.

Significance Level (α)

The pre-defined threshold for rejecting H0. Commonly α=0.05 (5%).

Confidence Intervals (CI)

A Confidence Interval is a range of plausible values for the population parameter, computed from the sample. The confidence level describes the procedure, not a single interval: if we repeated the sampling many times, about 95% of the 95% CIs constructed this way would contain the true parameter.
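
A minimal simulation sketch of this coverage property, assuming a normal-approximation CI for a sample mean; the true mean, standard deviation, and sample size are illustrative.

import numpy as np

rng = np.random.default_rng(0)
true_mean, sigma, n = 50.0, 10.0, 40
z = 1.96                                   # approx. 97.5th percentile of the standard normal
n_experiments = 10_000

covered = 0
for _ in range(n_experiments):
    sample = rng.normal(true_mean, sigma, size=n)
    se = sample.std(ddof=1) / np.sqrt(n)
    lo, hi = sample.mean() - z * se, sample.mean() + z * se
    covered += (lo <= true_mean <= hi)     # does this interval contain the true mean?

print(f"Fraction of 95% CIs containing the true mean: {covered / n_experiments:.3f}")  # close to 0.95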

CI and P-Value Relationship

  • If the 95% CI for a difference contains 0, the result is not significant at α=0.05.
  • If the 95% CI for an Odds Ratio contains 1, the result is not significant (see the sketch below).
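
This correspondence holds when the interval and the test are built from the same statistic. A minimal sketch using a one-sample t-test on illustrative data; the same logic applies to a two-sample difference in means.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(0.4, 1.0, size=30)              # hypothetical sample

res = stats.ttest_1samp(x, popmean=0.0)        # H0: population mean is 0

# 95% CI for the mean, built from the same t distribution as the test
t_crit = stats.t.ppf(0.975, df=len(x) - 1)
se = x.std(ddof=1) / np.sqrt(len(x))
ci_low, ci_high = x.mean() - t_crit * se, x.mean() + t_crit * se

print(f"p-value: {res.pvalue:.4f}, 95% CI: [{ci_low:.3f}, {ci_high:.3f}]")
# The decisions always agree: p < 0.05 exactly when the CI excludes 0
print((res.pvalue < 0.05) == (not (ci_low <= 0.0 <= ci_high)))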


Decision Errors

| | H0 is True | H0 is False |
|---|---|---|
| Reject H0 | Type I Error (α): False Positive | Correct Decision: True Positive (Power = 1 − β) |
| Fail to Reject H0 | Correct Decision: True Negative | Type II Error (β): False Negative |
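
A simulation sketch that estimates both quantities empirically with one-sample t-tests; the sample size, effect size, and number of simulations are illustrative. Under a true H0 the rejection rate approaches α, and under a specific alternative it approaches the power.

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n, alpha, n_sims = 30, 0.05, 5_000

# Type I error: H0 (true mean = 0) really is true, so every rejection is a false positive
type1 = sum(
    stats.ttest_1samp(rng.normal(0.0, 1.0, n), 0.0).pvalue < alpha
    for _ in range(n_sims)
) / n_sims

# Power: H0 is false (true mean is 0.5 SD away from the null value)
power = sum(
    stats.ttest_1samp(rng.normal(0.5, 1.0, n), 0.0).pvalue < alpha
    for _ in range(n_sims)
) / n_sims

print(f"Estimated Type I error rate: {type1:.3f}")  # close to alpha = 0.05
print(f"Estimated power (1 - beta):  {power:.3f}")  # roughly 0.75 in this scenario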

Assumptions

The exact assumptions depend on the test used, but classical tests typically require a random (representative) sample, independent observations, and a correctly specified sampling distribution for the test statistic (e.g., approximate normality for t-tests).

Limitations

Common Pitfalls

  1. The p-value is NOT P(H0|Data). It is P(Data|H0). Confusing the two is the Prosecutor's Fallacy.
  2. Statistical Significance ≠ Practical Significance. A p-value of 0.001 for a tiny effect (e.g., 0.001 kg of weight loss) is practically meaningless. Always report Effect Size Measures.
  3. Multiple Comparisons Problem: running 20 independent tests at α=0.05 on true null hypotheses produces about 1 false positive on average. Use the Bonferroni Correction or FDR control (see the sketch after this list).
  4. Dichotomization: treating p=0.049 as "significant" and p=0.051 as "not significant" ignores the inherent uncertainty; the two results carry nearly identical evidence.
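
To illustrate pitfall 3, the sketch below runs 20 t-tests on pure noise (H0 is true in every test, so any rejection is a false positive) and then applies a Bonferroni correction. The sample size and seed are illustrative.

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
alpha, n_tests = 0.05, 20

# 20 tests in which H0 is true every time
pvals = np.array([
    stats.ttest_1samp(rng.normal(0.0, 1.0, 30), 0.0).pvalue
    for _ in range(n_tests)
])

# Uncorrected: expect about n_tests * alpha = 1 false positive per family of 20 tests
print("Uncorrected rejections:", int(np.sum(pvals < alpha)))

# Bonferroni correction: compare each p-value to alpha / n_tests
print("Bonferroni rejections: ", int(np.sum(pvals < alpha / n_tests)))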


Python Implementation

from scipy import stats

# Scenario: We test if a coin is biased (H0: p = 0.5)
# Observed: 60 heads in 100 tosses.

# Binomial Test
result = stats.binomtest(k=60, n=100, p=0.5, alternative='two-sided')

print(f"P-value: {result.pvalue:.4f}")
print(f"95% CI for p: {result.proportion_ci(confidence_level=0.95)}")

if result.pvalue < 0.05:
    print("Reject H0: The coin is likely biased.")
else:
    print("Fail to Reject H0: Could be fair.")

R Implementation

# Scenario: 60 Heads, 100 Tosses, H0: p = 0.5
test_res <- binom.test(x = 60, n = 100, p = 0.5)

print(test_res)

# Output:
# - p-value
# - 95% Confidence Interval
# - Sample estimate of p

# Interpretation
if(test_res$p.value < 0.05) {
  cat("Reject H0: Coin is biased.\n")
} else {
  cat("Fail to Reject H0: Coin may be fair.\n")
}

Interpretation Guide

| Scenario | Interpretation |
|---|---|
| p = 0.03 | Evidence against H0 at the 5% level. Reject H0. |
| p = 0.15 | Not enough evidence against H0. Fail to reject. |
| 95% CI = [1.2, 3.5] for an OR | Significant (does not contain 1); the OR is plausibly between 1.2 and 3.5. |
| 95% CI = [-0.5, 0.8] for a difference | Not significant (contains 0). |