Chi-Square Distribution
Definition
The Chi-Square Distribution ($\chi^2$) is a continuous probability distribution that arises as the sum of squares of independent standard normal random variables.
Purpose
- Test hypotheses about population variance.
- Test independence and goodness-of-fit for categorical data.
- Form the basis of the Chi-Square Test of Independence and goodness-of-fit tests.
- Related to the F-Distribution in ANOVA.
When to Use
- Chi-Square Test of Independence - Testing association between categorical variables.
- Goodness-of-Fit Test - Does data fit a theoretical distribution?
- Variance Test - Testing if sample variance equals a hypothesized value.
- Heteroscedasticity Tests (Breusch-Pagan Test, White Test).
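For instance, the independence use case can be sketched with scipy (the 2×2 contingency table below is hypothetical, e.g. treatment vs. outcome):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows = treatment groups, cols = outcomes
observed = np.array([[30, 10],
                     [20, 40]])

# chi2_contingency computes expected counts under independence
chi2_stat, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2_stat:.3f}, p = {p_value:.4f}, df = {dof}")
```

A small p-value here indicates the two categorical variables are associated.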
Theoretical Background
Definition
If $Z_1, Z_2, \dots, Z_k$ are independent standard normal random variables, then $X = Z_1^2 + Z_2^2 + \dots + Z_k^2$ follows a chi-square distribution with $k$ degrees of freedom, written $X \sim \chi^2(k)$.
The distribution is determined by a single parameter: the degrees of freedom ($k$).
Properties
| Property | Value |
|---|---|
| Mean | $k$ |
| Variance | $2k$ |
| Skewness | $\sqrt{8/k}$ |
| Support | $[0, \infty)$ |
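The closed-form moments (mean $k$, variance $2k$, skewness $\sqrt{8/k}$) can be checked numerically with scipy; df = 5 is an arbitrary choice:

```python
import numpy as np
from scipy.stats import chi2

k = 5  # arbitrary degrees of freedom
mean, var, skew = chi2.stats(k, moments='mvs')

# Compare scipy's computed moments against the closed forms
print(mean, var, skew)
assert np.isclose(mean, k)
assert np.isclose(var, 2 * k)
assert np.isclose(skew, np.sqrt(8 / k))
```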
Shape Evolution
- Low df (e.g., 1-2): Extremely right-skewed, peak near 0.
- Medium df (e.g., 10): Moderately skewed.
- High df (e.g., 30+): Approximately normal due to Central Limit Theorem (CLT).
Relationship to Normal
For large $k$, the chi-square distribution is approximately normal with mean $k$ and variance $2k$.
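A quick numerical check of this approximation, comparing chi-square quantiles with those of a Normal($k$, $\sqrt{2k}$) distribution (df = 100 is an arbitrary "large" value):

```python
import numpy as np
from scipy.stats import chi2, norm

k = 100  # large df, so the normal approximation should be close
for q in [0.05, 0.5, 0.95]:
    exact = chi2.ppf(q, k)
    approx = norm.ppf(q, loc=k, scale=np.sqrt(2 * k))
    print(f"q={q}: chi2 quantile = {exact:.2f}, normal approx = {approx:.2f}")
```

The quantiles agree to within about one unit at df = 100; the gap shrinks further as df grows.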
Worked Example: Testing Manufacturing Precision
A machine is supposed to fill bags with a variance of $\sigma_0^2 = 4$. You take a sample of $n$ bags and compute the sample variance $s^2$.
Question: Is the variance significantly higher than 4? (Test at significance level $\alpha$.)
Solution:
- Hypotheses: $H_0: \sigma^2 = 4$ versus $H_1: \sigma^2 > 4$ (right-tailed).
- Test Statistic: $\chi^2 = \frac{(n-1)s^2}{\sigma_0^2}$, which follows a chi-square distribution with $n-1$ degrees of freedom under $H_0$.
- Critical Value: look up $\chi^2_{1-\alpha,\,n-1}$ (right tail) in a chi-square table.
- Decision: since the test statistic exceeds the critical value, we reject $H_0$.
Conclusion: The machine's variance is significantly higher than the standard. It needs maintenance.
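The same right-tailed variance test can be sketched in Python. Note that n, s², and α below are assumed illustration values, not figures from the example; only the hypothesized variance of 4 comes from the problem statement:

```python
from scipy.stats import chi2

sigma0_sq = 4.0   # hypothesized variance (from the problem statement)
n = 25            # assumed sample size (illustration only)
s_sq = 6.5        # assumed sample variance (illustration only)
alpha = 0.05      # assumed significance level (illustration only)

df = n - 1
test_stat = df * s_sq / sigma0_sq      # (n-1) * s^2 / sigma_0^2
critical = chi2.ppf(1 - alpha, df)     # right-tail critical value
p_value = chi2.sf(test_stat, df)       # P(chi2 >= test_stat)

print(f"chi2 = {test_stat:.2f}, critical = {critical:.2f}, p = {p_value:.4f}")
if test_stat > critical:
    print("Reject H0: variance is significantly higher than 4")
```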
Assumptions
Chi-square tests using this distribution assume:
- Observations are independent and randomly sampled.
- For the variance test: the underlying population is normally distributed.
- For categorical tests: expected counts in each cell are sufficiently large (commonly at least 5).
Limitations
- Extreme Sensitivity (Variance Test): The simple Chi-square test for variance is incredibly sensitive to non-normality (even slight skew). Use Levene's Test or Bartlett's Test instead.
- Sample Size Dependence: With huge $n$, tiny deviations become significant. Always check effect size (e.g., Cramér's V).
- Low Counts: In goodness-of-fit tests, if bins have expected counts below 5, the chi-square approximation breaks down.
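As a minimal sketch of the effect-size point, Cramér's V can be computed alongside the chi-square statistic. The contingency table below is invented specifically to show a statistically significant yet practically tiny association:

```python
import numpy as np
from scipy.stats import chi2_contingency

def cramers_v(table):
    """Cramér's V = sqrt(chi2 / (n * (min(rows, cols) - 1)))."""
    chi2_stat = chi2_contingency(table, correction=False)[0]
    n = table.sum()
    r, c = table.shape
    return np.sqrt(chi2_stat / (n * (min(r, c) - 1)))

# Invented large-sample table: n = 20,000 with a barely-there association
table = np.array([[5100, 4900],
                  [4900, 5100]])
print(f"V = {cramers_v(table):.3f}")  # significant p-value, but a tiny effect
```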
Python Implementation
from scipy.stats import chi2
import numpy as np
import matplotlib.pyplot as plt
# Define Chi-Square with df=5
df = 5
dist = chi2(df)
# Critical Value (95th percentile)
critical_value = dist.ppf(0.95)
print(f"Chi-Square Critical Value (df={df}, α=0.05): {critical_value:.3f}")
# P-value for an observed statistic
observed_stat = 11.0
p_value = dist.sf(observed_stat)  # survival function: 1 - CDF, numerically stabler in the tail
print(f"P-value for χ² = {observed_stat}: {p_value:.4f}")
# Visualize Different Degrees of Freedom
x = np.linspace(0, 30, 500)
for df in [1, 3, 5, 10, 20]:
    plt.plot(x, chi2(df).pdf(x), label=f'df={df}')
plt.xlabel('χ²')
plt.ylabel('Density')
plt.title('Chi-Square Distribution for Various df')
plt.legend()
plt.grid(alpha=0.3)
plt.show()
R Implementation
# Critical Value (95th percentile, df=5)
qchisq(0.95, df = 5)
# P-value for observed statistic
observed_stat <- 11.0
pchisq(observed_stat, df = 5, lower.tail = FALSE)
# Visualize
curve(dchisq(x, df = 1), from = 0, to = 30, col = "red", lwd = 2,
ylab = "Density", xlab = "χ²", main = "Chi-Square Distributions")
curve(dchisq(x, df = 3), add = TRUE, col = "blue", lwd = 2)
curve(dchisq(x, df = 5), add = TRUE, col = "green", lwd = 2)
curve(dchisq(x, df = 10), add = TRUE, col = "purple", lwd = 2)
legend("topright", legend = c("df=1", "df=3", "df=5", "df=10"),
col = c("red", "blue", "green", "purple"), lwd = 2)
Interpretation Guide
| Output | Interpretation |
|---|---|
| Statistic ≈ df | Close to the expected value under $H_0$; no evidence against it. |
| Statistic ≫ df | Significant deviation. Reject $H_0$. |
| Sum of Squares | Intuitively, "How much total normalized error is there?" |
| P-value ≈ 1 | Too good to be true? Check for data fraud or overfitting. |
Related Concepts
- Chi-Square Test of Independence
- Normal Distribution - $\chi^2$ is the sum of squared standard normals.
- F-Distribution - Ratio of two independent chi-square variables, each divided by its degrees of freedom.
- Degrees of Freedom
- Goodness-of-Fit Test