Chi-Square Distribution

Definition

Core Statement

The Chi-Square Distribution (χ²) is a continuous probability distribution defined as the sum of squared independent standard normal variables. It is characterized by a single parameter, its degrees of freedom, and is used extensively in hypothesis tests for variances and for categorical data.


Purpose

  1. Test hypotheses about population variance.
  2. Test independence and goodness-of-fit for categorical data.
  3. Form the basis of the Chi-Square Test of Independence and goodness-of-fit tests.
  4. Related to the F-Distribution in ANOVA.
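As a sketch of the independence test mentioned in point 3, `scipy.stats.chi2_contingency` can be applied to a contingency table; the 2×2 counts below are hypothetical, chosen only for illustration:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts: treatment (rows) vs. outcome (columns)
table = np.array([[30, 10],
                  [20, 40]])

# Returns the statistic, p-value, degrees of freedom, and expected counts
stat, p, df, expected = chi2_contingency(table)
print(f"χ² = {stat:.2f}, df = {df}, p = {p:.4f}")
```

For a 2×2 table, `chi2_contingency` applies Yates' continuity correction by default (`correction=True`).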

When to Use

The chi-square distribution appears in:

  1. Tests for a single population variance.
  2. Goodness-of-fit tests for categorical data.
  3. Tests of independence in contingency tables.
  4. Confidence intervals for a population variance.
  5. Likelihood-ratio test statistics (asymptotically).


Theoretical Background

Definition

If Z₁, Z₂, …, Zₖ are independent standard normal variables (Zᵢ ~ N(0, 1)), then:

χ² = Z₁² + Z₂² + ⋯ + Zₖ² ~ χ²(k)

The distribution is determined by a single parameter: degrees of freedom (k or df).
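The definition can be checked empirically: simulate many sums of k squared standard normals and compare the sample moments with the theoretical chi-square mean (k) and variance (2k). The sample size and seed below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)
k = 5  # degrees of freedom

# Each row: k independent N(0, 1) draws, squared and summed -> one χ²(k) draw
samples = (rng.standard_normal((100_000, k)) ** 2).sum(axis=1)

print(f"Empirical mean: {samples.mean():.3f} (theory: {k})")
print(f"Empirical variance: {samples.var():.3f} (theory: {2 * k})")
```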

Properties

Property   Value
--------   -----
Mean       k (equals the degrees of freedom)
Variance   2k
Skewness   √(8/k) (right-skewed; approaches symmetry as k increases)
Support    [0, ∞) (non-negative)
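The table's moments can be confirmed directly with `scipy.stats.chi2.stats`; k = 8 below is an arbitrary choice (for which √(8/k) = 1):

```python
import numpy as np
from scipy.stats import chi2

k = 8
mean, var, skew = chi2.stats(k, moments='mvs')  # mean, variance, skewness

print(f"Mean: {float(mean)} (theory: {k})")
print(f"Variance: {float(var)} (theory: {2 * k})")
print(f"Skewness: {float(skew):.4f} (theory: {np.sqrt(8 / k):.4f})")
```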

Shape Evolution

The shape depends strongly on df: for df = 1 or 2 the density decreases monotonically; for df > 2 it is hump-shaped with its mode at df − 2; as df grows, the distribution becomes increasingly symmetric and approximately normal.

Relationship to Normal

For large df, the standardized statistic is approximately standard normal:

(χ² − df) / √(2·df) ≈ N(0, 1)
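A quick numeric check of this approximation: compare the exact chi-square 95th percentile with the one implied by the normal approximation for a large df (df = 200 is an arbitrary choice):

```python
import numpy as np
from scipy.stats import chi2, norm

k = 200  # large degrees of freedom
exact = chi2.ppf(0.95, k)                     # exact 95th percentile
approx = k + np.sqrt(2 * k) * norm.ppf(0.95)  # via (χ² - df)/√(2·df) ≈ N(0,1)

print(f"Exact:  {exact:.2f}")
print(f"Approx: {approx:.2f}")
```

The two values agree to within about half a percent at this df.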

Worked Example: Testing Manufacturing Precision

Problem

A machine is supposed to fill bags with a variance of σ₀² = 4.
You take a sample of n = 15 bags and calculate a sample variance of s² = 7.

Question: Is the variance significantly higher than 4? (Test at α = 0.05.)

Solution:

  1. Hypotheses:

    • H₀: σ² ≤ 4
    • H₁: σ² > 4 (right-tailed)
  2. Test Statistic:

    χ² = (n − 1)s² / σ₀² = (14)(7) / 4 = 98 / 4 = 24.5
  3. Critical Value:

    • df = n − 1 = 14.
    • Look up the upper-tail critical value χ²(0.05, 14) ≈ 23.68.
  4. Decision:

    • Since 24.5 > 23.68, we reject H₀.

Conclusion: The machine's variance is significantly higher than the standard. It needs maintenance.
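The same calculation can be reproduced in a few lines with `scipy.stats.chi2` (identical numbers to the worked example):

```python
from scipy.stats import chi2

n, s2, sigma0_sq, alpha = 15, 7.0, 4.0, 0.05
df = n - 1

stat = (n - 1) * s2 / sigma0_sq     # test statistic: 24.5
critical = chi2.ppf(1 - alpha, df)  # upper-tail critical value
p_value = chi2.sf(stat, df)         # right-tail p-value

print(f"Statistic: {stat:.2f}, critical: {critical:.2f}, p-value: {p_value:.4f}")
print("Reject H0" if stat > critical else "Fail to reject H0")
```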


Assumptions

Chi-square tests using this distribution assume:

  1. Observations are independent.
  2. For the variance test: the underlying population is (approximately) normal.
  3. For categorical tests: expected counts are sufficiently large (commonly at least 5 per cell).


Limitations

Pitfalls

  1. Extreme Sensitivity (Variance Test): The simple chi-square test for variance is highly sensitive to non-normality (even slight skew inflates the Type I error rate). Prefer Levene's Test, which is robust to non-normality; note that Bartlett's Test shares the chi-square test's normality sensitivity.
  2. Sample Size Dependence: With huge N, tiny deviations become significant. Always check Effect Size (e.g., Cramer's V).
  3. Low Counts: In goodness-of-fit tests, if any bin's expected count falls below 5, the chi-square approximation breaks down; combine bins or use an exact test.
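As an illustration of pitfall 1, Levene's test (median-centered by default in `scipy.stats.levene`) remains well-behaved on skewed data where the plain variance test's normality assumption fails; the exponential samples below are synthetic:

```python
import numpy as np
from scipy.stats import levene

rng = np.random.default_rng(0)
# Two heavily skewed (exponential) samples with the same true variance
a = rng.exponential(scale=2.0, size=200)
b = rng.exponential(scale=2.0, size=200)

# Levene's test is robust to this skew; center='median' is scipy's default
stat, p = levene(a, b, center='median')
print(f"Levene W = {stat:.3f}, p = {p:.3f}")
```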


Python Implementation

from scipy.stats import chi2
import numpy as np
import matplotlib.pyplot as plt

# Define Chi-Square with df=5
df = 5
dist = chi2(df)

# Critical Value (95th percentile)
critical_value = dist.ppf(0.95)
print(f"Chi-Square Critical Value (df={df}, α=0.05): {critical_value:.3f}")

# P-value for an observed statistic
observed_stat = 11.0
p_value = dist.sf(observed_stat)  # survival function = 1 - CDF (numerically stabler)
print(f"P-value for χ² = {observed_stat}: {p_value:.4f}")

# Visualize Different Degrees of Freedom
x = np.linspace(0, 30, 500)
for k in [1, 3, 5, 10, 20]:
    plt.plot(x, chi2(k).pdf(x), label=f'df={k}')

plt.xlabel('χ²')
plt.ylabel('Density')
plt.title('Chi-Square Distribution for Various df')
plt.legend()
plt.grid(alpha=0.3)
plt.show()

R Implementation

# Critical Value (95th percentile, df=5)
qchisq(0.95, df = 5)

# P-value for observed statistic
observed_stat <- 11.0
pchisq(observed_stat, df = 5, lower.tail = FALSE)

# Visualize
curve(dchisq(x, df = 1), from = 0, to = 30, col = "red", lwd = 2,
      ylab = "Density", xlab = "χ²", main = "Chi-Square Distributions")
curve(dchisq(x, df = 3), add = TRUE, col = "blue", lwd = 2)
curve(dchisq(x, df = 5), add = TRUE, col = "green", lwd = 2)
curve(dchisq(x, df = 10), add = TRUE, col = "purple", lwd = 2)
legend("topright", legend = c("df=1", "df=3", "df=5", "df=10"),
       col = c("red", "blue", "green", "purple"), lwd = 2)

Interpretation Guide

Output           Interpretation
--------------   ------------------------------------------------------------
Value ≈ df       Expected under H₀ (e.g., χ² = 10 with df = 10 is unremarkable).
Value ≫ df       Significant deviation; reject H₀.
Sum of squares   Intuitively: "how much total normalized error is there?"
P-value ≈ 1      Too good to be true? Check for data fraud or overfitting.
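A small goodness-of-fit example ties this guide together. The die-roll counts below are invented; the resulting statistic (50/20 = 2.5) sits well below df = 5, so the result is consistent with H₀ (a fair die):

```python
from scipy.stats import chisquare

# Is a die fair? Invented observed counts from 120 rolls
observed = [18, 22, 16, 25, 19, 20]
expected = [20] * 6  # uniform expectation

stat, p = chisquare(observed, f_exp=expected)
print(f"χ² = {stat:.2f} (df = 5), p = {p:.3f}")
```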