F-Distribution

Definition

Core Statement

The F-Distribution is a continuous probability distribution that arises as the ratio of two independent chi-square random variables, each divided by its degrees of freedom. It is the foundation for ANOVA, regression F-tests, and variance ratio tests.


Purpose

  1. Test equality of variances (Levene's Test uses a related statistic).
  2. Test overall significance of regression models.
  3. Compare variance explained by groups in One-Way ANOVA (see the sketch after this list).
  4. Basis for the F-statistic in many other testing scenarios.
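
As a quick illustration of the ANOVA use case (item 3), here is a minimal sketch using scipy.stats.f_oneway; the group means, spreads, and sizes are arbitrary values chosen only for illustration.

import numpy as np
from scipy.stats import f_oneway

# Simulated weight loss for three hypothetical diet groups (illustrative data)
rng = np.random.default_rng(42)
group_a = rng.normal(loc=5.0, scale=2.0, size=10)
group_b = rng.normal(loc=6.5, scale=2.0, size=10)
group_c = rng.normal(loc=4.0, scale=2.0, size=10)

# One-Way ANOVA: returns the F-statistic and its p-value
f_stat, p_value = f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")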

When to Use

The F-distribution appears whenever a test statistic is formed as a ratio of two independent variance estimates, most commonly in:

  • One-Way (and factorial) ANOVA: comparing between-group to within-group variance.
  • Regression: the overall F-test of model significance.
  • Variance ratio tests: comparing the variances of two samples.


Theoretical Background

Definition

If $U \sim \chi^2(d_1)$ and $V \sim \chi^2(d_2)$ are independent chi-square variables, then:

$$F = \frac{U/d_1}{V/d_2} \sim F(d_1, d_2)$$

The F-distribution has two degrees of freedom parameters: $d_1$ (the numerator df) and $d_2$ (the denominator df).
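
A minimal simulation sketch of this definition ($d_1$, $d_2$, and the sample size below are arbitrary choices): draw independent chi-square variables, form the ratio, and compare an empirical quantile to scipy's F-distribution.

import numpy as np
from scipy.stats import chi2, f

rng = np.random.default_rng(0)
d1, d2, n = 4, 12, 200_000

# Independent chi-square draws, each divided by its degrees of freedom
u = chi2.rvs(d1, size=n, random_state=rng)
v = chi2.rvs(d2, size=n, random_state=rng)
f_samples = (u / d1) / (v / d2)

# Empirical vs. theoretical 95th percentile (should agree closely)
print(np.quantile(f_samples, 0.95))
print(f.ppf(0.95, d1, d2))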

Properties

Property   Value
--------   -----
Mean       $d_2 / (d_2 - 2)$, for $d_2 > 2$
Mode       $\frac{d_1 - 2}{d_1} \cdot \frac{d_2}{d_2 + 2}$, for $d_1 > 2$
Support    $[0, \infty)$ (non-negative)
Skewness   Right-skewed; approaches symmetry as $d_1, d_2 \to \infty$
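
A quick numerical check of the Mean formula against scipy for one arbitrary $(d_1, d_2)$ pair; the Mode formula is computed for reference only, since scipy does not expose a mode method.

from scipy.stats import f

d1, d2 = 5, 20
dist = f(d1, d2)

mean_formula = d2 / (d2 - 2)                      # valid for d2 > 2
mode_formula = ((d1 - 2) / d1) * (d2 / (d2 + 2))  # valid for d1 > 2

print(dist.mean(), mean_formula)  # both ≈ 1.111
print(mode_formula)               # ≈ 0.545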

Shape

The density is right-skewed with a long right tail. As $d_1$ and $d_2$ grow, the distribution concentrates around 1 and becomes more symmetric (see the plots in the implementation sections below).

Relationship to T-Distribution

$$t^2(df) = F(1, df)$$

The square of a t-statistic with df degrees of freedom is an F-statistic with (1,df) degrees of freedom.
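
A minimal numerical check of this identity using scipy quantiles; df = 27 is an arbitrary choice.

from scipy.stats import t, f

df = 27
t_crit = t.ppf(0.975, df)    # two-sided 5% t critical value
f_crit = f.ppf(0.95, 1, df)  # one-sided 5% F critical value

print(t_crit**2, f_crit)     # both ≈ 4.21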


Worked Example: Comparing Diet Plans

Problem

A researcher compares weight loss from 3 diet plans (A, B, C).

  • Between-Group Variability (Signal): Mean Square Between (MSB) = 50.
  • Within-Group Variability (Noise): Mean Square Error (MSE) = 10.
  • Degrees of Freedom: df1=2 (3 groups - 1), df2=27 (30 subjects - 3).

Question: Is there a significant difference between diets? (α=0.05)

Solution:

  1. Calculate F-Statistic:

    $F = \dfrac{\text{Signal}}{\text{Noise}} = \dfrac{MSB}{MSE} = \dfrac{50}{10} = 5.0$
  2. Critical Value:

    • Look up $F_{0.05,\,2,\,27}$.
    • Table value $\approx 3.35$.
  3. Decision:

    • Since $5.0 > 3.35$, we reject $H_0$.

Conclusion: The variability between potential diet effects is 5 times larger than the random noise. At least one diet is significantly different.

Intuition:
If $F \approx 1$, the group differences are just random noise.
If $F \gg 1$, the group differences are "real".
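
A minimal sketch reproducing the numbers in this example with scipy; the mean squares and degrees of freedom are taken directly from the problem statement above.

from scipy.stats import f

msb, mse = 50, 10     # mean square between / mean square error
df1, df2 = 2, 27      # 3 groups - 1, 30 subjects - 3

f_stat = msb / mse                 # 5.0
critical = f.ppf(0.95, df1, df2)   # ≈ 3.35
p_value = f.sf(f_stat, df1, df2)   # ≈ 0.014

print(f"F = {f_stat}, critical = {critical:.2f}, p = {p_value:.4f}")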


Assumptions

F-tests assume:

  • Independence of observations.
  • Approximate normality within each group (or of the residuals, in regression).
  • Homogeneity of variances across groups (homoscedasticity).


Limitations

Pitfalls

  1. Heteroscedasticity Trap: If group variances are unequal (e.g., one group has a huge spread), the standard F-test gives false positives. Always check with Levene's Test; if it is significant, use Welch's F (ANOVA) or heteroscedasticity-consistent standard errors (regression). See the sketch after this list.
  2. Non-Normality: F-test is somewhat robust to non-normality in large samples, but fails for skewed small samples.
  3. Post-Hoc Amnesia: A significant F only says "Something is different." It doesn't say "A > B". You MUST run post-hoc tests (Tukey's HSD) to find where the difference is.
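
A minimal sketch of the Levene check from pitfall 1, using scipy.stats.levene on made-up groups; one group is given a deliberately larger spread, and the data and 0.05 threshold are illustrative only.

import numpy as np
from scipy.stats import levene, f_oneway

rng = np.random.default_rng(1)
a = rng.normal(5, 1, 30)
b = rng.normal(5, 1, 30)
c = rng.normal(5, 4, 30)   # deliberately larger spread

# Levene's test: H0 is equal variances across groups
stat, p = levene(a, b, c)
print(f"Levene p = {p:.4f}")

if p < 0.05:
    # Unequal variances: prefer a robust alternative such as Welch's ANOVA
    # (available in e.g. statsmodels or pingouin) over the plain F-test.
    print("Unequal variances detected; the standard F-test may be unreliable.")
else:
    print(f_oneway(a, b, c))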


Python Implementation

from scipy.stats import f
import numpy as np
import matplotlib.pyplot as plt

# F-Distribution with df1=5, df2=20
df1, df2 = 5, 20
dist = f(df1, df2)

# Critical Value (95th percentile, one-tailed)
critical_value = dist.ppf(0.95)
print(f"F Critical Value (df1={df1}, df2={df2}, α=0.05): {critical_value:.3f}")

# P-value for observed F-statistic
observed_f = 3.2
p_value = dist.sf(observed_f)  # survival function = 1 - CDF
print(f"P-value for F = {observed_f}: {p_value:.4f}")

# Visualize Different df Combinations
x = np.linspace(0, 5, 500)
for (df1, df2) in [(2, 10), (5, 20), (10, 50)]:
    plt.plot(x, f(df1, df2).pdf(x), label=f'df1={df1}, df2={df2}')

plt.xlabel('F')
plt.ylabel('Density')
plt.title('F-Distribution for Various df')
plt.legend()
plt.grid(alpha=0.3)
plt.show()

R Implementation

# Critical Value (df1=5, df2=20, α=0.05)
qf(0.95, df1 = 5, df2 = 20)

# P-value for observed F-statistic
observed_f <- 3.2
pf(observed_f, df1 = 5, df2 = 20, lower.tail = FALSE)

# Visualize
curve(df(x, df1 = 2, df2 = 10), from = 0, to = 5, col = "red", lwd = 2,
      ylab = "Density", xlab = "F", main = "F-Distributions")
curve(df(x, df1 = 5, df2 = 20), add = TRUE, col = "blue", lwd = 2)
curve(df(x, df1 = 10, df2 = 50), add = TRUE, col = "green", lwd = 2)
legend("topright", 
       legend = c("(2,10)", "(5,20)", "(10,50)"),
       col = c("red", "blue", "green"), lwd = 2, title = "(df1, df2)")

Interpretation Guide

Output               Interpretation
------               --------------
F ≈ 1.0              Signal ≈ noise. No evidence of an effect.
F < 1.0              Noise exceeds signal. Possible model misspecification or insufficient data.
F > critical value   Strong signal. The groups/model explain significant variation.
P-value < 0.05       Reject $H_0$. Proceed to post-hoc tests to identify where the differences lie.