Degrees of Freedom

Definition

Core Statement

Degrees of Freedom (df) represent the number of independent pieces of information available to estimate a parameter. Conceptually, it's the number of values that are "free to vary" after certain constraints are imposed.


Purpose

  1. Determine the shape of distributions (T-Distribution, Chi-Square Distribution, F-Distribution).
  2. Adjust for model complexity in hypothesis tests and regression.
  3. Calculate unbiased estimates (e.g., sample variance divides by n − 1); see the sketch after this list.
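
As a quick illustration of point 3, here is a minimal NumPy sketch (the five values are arbitrary) of how the ddof argument moves the divisor between n and n − 1:

import numpy as np

data = np.array([10.0, 12.0, 14.0, 16.0, 18.0])

# ddof = "delta degrees of freedom": the divisor is n - ddof
print(np.var(data, ddof=0))  # 8.0  -> divides by n (population formula)
print(np.var(data, ddof=1))  # 10.0 -> divides by n - 1 (unbiased sample formula)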

When to Use

Every statistical test and estimation procedure involves degrees of freedom. Understanding df helps interpret test statistics, critical values, confidence interval widths, and p-values.


Theoretical Background

Intuition

Imagine you have 3 numbers with a known mean of 10. The first two can be anything, say 8 and 12, but the third is then forced to a single value so that the sum stays at 30. Only 2 of the 3 values are free to vary, so df = 3 − 1 = 2.
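
A tiny sketch of that constraint (8 and 12 are one arbitrary choice for the two free values):

import numpy as np

x1, x2 = 8.0, 12.0                # free to vary
x3 = 3 * 10 - x1 - x2             # forced: the sum must equal 3 * mean = 30
print(x3, np.mean([x1, x2, x3]))  # 10.0 10.0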

Why n − 1 for Sample Variance?

When calculating sample variance:

$$s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (X_i - \bar{X})^2$$

We lose 1 degree of freedom because we estimated the sample mean X̄ from the data. The deviations (Xᵢ − X̄) must sum to zero, so once any n − 1 of them are known the last is determined; only n − 1 independent pieces of information remain.
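
A small simulation makes the bias concrete; the sample size, true variance, and seed below are arbitrary choices:

import numpy as np

rng = np.random.default_rng(0)
n, true_var = 5, 4.0

# Average each estimator over many samples drawn from N(0, 4)
samples = rng.normal(0.0, np.sqrt(true_var), size=(100_000, n))
print(f"Divide by n:     {samples.var(axis=1, ddof=0).mean():.3f}")  # ~3.2 = (n-1)/n * 4
print(f"Divide by n - 1: {samples.var(axis=1, ddof=1).mean():.3f}")  # ~4.0, unbiased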

Degrees of Freedom in Common Tests

| Test | df | Explanation |
| --- | --- | --- |
| One-Sample T-Test | n − 1 | Estimate one mean; lose 1 df. |
| Two-Sample T-Test (pooled) | n₁ + n₂ − 2 | Estimate 2 means; lose 2 df. |
| Simple Linear Regression | n − 2 | Estimate slope and intercept; lose 2 df. |
| Multiple Regression | n − k − 1 | Estimate k slopes + intercept; lose k + 1 df. |
| Chi-Square (Independence) | (r − 1)(c − 1) | Constraints from row/column totals. |
| One-Way ANOVA | df_between = k − 1; df_within = N − k | k group means estimated. |
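
These formulas can be checked against library output. For instance, scipy's chi-square test of independence reports its df directly; the 2×3 table of counts below is invented for illustration:

import numpy as np
from scipy import stats

table = np.array([[20, 30, 25],
                  [30, 20, 35]])
chi2, p, dof, expected = stats.chi2_contingency(table)

print(dof)                                          # 2
print((table.shape[0] - 1) * (table.shape[1] - 1))  # (r - 1)(c - 1) = 2, matches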

Assumptions

Degrees of freedom is a mathematical concept, not an assumption. However, the df attached to a test is only correct when that test's assumptions hold: for example, Welch's two-sample t-test replaces the pooled n₁ + n₂ − 2 with a smaller, usually non-integer, approximate df when the group variances are unequal.
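
As a sketch of that Welch adjustment, the Welch–Satterthwaite df can be computed by hand; the two samples and the helper name welch_df are made up for illustration:

import numpy as np

def welch_df(a, b):
    # Welch-Satterthwaite approximate df for unequal variances
    va, vb = np.var(a, ddof=1) / len(a), np.var(b, ddof=1) / len(b)
    return (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))

a = np.array([10.0, 12.0, 14.0, 16.0, 18.0])
b = np.array([9.0, 9.5, 10.0, 10.5, 11.0, 11.5])
print(f"Welch df: {welch_df(a, b):.2f}")  # ~4.59, below the pooled n1 + n2 - 2 = 9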


Limitations

Pitfalls

  1. Low df = Low Power: With very small df (e.g., df = 3), tests have low power and wide confidence intervals (quantified in the sketch after this list).
  2. Complexity Penalty: Adding predictors "uses up" degrees of freedom, reducing residual df and potentially overfitting.
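
To put numbers on pitfall 1, compare two-sided 95% critical t values across df (a minimal scipy check):

from scipy import stats

for df in (3, 10, 30, 100):
    print(f"df = {df:>3}: t_crit = {stats.t.ppf(0.975, df):.3f}")
# df = 3 gives ~3.182 versus the Normal's 1.96, so a 95% CI at
# df = 3 is over 60% wider, all else equal.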


Python Implementation

from scipy import stats
import numpy as np

# Example: One-Sample T-Test
data = np.array([10, 12, 14, 16, 18])
n = len(data)
df = n - 1  # Degrees of Freedom

t_stat, p_val = stats.ttest_1samp(data, popmean=10)
print(f"Sample Size: {n}")
print(f"Degrees of Freedom: {df}")
print(f"T-statistic: {t_stat:.2f}")
print(f"T-critical (95%, df={df}): {stats.t.ppf(0.975, df):.2f}")

R Implementation

# Example: Regression on a small simulated data frame
# (named dat to avoid clashing with df = degrees of freedom)
set.seed(42)
dat <- data.frame(X1 = rnorm(30), X2 = rnorm(30))
dat$Y <- 2 + 0.5 * dat$X1 - 0.3 * dat$X2 + rnorm(30)

model <- lm(Y ~ X1 + X2, data = dat)

# Residual df: n - k - 1 = 30 - 2 - 1 = 27
df_residual <- df.residual(model)
cat("Residual df:", df_residual, "\n")

# Total df: n - 1
n <- nrow(dat)
cat("Total df:", n - 1, "\n")

Interpretation Guide

| Scenario | Effect of df |
| --- | --- |
| Large df (e.g., 100+) | T-distribution approaches the Normal; narrow CIs; high power. |
| Small df (e.g., 5) | T-distribution has heavy tails; wide CIs; low power. |
| More predictors in regression | Residual df shrinks; risk of overfitting. |
| Chi-Square with low df | Critical values are lower; a smaller statistic reaches significance. |
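
The last row can be made concrete with a quick scipy check of chi-square critical values:

from scipy import stats

# Chi-square tests reject in the upper tail, so use the 0.95 quantile
for df in (1, 5, 10, 30):
    print(f"df = {df:>2}: chi2_crit = {stats.chi2.ppf(0.95, df):.2f}")
# 3.84, 11.07, 18.31, 43.77: with low df, a much smaller statistic
# reaches significance.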