Central Limit Theorem (CLT)

Definition

Core Statement

The Central Limit Theorem (CLT) states that the sampling distribution of the sample mean will approximate a Normal Distribution as the sample size (n) becomes sufficiently large, regardless of the shape of the population distribution.


Purpose

The CLT is the foundational pillar of inferential statistics. It justifies:

  1. Using Z-scores and T-scores to calculate Confidence Intervals.
  2. Using parametric tests (T-tests, ANOVA) on non-normal data when n is large.
  3. The assumption of normality for sample means in hypothesis testing.
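As a minimal sketch of point 1, the snippet below builds a CLT-justified 95% confidence interval for a population mean from a skewed sample (the exponential data and variable names are illustrative, not from a real dataset):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical sample of 50 observations from a skewed population
sample = rng.exponential(scale=2, size=50)

n = len(sample)
mean = sample.mean()
se = sample.std(ddof=1) / np.sqrt(n)   # standard error of the mean

# 95% CI using the normal approximation the CLT justifies
z = stats.norm.ppf(0.975)              # ≈ 1.96
ci = (mean - z * se, mean + z * se)
print(f"mean = {mean:.3f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

Even though a single exponential observation is far from normal, the interval is valid for the *mean* because n = 50 is large enough for the CLT to take hold.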

When to Use

Rely on CLT When...

  • You are working with sample means (not raw individual data points).
  • Your sample size is n ≥ 30 (classic rule of thumb).
  • You need to make inferences about a population mean from a sample.

CLT Does NOT Apply When...

  • Analyzing individual observations (not means).
  • Sample size is very small (n<15) and the population is heavily skewed.
  • Data has extreme outliers that distort the mean.


Theoretical Background

The Formal Statement

Let $X_1, X_2, \ldots, X_n$ be a random sample of size $n$ from a population with mean $\mu$ and finite variance $\sigma^2$. Define the sample mean:

$$\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i$$

As $n \to \infty$, the standardized sample mean converges in distribution to a Standard Normal:

$$Z_n = \frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}} \xrightarrow{d} N(0, 1)$$

Key Implications

| Concept | Formula | Meaning |
|---|---|---|
| Expected Value | $E[\bar{X}] = \mu$ | The average of sample averages equals the population mean. |
| Standard Error (SE) | $SE = \sigma / \sqrt{n}$ | As $n$ increases, the spread of the sampling distribution shrinks. |
| Precision | $SE \propto 1 / \sqrt{n}$ | To double precision, you need 4× the sample size. |
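The precision claim can be checked empirically. In this hypothetical simulation (the population and sample sizes are chosen arbitrarily), quadrupling $n$ from 25 to 100 roughly halves the empirical standard error:

```python
import numpy as np

rng = np.random.default_rng(1)
pop = rng.exponential(scale=2, size=200_000)   # skewed population, sigma ≈ 2

def empirical_se(n, reps=20_000):
    """Standard deviation of sample means across many samples of size n."""
    means = rng.choice(pop, size=(reps, n)).mean(axis=1)
    return means.std()

se_25, se_100 = empirical_se(25), empirical_se(100)
print(f"SE(n=25) / SE(n=100) = {se_25 / se_100:.2f}")   # ratio ≈ 2
```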

Assumptions

  • Independence: the observations $X_1, \ldots, X_n$ are drawn independently of one another.
  • Identical distribution: all observations come from the same population.
  • Finite variance: the population variance $\sigma^2$ must be finite.

Worked Example: Delivery Truck Safety

Problem

A delivery truck can carry a maximum load of 2100 kg. It is loaded with 40 boxes.

  • The weight of any individual box is a random variable with Mean (μ) = 50 kg and Standard Deviation (σ) = 10 kg.
  • The distribution of box weights is non-normal (some heavy outliers, some light).

Question: What is the probability that the total weight exceeds the 2100 kg limit?

Solution:

  1. Identify Parameters:

    • $n = 40$ (sample size > 30, so the CLT applies).
    • $\mu_{sum} = n \times \mu = 40 \times 50 = 2000$ kg.
    • $\sigma_{sum} = \sqrt{n} \times \sigma = \sqrt{40} \times 10 \approx 6.325 \times 10 = 63.25$ kg.
    • Target value $X = 2100$ kg.
  2. Calculate Z-Score:

    • Since n is large, we approximate the distribution of the sum as Normal.
    $$Z = \frac{X - \mu_{sum}}{\sigma_{sum}} = \frac{2100 - 2000}{63.25} = \frac{100}{63.25} \approx 1.58$$
  3. Find Probability:

    • Looking up $Z = 1.58$ in a Z-table gives an area of 0.9429 to the left.
    • $P(\text{Weight} > 2100) = 1 - 0.9429 = 0.0571$.

Interpretation:
There is approximately a 5.7% chance the truck will be overloaded, even though the expected weight (2000 kg) is well below the limit. This calculation relies on the CLT because the individual box weights are not normally distributed.
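The calculation above can be reproduced in a few lines with SciPy, where `scipy.stats.norm.sf` returns the right-tail probability directly (the small difference from 0.0571 comes from using the unrounded Z-score rather than the Z-table value):

```python
from math import sqrt
from scipy import stats

n, mu, sigma = 40, 50, 10           # boxes; mean and SD of one box (kg)
limit = 2100                        # truck capacity (kg)

mu_sum = n * mu                     # expected total: 2000 kg
sigma_sum = sqrt(n) * sigma         # SD of total: ≈ 63.25 kg

z = (limit - mu_sum) / sigma_sum    # ≈ 1.58
p_over = stats.norm.sf(z)           # P(total weight > 2100)
print(f"z = {z:.2f}, P(overload) = {p_over:.4f}")
```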


Limitations

Pitfalls

  1. The "Population Becomes Normal" Fallacy: A common misconception is that if n is large, the original data becomes normal. False. The population distribution remains exactly the same; only the distribution of means looks normal.
  2. "Large enough" is relative: For symmetric distributions, n ≈ 15 may suffice. For highly skewed distributions (e.g., Exponential), n ≥ 50 may be needed.
  3. Outliers: Extreme values can distort the mean, requiring even larger n for normality.
  4. Does not apply to medians or variances: CLT is specifically about sample means. The sampling distribution of the median or variance has different properties.
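Pitfall 2 can be illustrated with a quick simulation (sample sizes and repetition counts here are arbitrary): the skewness of the sampling distribution of the mean shrinks roughly as $1/\sqrt{n}$, so a heavily skewed population needs a noticeably larger $n$ before the bell shape emerges:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
pop = rng.exponential(scale=2, size=100_000)   # heavily right-skewed (skewness ≈ 2)

for n in (5, 50):
    # Sampling distribution of the mean for samples of size n
    means = rng.choice(pop, size=(5_000, n)).mean(axis=1)
    print(f"n = {n:3d}  skewness of sample means = {stats.skew(means):.3f}")
```

The residual skewness at n = 5 is why the "n ≥ 30" rule is only a rough guide for asymmetric data.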


Python Implementation

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# --- Simulation ---
np.random.seed(42)

# 1. Create a Non-Normal Population (Exponential)
population = np.random.exponential(scale=2, size=100000)

# 2. Draw Many Samples and Calculate Means
sample_sizes = [5, 30, 100]
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

for i, n in enumerate(sample_sizes):
    sample_means = [np.mean(np.random.choice(population, n)) for _ in range(1000)]
    
    # Shapiro-Wilk Test for Normality
    _, p_val = stats.shapiro(sample_means)
    
    sns.histplot(sample_means, kde=True, ax=axes[i], stat='density')
    axes[i].set_title(f'n = {n} (Shapiro p = {p_val:.3f})')

plt.suptitle('CLT in Action: Exponential Pop -> Normal Sampling Distribution')
plt.tight_layout()
plt.show()

R Implementation

library(ggplot2)

set.seed(42)

# 1. Population (Exponential - Right Skewed)
pop <- rexp(100000, rate = 0.5)

# 2. Draw Samples
simulate_means <- function(n, reps = 1000) {
  replicate(reps, mean(sample(pop, n)))
}

means_5 <- simulate_means(5)
means_30 <- simulate_means(30)
means_100 <- simulate_means(100)

# 3. Combine and Plot
df <- data.frame(
  mean = c(means_5, means_30, means_100),
  n = factor(rep(c("n=5", "n=30", "n=100"), each=1000), 
             levels=c("n=5", "n=30", "n=100"))
)

ggplot(df, aes(x=mean, fill=n)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, alpha = 0.7) +
  # Normal overlay uses n = 30, so it matches the middle facet only
  stat_function(fun = dnorm,
                args = list(mean = mean(pop), sd = sd(pop) / sqrt(30)),
                color = "red", linewidth = 1, linetype = "dashed") +
  facet_wrap(~n) +
  labs(title = "CLT Simulation: Exponential -> Normal") +
  theme_minimal()

Interpretation Guide

| Observation | Meaning |
|---|---|
| Sampling distribution is bell-shaped | CLT is working; parametric inference (T-tests, Z-tests) is valid. |
| n < 30 and data skewed | CLT cannot be relied upon. Use non-parametric tests (e.g., Wilcoxon). |
| Standard Error decreases | As $n$ increases, the mean becomes more precise ($SE = \sigma / \sqrt{n}$). |
| Sum vs. Mean | CLT applies to sums as well as means ($\text{Sum} \approx N(n\mu, n\sigma^2)$). |