Central Limit Theorem (CLT)

Definition

Core Statement

The Central Limit Theorem (CLT) states that the sampling distribution of the sample mean will approximate a Normal Distribution as the sample size (n) becomes sufficiently large, regardless of the shape of the population distribution.


Purpose

The CLT is the foundational pillar of inferential statistics. It justifies:

  1. Using Z-scores and T-scores to calculate Confidence Intervals.
  2. Using parametric tests (T-tests, ANOVA) on non-normal data when n is large.
  3. The assumption of normality for sample means in hypothesis testing.
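As a minimal sketch of point 1, the snippet below builds a CLT-justified 95% confidence interval for a population mean from a skewed sample (the exponential data and variable names are illustrative, not from a real dataset):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical sample of 50 observations from a skewed population
sample = rng.exponential(scale=2, size=50)

n = len(sample)
mean = sample.mean()
se = sample.std(ddof=1) / np.sqrt(n)   # standard error of the mean

# 95% CI using the normal approximation the CLT justifies
z = stats.norm.ppf(0.975)              # ≈ 1.96
ci = (mean - z * se, mean + z * se)
print(f"mean = {mean:.3f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

Even though a single exponential observation is far from normal, the interval is valid for the *mean* because n = 50 is large enough for the CLT to take hold.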

When to Use

Rely on CLT When...

  • You are working with sample means (not raw individual data points).
  • Your sample size is n ≥ 30 (classic rule of thumb).
  • You need to make inferences about a population mean from a sample.

CLT Does NOT Apply When...

  • Analyzing individual observations (not means).
  • Sample size is very small (n<15) and the population is heavily skewed.
  • Data has extreme outliers that distort the mean.


Theoretical Background

The Formal Statement

Let $X_1, X_2, \ldots, X_n$ be a random sample of size $n$ from a population with mean $\mu$ and finite variance $\sigma^2$. Define the sample mean:

$$\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i$$

As $n \to \infty$, the standardized sample mean converges in distribution to a Standard Normal:

$$Z_n = \frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}} \xrightarrow{d} N(0, 1)$$

Key Implications

| Concept | Formula | Meaning |
|---|---|---|
| Expected Value | $E[\bar{X}] = \mu$ | The average of sample averages equals the population mean. |
| Standard Error (SE) | $SE = \sigma / \sqrt{n}$ | As $n$ increases, the spread of the sampling distribution shrinks. |
| Precision | $SE \propto 1 / \sqrt{n}$ | To double precision, you need 4× the sample size. |
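The precision claim can be checked empirically. In this hypothetical simulation (the population and sample sizes are chosen arbitrarily), quadrupling $n$ from 25 to 100 roughly halves the empirical standard error:

```python
import numpy as np

rng = np.random.default_rng(1)
pop = rng.exponential(scale=2, size=200_000)   # skewed population, sigma ≈ 2

def empirical_se(n, reps=20_000):
    """Standard deviation of sample means across many samples of size n."""
    means = rng.choice(pop, size=(reps, n)).mean(axis=1)
    return means.std()

se_25, se_100 = empirical_se(25), empirical_se(100)
print(f"SE(n=25) / SE(n=100) = {se_25 / se_100:.2f}")   # ratio ≈ 2
```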

Assumptions

  • Independence: the observations $X_1, \ldots, X_n$ are drawn independently of one another.
  • Identical distribution: all observations come from the same population.
  • Finite variance: the population variance $\sigma^2$ must be finite.

Worked Example: Delivery Truck Safety

Problem

A delivery truck can carry a maximum load of 2100 kg. It is loaded with 40 boxes.

  • The weight of any individual box is a random variable with Mean (μ) = 50 kg and Standard Deviation (σ) = 10 kg.
  • The distribution of box weights is non-normal (some heavy outliers, some light).

Question: What is the probability that the total weight exceeds the 2100 kg limit?

Solution:

  1. Identify Parameters:

    • $n = 40$ (sample size > 30, so the CLT applies).
    • $\mu_{sum} = n \times \mu = 40 \times 50 = 2000$ kg.
    • $\sigma_{sum} = \sqrt{n} \times \sigma = \sqrt{40} \times 10 \approx 6.325 \times 10 = 63.25$ kg.
    • Target value $X = 2100$ kg.
  2. Calculate Z-Score:

    • Since n is large, we approximate the distribution of the sum as Normal.
    $$Z = \frac{X - \mu_{sum}}{\sigma_{sum}} = \frac{2100 - 2000}{63.25} = \frac{100}{63.25} \approx 1.58$$
  3. Find Probability:

    • Looking up $Z = 1.58$ in a Z-table gives an area of 0.9429 to the left.
    • $P(\text{Weight} > 2100) = 1 - 0.9429 = 0.0571$.

Interpretation:
There is approximately a 5.7% chance the truck will be overloaded, even though the expected weight (2000 kg) is well below the limit. This calculation relies on the CLT because the individual box weights are not normally distributed.
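The calculation above can be reproduced in a few lines with SciPy, where `scipy.stats.norm.sf` returns the right-tail probability directly (the small difference from 0.0571 comes from using the unrounded Z-score rather than the Z-table value):

```python
from math import sqrt
from scipy import stats

n, mu, sigma = 40, 50, 10           # boxes; mean and SD of one box (kg)
limit = 2100                        # truck capacity (kg)

mu_sum = n * mu                     # expected total: 2000 kg
sigma_sum = sqrt(n) * sigma         # SD of total: ≈ 63.25 kg

z = (limit - mu_sum) / sigma_sum    # ≈ 1.58
p_over = stats.norm.sf(z)           # P(total weight > 2100)
print(f"z = {z:.2f}, P(overload) = {p_over:.4f}")
```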


Limitations

Pitfalls

  1. The "Population Becomes Normal" Fallacy: A common misconception is that if n is large, the original data becomes normal. False. The population distribution remains exactly the same; only the distribution of means looks normal.
  2. "Large enough" is relative: For symmetric distributions, n ≈ 15 may suffice. For highly skewed distributions (e.g., Exponential), n ≥ 50 may be needed.
  3. Outliers: Extreme values can distort the mean, requiring even larger n for normality.
  4. Does not apply to medians or variances: CLT is specifically about sample means. The sampling distribution of the median or variance has different properties.
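Pitfall 2 can be illustrated with a quick simulation (sample sizes and repetition counts here are arbitrary): the skewness of the sampling distribution of the mean shrinks roughly as $1/\sqrt{n}$, so a heavily skewed population needs a noticeably larger $n$ before the bell shape emerges:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
pop = rng.exponential(scale=2, size=100_000)   # heavily right-skewed (skewness ≈ 2)

for n in (5, 50):
    # Sampling distribution of the mean for samples of size n
    means = rng.choice(pop, size=(5_000, n)).mean(axis=1)
    print(f"n = {n:3d}  skewness of sample means = {stats.skew(means):.3f}")
```

The residual skewness at n = 5 is why the "n ≥ 30" rule is only a rough guide for asymmetric data.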


Python Implementation

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# --- Simulation ---
np.random.seed(42)

# 1. Create a Non-Normal Population (Exponential)
population = np.random.exponential(scale=2, size=100000)

# 2. Draw Many Samples and Calculate Means
sample_sizes = [5, 30, 100]
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

for i, n in enumerate(sample_sizes):
    sample_means = [np.mean(np.random.choice(population, n)) for _ in range(1000)]
    
    # Shapiro-Wilk Test for Normality
    _, p_val = stats.shapiro(sample_means)
    
    sns.histplot(sample_means, kde=True, ax=axes[i], stat='density')
    axes[i].set_title(f'n = {n} (Shapiro p = {p_val:.3f})')

plt.suptitle('CLT in Action: Exponential Pop -> Normal Sampling Distribution')
plt.tight_layout()
plt.show()

R Implementation

library(ggplot2)

set.seed(42)

# 1. Population (Exponential - Right Skewed)
pop <- rexp(100000, rate = 0.5)

# 2. Draw Samples
simulate_means <- function(n, reps = 1000) {
  replicate(reps, mean(sample(pop, n)))
}

means_5 <- simulate_means(5)
means_30 <- simulate_means(30)
means_100 <- simulate_means(100)

# 3. Combine and Plot
df <- data.frame(
  mean = c(means_5, means_30, means_100),
  n = factor(rep(c("n=5", "n=30", "n=100"), each=1000), 
             levels=c("n=5", "n=30", "n=100"))
)

ggplot(df, aes(x=mean, fill=n)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, alpha = 0.7) +
  # Normal overlay uses n = 30, so it matches the middle facet only
  stat_function(fun = dnorm,
                args = list(mean = mean(pop), sd = sd(pop) / sqrt(30)),
                color = "red", linewidth = 1, linetype = "dashed") +
  facet_wrap(~n) +
  labs(title = "CLT Simulation: Exponential -> Normal") +
  theme_minimal()

Interpretation Guide

| Observation | Meaning |
|---|---|
| Sampling distribution is bell-shaped | CLT is working; parametric inference (T-tests, Z-tests) is valid. |
| n < 30 and data skewed | CLT cannot be relied upon. Use non-parametric tests (e.g., Wilcoxon). |
| Standard Error decreases | As $n$ increases, the mean becomes more precise ($SE = \sigma / \sqrt{n}$). |
| Sum vs. Mean | CLT applies to sums as well as means ($\text{Sum} \approx N(n\mu, n\sigma^2)$). |