Normal Distribution

Definition

Core Statement

The Normal Distribution (Gaussian Distribution) is a continuous probability distribution that is symmetric about the mean, with the property that data points near the mean are more frequent than those far from the mean. It is fully characterized by two parameters: mean (μ) and standard deviation (σ).


Purpose

The Normal Distribution is central to statistics because:

  1. Many natural phenomena approximate normality (height, IQ, measurement error).
  2. The Central Limit Theorem (CLT) ensures the sampling distribution of the mean is approximately normal for sufficiently large samples, regardless of the population distribution.
  3. Most parametric tests (T-tests, ANOVA, Regression) assume normality of residuals.
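Point 2 can be checked empirically. The sketch below (using an Exponential(1) population purely as an illustrative non-normal example) shows that sample means land near the CLT prediction N(μ, σ/√n), even though the parent distribution is strongly skewed:

```python
import numpy as np

rng = np.random.default_rng(42)

# Skewed parent population: Exponential(1) has mean 1 and std 1
n, reps = 50, 10_000
sample_means = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)

# CLT prediction: means are approx. N(mu, sigma / sqrt(n)) = N(1, 0.141)
print(f"Empirical mean of sample means: {sample_means.mean():.3f}")  # near 1.0
print(f"Empirical std of sample means:  {sample_means.std():.3f}")   # near 1/sqrt(50) ≈ 0.141
```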

When to Use

Assume Normality When...

  • Modeling continuous variables known to be approximately normal.
  • Working with sample means of large samples (via CLT).
  • Residuals of a regression model have been verified to be normal.

Do NOT Assume Normality When...

  • Data is discrete (use Binomial, Poisson).
  • Data is bounded (e.g., proportions on [0, 1], which are often better modeled by a Beta distribution).
  • Data has heavy tails or extreme outliers.


Theoretical Background

Probability Density Function (PDF)

f(x | μ, σ) = (1 / (σ√(2π))) · exp( −½ · ((x − μ)/σ)² )

Properties

Property  | Value
----------|-------------------------
Mean      | μ
Median    | μ
Mode      | μ
Variance  | σ²
Skewness  | 0 (perfectly symmetric)
Kurtosis  | 3 (mesokurtic)
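These values can be cross-checked with SciPy. One caveat: scipy.stats reports Fisher (excess) kurtosis, so a normal distribution shows 0 rather than the Pearson value of 3 listed above:

```python
from scipy import stats

# Theoretical moments of the standard normal N(0, 1);
# 'mvsk' returns (mean, variance, skewness, kurtosis)
mean, var, skew, kurt = stats.norm.stats(moments='mvsk')

print(float(mean), float(var))  # 0.0 1.0
print(float(skew))              # 0.0, perfectly symmetric
print(float(kurt))              # 0.0, scipy reports *excess* kurtosis (Pearson 3 minus 3)
```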

The Empirical Rule (68-95-99.7)

Memorize This

  • 68.27% of data falls within μ±1σ
  • 95.45% of data falls within μ±2σ
  • 99.73% of data falls within μ±3σ
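These coverages follow directly from the standard normal CDF, so the rule can be verified in one loop:

```python
from scipy import stats

# P(mu - k*sigma < X < mu + k*sigma) = Phi(k) - Phi(-k) for k = 1, 2, 3
for k in (1, 2, 3):
    coverage = stats.norm.cdf(k) - stats.norm.cdf(-k)
    print(f"Within ±{k}σ: {coverage:.4%}")  # 68.27%, 95.45%, 99.73%
```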

The Standard Normal Distribution (Z)

Any normal variable X ~ N(μ, σ²) can be standardized to Z ~ N(0, 1):

Z = (X − μ) / σ

Interpretation: Z represents how many standard deviations X is away from the mean.
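A quick sketch of standardization on an IQ-style scale (μ = 100, σ = 15, the same example used in the implementation sections below); probabilities computed on the Z scale match those on the original scale:

```python
from scipy import stats

mu, sigma = 100, 15   # IQ-style scale
x = 130

z = (x - mu) / sigma  # standardize
print(f"Z = {z:.1f}")  # 2.0: x lies two standard deviations above the mean

# Same probability whether computed on the original scale or the Z scale
print(stats.norm.cdf(x, loc=mu, scale=sigma))
print(stats.norm.cdf(z))
```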


Limitations

Pitfalls

  1. Normality is often assumed, not verified. Always run diagnostics.
  2. Sensitivity to Outliers: a few extreme values can cause normality tests to reject even when the bulk of the data is well-behaved.
  3. Misuse with Proportions/Counts: proportions and counts are bounded or discrete; model them with Beta, Binomial, or Poisson distributions rather than forcing a normal approximation.
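Pitfall 2 is easy to demonstrate: appending a single gross outlier to otherwise normal data flips the Shapiro-Wilk verdict (a simulated sketch, exact p-values depend on the seed):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
clean = rng.normal(loc=0.0, scale=1.0, size=100)

# Contaminate with a single gross outlier, 15 standard deviations out
contaminated = np.append(clean, 15.0)

stat_clean, p_clean = stats.shapiro(clean)
stat_bad, p_bad = stats.shapiro(contaminated)

print(f"Clean p-value:        {p_clean:.4f}")
print(f"Contaminated p-value: {p_bad:.3g}")  # collapses toward zero
```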


Assessing Normality

Visual Methods

  1. Histogram: Should appear symmetric and bell-shaped.
  2. Q-Q Plot: Points should fall on the 45-degree diagonal reference line.

Statistical Tests

Test               | Best For                  | Null Hypothesis
-------------------|---------------------------|----------------
Shapiro-Wilk       | Small samples (n < 50)    | Data is normal
Kolmogorov-Smirnov | Larger samples (n > 50)   | Data is normal
D'Agostino-Pearson | Skewness/kurtosis checks  | Data is normal
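The D'Agostino-Pearson test is available in SciPy as stats.normaltest, which combines sample skewness and kurtosis into a single statistic. A small comparison (simulated data, so exact p-values vary with the seed):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
normal_data = rng.normal(size=300)       # symmetric, bell-shaped
skewed_data = rng.exponential(size=300)  # strongly right-skewed

# stats.normaltest implements the D'Agostino-Pearson omnibus test
stat_n, p_n = stats.normaltest(normal_data)
stat_s, p_s = stats.normaltest(skewed_data)

print(f"Normal data: p = {p_n:.4g}")
print(f"Skewed data: p = {p_s:.4g}")  # tiny p-value, normality rejected
```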

Python Implementation

from scipy import stats
import numpy as np
import matplotlib.pyplot as plt

# --- Properties ---
mu, sigma = 100, 15  # IQ example

# P(X < 115) = ?
prob = stats.norm.cdf(115, loc=mu, scale=sigma)
print(f"P(X < 115): {prob:.4f}")

# What value is at the 90th percentile?
x_90 = stats.norm.ppf(0.90, loc=mu, scale=sigma)
print(f"90th Percentile: {x_90:.2f}")

# --- Generate and Test ---
data = np.random.normal(mu, sigma, 200)

# Shapiro-Wilk
stat, p = stats.shapiro(data)
print(f"Shapiro-Wilk p-value: {p:.4f}")

# Q-Q Plot
import statsmodels.api as sm
sm.qqplot(data, fit=True, line='45')  # fit=True standardizes the data so the 45° line applies
plt.title("Q-Q Plot")
plt.show()

R Implementation

# --- Properties ---
mu <- 100
sigma <- 15

# P(X < 115)
pnorm(115, mean = mu, sd = sigma)

# 90th Percentile
qnorm(0.90, mean = mu, sd = sigma)

# --- Generate and Test ---
data <- rnorm(200, mean = mu, sd = sigma)

# Shapiro-Wilk Test
shapiro.test(data)

# Q-Q Plot
qqnorm(data, main = "Q-Q Plot for Normality")
qqline(data, col = "red", lwd = 2)

Interpretation Guide

Scenario                 | Interpretation
-------------------------|---------------------------------------------------------------
Z = 0                    | Observation is exactly at the mean.
Z = 2                    | Observation is 2 std. deviations above the mean (top ~2.3%).
Z = −1.5                 | Observation is 1.5 std. deviations below the mean.
Shapiro p > 0.05         | Fail to reject H0; data is consistent with normality.
Q-Q plot curves at tails | Data has heavier/lighter tails than normal (kurtosis issue).