Standard Deviation
Standard Deviation
Definition
Core Statement
Standard Deviation (SD or
Purpose
- Describe the spread of a distribution.
- Identify outliers (values > 2-3 SD from mean are unusual).
- Standardize scores (Z-scores = deviations in SD units).
- Foundation for Standard Error, Confidence Intervals, and inference.
When to Use
Use Standard Deviation When...
- Describing the variability of continuous data.
- Data is approximately normally distributed.
- You need a measure of spread in the same units as the data.
Alternatives
- Non-normal data: Use Interquartile Range (IQR) or Median Absolute Deviation (MAD).
- Ordinal data: Use range or IQR.
Theoretical Background
Population vs Sample
| Type | Formula | When to Use |
|---|---|---|
| Population SD ( |
You have data for the entire population. | |
| Sample SD ( |
You have a sample from a population. |
Why (Bessel's Correction)?
Using
Variance ( or )
Variance is the squared standard deviation. SD is preferred for interpretation because it's in the original units.
Properties
- Always non-negative:
. - Zero only if all values are identical.
- Affected by outliers: One extreme value can inflate SD dramatically.
Assumptions
Limitations
Pitfalls
- Sensitive to Outliers: A single extreme value inflates SD. Use MAD for robustness.
- Assumes Symmetry: For skewed distributions, SD can be misleading. Report median and IQR instead.
- Not Comparable Across Different Scales: SD of income ($) vs SD of age (years) are meaningless to compare. Use Coefficient of Variation (CV = SD/Mean).
Python Implementation
import numpy as np
data = np.array([10, 12, 23, 23, 16, 23, 21, 16])
# Sample Standard Deviation (default: ddof=1)
sd_sample = np.std(data, ddof=1)
print(f"Sample SD: {sd_sample:.2f}")
# Population Standard Deviation (ddof=0)
sd_pop = np.std(data, ddof=0)
print(f"Population SD: {sd_pop:.2f}")
# Variance
var_sample = np.var(data, ddof=1)
print(f"Sample Variance: {var_sample:.2f}")
R Implementation
data <- c(10, 12, 23, 23, 16, 23, 21, 16)
# Sample Standard Deviation (default in R)
sd(data)
# Variance
var(data)
# Population SD (manual)
sqrt(sum((data - mean(data))^2) / length(data))
Interpretation Guide
| Scenario | Interpretation |
|---|---|
| SD = 5, Mean = 50 | Typical value is within ±5 of 50. |
| SD = 0 | All values are identical. |
| SD very large relative to mean | High variability; data is spread out. |
| CV = SD/Mean = 0.1 | Low relative variability (10%). |
Related Concepts
- Standard Error - SD of a sampling distribution.
- Normal Distribution - 68-95-99.7 rule uses SD.
- Variance
- Coefficient of Variation
- Z-Scores