Standard Deviation

Standard Deviation

Definition

Core Statement

Standard Deviation (SD or σ) quantifies the amount of variation or dispersion in a dataset. It measures how far individual observations typically deviate from the mean. A low SD indicates data points cluster close to the mean; a high SD indicates they are spread out.


Purpose

  1. Describe the spread of a distribution.
  2. Identify outliers (values > 2-3 SD from mean are unusual).
  3. Standardize scores (Z-scores = deviations in SD units).
  4. Foundation for Standard Error, Confidence Intervals, and inference.

When to Use

Use Standard Deviation When...

  • Describing the variability of continuous data.
  • Data is approximately normally distributed.
  • You need a measure of spread in the same units as the data.

Alternatives

  • Non-normal data: Use Interquartile Range (IQR) or Median Absolute Deviation (MAD).
  • Ordinal data: Use range or IQR.


Theoretical Background

Population vs Sample

Type Formula When to Use
Population SD (σ) σ=1Ni=1N(Xiμ)2 You have data for the entire population.
Sample SD (s) s=1n1i=1n(XiX¯)2 You have a sample from a population.
Why n1 (Bessel's Correction)?

Using n would underestimate the population variance. The n1 correction makes the sample variance an unbiased estimator of the population variance.

Variance (s2 or σ2)

Variance is the squared standard deviation. SD is preferred for interpretation because it's in the original units.

Properties


Assumptions


Limitations

Pitfalls

  1. Sensitive to Outliers: A single extreme value inflates SD. Use MAD for robustness.
  2. Assumes Symmetry: For skewed distributions, SD can be misleading. Report median and IQR instead.
  3. Not Comparable Across Different Scales: SD of income ($) vs SD of age (years) are meaningless to compare. Use Coefficient of Variation (CV = SD/Mean).


Python Implementation

import numpy as np

data = np.array([10, 12, 23, 23, 16, 23, 21, 16])

# Sample Standard Deviation (default: ddof=1)
sd_sample = np.std(data, ddof=1)
print(f"Sample SD: {sd_sample:.2f}")

# Population Standard Deviation (ddof=0)
sd_pop = np.std(data, ddof=0)
print(f"Population SD: {sd_pop:.2f}")

# Variance
var_sample = np.var(data, ddof=1)
print(f"Sample Variance: {var_sample:.2f}")

R Implementation

data <- c(10, 12, 23, 23, 16, 23, 21, 16)

# Sample Standard Deviation (default in R)
sd(data)

# Variance
var(data)

# Population SD (manual)
sqrt(sum((data - mean(data))^2) / length(data))

Interpretation Guide

Scenario Interpretation
SD = 5, Mean = 50 Typical value is within ±5 of 50.
SD = 0 All values are identical.
SD very large relative to mean High variability; data is spread out.
CV = SD/Mean = 0.1 Low relative variability (10%).