Boxplot

Definition

Core Statement

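A Boxplot (or Box-and-Whisker Plot) is a standardized way of displaying the distribution of data based on a five-number summary: Minimum, First Quartile (Q1), Median (Q2), Third Quartile (Q3), and Maximum. It is a standard tool for flagging potential outliers and visualizing skewness. In the common variant described here, the whiskers stop at the most extreme points within 1.5×IQR of the box, and points beyond them are drawn individually.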
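
As a rough illustration, the five-number summary can be computed with Python's standard library. This is a sketch rather than a canonical method: the function name `five_number_summary` is made up for this note, and `method="exclusive"` is only one of several quartile conventions, so other tools may report slightly different Q1 and Q3 values for the same data.

```python
import statistics

def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) for a sequence of numbers."""
    values = sorted(data)
    # With n=4, statistics.quantiles returns the three quartile cut points.
    q1, q2, q3 = statistics.quantiles(values, n=4, method="exclusive")
    return values[0], q1, q2, q3, values[-1]

summary = five_number_summary([10, 12, 13, 15, 16, 19, 20, 22, 100])
print(summary)
```

The maximum here is the raw largest value; whether a boxplot draws its whisker that far depends on the 1.5×IQR rule.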


Purpose

  1. Identify Outliers: Points beyond the "whiskers" are flagged as potential outliers.
  2. Visualize Spread: The box size (IQR) shows the variability of the middle 50% of data.
  3. Comparisons: Ideally suited for comparing distributions side-by-side across groups (e.g., Salary by Department).
  4. Detect Skewness: Asymmetry in the box or whiskers indicates skew.

The Anatomy of a Boxplot

  1. Median (Q2, 50th percentile): The line inside the box.
  2. The Box (IQR): Spans from Q1 (25th percentile) to Q3 (75th percentile). Contains the middle 50% of data.
  3. The Whiskers: Extend from each edge of the box to the furthest data point within 1.5×IQR of that edge (down to Q1 − 1.5×IQR, up to Q3 + 1.5×IQR).
  4. Outliers (Diamonds/Dots): Any point beyond the whiskers.
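
A small sketch of item 3, assuming the quartiles are already known: the whiskers stop at the most extreme observations that still lie inside the 1.5×IQR limits, not at the limits themselves. The helper name `whisker_ends` and the parameter `k` are illustrative.

```python
def whisker_ends(data, q1, q3, k=1.5):
    """Return (low, high): the most extreme data points within k*IQR of the box."""
    iqr = q3 - q1
    lower_limit = q1 - k * iqr
    upper_limit = q3 + k * iqr
    inside = [x for x in data if lower_limit <= x <= upper_limit]
    return min(inside), max(inside)

# Quartiles are passed in because different tools compute them differently.
low, high = whisker_ends([10, 12, 13, 15, 16, 19, 20, 22, 100], q1=12.5, q3=21)
print(low, high)
```

With these inputs the upper limit is 33.75, but the upper whisker is drawn at 22 because that is the largest observation below the limit.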

Theoretical Background

Calculating IQR and Fences

IQR = Q3 − Q1
Lower Fence = Q1 − 1.5×IQR
Upper Fence = Q3 + 1.5×IQR

Any data point x such that x < Lower Fence or x > Upper Fence is an outlier.
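
The fence rule can be written as a short function. This is a minimal sketch under the usual convention (IQR = Q3 − Q1, fences placed 1.5×IQR beyond the quartiles); the names `fences`, `is_outlier`, and the multiplier parameter `k` are illustrative rather than taken from any library.

```python
def fences(q1, q3, k=1.5):
    """Return (lower_fence, upper_fence) for quartiles q1 and q3."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def is_outlier(x, q1, q3, k=1.5):
    """True when x lies strictly outside the fences."""
    lower, upper = fences(q1, q3, k)
    return x < lower or x > upper

# IQR is 10 here, so the fences sit 15 below Q1 and 15 above Q3.
print(fences(10.0, 20.0))
print(is_outlier(36.0, 10.0, 20.0))
```

A point exactly on a fence is not flagged, which matches the strict inequalities in the definition.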


Worked Example: Detecting Outliers

Problem

Data: [10,12,13,15,16,19,20,22,100].
Task: Draw boxplot and find outliers.

  1. Find Quartiles:

    • Sort: 10,12,13,15,16,19,20,22,100.
    • Median (Q2): 16.
    • Q1 (Median of first half 10,12,13,15): 12.5.
    • Q3 (Median of second half 19,20,22,100): 21.
  2. Calculate IQR:

    • IQR = 21 − 12.5 = 8.5.
  3. Calculate Fences:

    • Lower: 12.5 − (1.5×8.5) = 12.5 − 12.75 = −0.25.
    • Upper: 21 + (1.5×8.5) = 21 + 12.75 = 33.75.
  4. Identify Outliers:

    • Is 100 > 33.75? Yes. 100 is an outlier.
    • Is 10 < −0.25? No.

Visualization: The box spans 12.5 to 21, with the median line at 16. The left whisker ends at 10 and the right whisker ends at 22. The point at 100 is drawn as an isolated dot.
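
The arithmetic above can be checked in a few lines. The sketch below uses the same median-of-halves rule as the worked example (the median itself is left out of both halves when n is odd); `quartiles_by_halves` is a name chosen here for illustration. Libraries that interpolate between order statistics can return different quartiles for the same data, so their fences may not match these exactly.

```python
import statistics

def quartiles_by_halves(data):
    """Q1, median, Q3 via medians of the lower and upper halves (needs >= 2 values)."""
    values = sorted(data)
    half = len(values) // 2
    lower = values[:half]    # excludes the median when n is odd
    upper = values[-half:]
    return statistics.median(lower), statistics.median(values), statistics.median(upper)

data = [10, 12, 13, 15, 16, 19, 20, 22, 100]
q1, q2, q3 = quartiles_by_halves(data)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, q2, q3)                # 12.5 16 21.0
print(lower_fence, upper_fence)  # -0.25 33.75
print(outliers)                  # [100]
```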


Interpretation Guide

Visual Assessment         | Meaning
Median at center of box   | Roughly symmetric distribution (at least in the middle 50%).
Median closer to bottom   | Right-skewed (positive skew). Tail extends up.
Median closer to top      | Left-skewed (negative skew). Tail extends down.
Many outliers             | Heavy-tailed distribution (e.g., Cauchy / Log-Normal).
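
The "median closer to bottom or top" reading can also be expressed as a number. One common choice, not mentioned in the original note, is the quartile (Bowley) skewness coefficient, (Q3 + Q1 − 2×Q2) / (Q3 − Q1): it is positive when the median sits nearer Q1 and negative when it sits nearer Q3. The function name below is illustrative.

```python
def quartile_skewness(q1, q2, q3):
    """Bowley's quartile skewness: between -1 and 1, and 0 when the median is centered."""
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("IQR is zero; quartile skewness is undefined")
    return (q3 + q1 - 2 * q2) / iqr

print(quartile_skewness(10, 15, 20))  # median centered in the box
print(quartile_skewness(10, 12, 20))  # median nearer Q1: positive value
print(quartile_skewness(10, 18, 20))  # median nearer Q3: negative value
```

This coefficient only describes the box, so it should be read together with the whisker lengths and any outliers.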

Limitations & Pitfalls

Pitfalls

  1. Hides Multimodality: A boxplot cannot distinguish a unimodal, bell-shaped distribution from a bimodal (two-peak) one, because the two can share identical quartiles. Use a Violin Plot or Histogram to check shape.
  2. Sample Size Blindness: A boxplot of n=5 looks just as authoritative as n=5000. Always annotate with n.
  3. The "1.5" rule is a convention: Using 1.5×IQR works well for Normal data (about 0.7% of points are flagged), but for naturally skewed data it marks many valid points as outliers.
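
The 0.7% figure in item 3 can be reproduced from the standard normal distribution, and the same calculation shows how the rate grows for a skewed distribution. The exponential distribution is used here purely as a convenient skewed example with closed-form quantiles; it is an assumption of this sketch, not a case discussed above.

```python
import math
from statistics import NormalDist

def normal_flag_rate(k=1.5):
    """Share of a normal population lying outside the k*IQR fences."""
    z = NormalDist()                        # standard normal
    q1, q3 = z.inv_cdf(0.25), z.inv_cdf(0.75)
    upper = q3 + k * (q3 - q1)
    return 2 * (1 - z.cdf(upper))           # symmetric, so double the upper tail

def exponential_flag_rate(k=1.5):
    """Same share for an Exponential(rate=1) population."""
    q1, q3 = math.log(4 / 3), math.log(4)   # quantile function is -ln(1 - p)
    lower = q1 - k * (q3 - q1)
    upper = q3 + k * (q3 - q1)
    below = 1 - math.exp(-lower) if lower > 0 else 0.0
    return below + math.exp(-upper)         # survival function is exp(-x)

print(f"normal:      {normal_flag_rate():.2%}")
print(f"exponential: {exponential_flag_rate():.2%}")
```

Under these assumptions the normal rate comes out at about 0.7%, while the exponential rate is close to 5%, all of it in the long right tail.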


Python Implementation

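import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Placeholder data for illustration only; substitute a real DataFrame
# with 'Department' and 'Salary' columns.
df = pd.DataFrame({
    'Department': ['Sales', 'Sales', 'Sales', 'IT', 'IT', 'IT'],
    'Salary': [40000, 42000, 95000, 60000, 62000, 65000],
})

# Comparative boxplot: one box per department
sns.boxplot(x='Department', y='Salary', data=df)
plt.title("Salary Distribution by Department")
plt.show()

# Detect outliers with the 1.5×IQR rule, computed over all salaries together.
# pandas interpolates quantiles linearly by default, so Q1 and Q3 can differ
# slightly from the median-of-halves method.
Q1 = df['Salary'].quantile(0.25)
Q3 = df['Salary'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['Salary'] < Q1 - 1.5 * IQR) | (df['Salary'] > Q3 + 1.5 * IQR)]
print(outliers)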