Poisson Regression

Definition

Core Statement

Poisson Regression is a Generalized Linear Model (GLM) used for modeling count data: non-negative integers representing the number of times an event occurs (e.g., accidents, website clicks, goals scored). It assumes that, conditional on the predictors, the response variable follows a Poisson distribution.

Purpose

Model counts as a function of predictors.
Interpret coefficients as Rate Ratios (multiplicative changes in expected count).
Serve as a baseline for more complex count models (Negative Binomial Regression, Zero-Inflated Models).

When to Use

Use Poisson Regression When...

Outcome is a count (0, 1, 2, ...).
Counts represent events occurring in a fixed interval (time, space, etc.).
Mean = Variance (Equidispersion).

Do NOT Use Poisson When...

Variance > Mean (Overdispersion). Use Negative Binomial Regression.
Excess Zeros. Use Zero-Inflated Models.
Outcome is continuous or binary.

Theoretical Background

The Poisson Distribution

A discrete distribution for the number of events in a fixed interval.

$$P(Y = k) = \frac{λ^{k} e^{-λ}}{k!}$$

where $λ$ is the expected count (rate).

Key Property: $E[Y] = \operatorname{Var}(Y) = λ$.
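The mean-equals-variance property can be checked numerically. The following is a minimal sketch using only the Python standard library; the rate `lam = 3.0` and the truncation point of the sum are illustrative assumptions, not values from this note.

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """P(Y = k) for a Poisson distribution with rate lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

lam = 3.0  # hypothetical rate, chosen only for illustration

# Truncate the infinite sums at k = 99; the tail beyond that is negligible for lam = 3.
ks = range(100)
mean = sum(k * poisson_pmf(k, lam) for k in ks)
variance = sum((k - mean) ** 2 * poisson_pmf(k, lam) for k in ks)

print(f"mean = {mean:.6f}, variance = {variance:.6f}")
```

Both quantities come out equal to the rate, which is the property the equidispersion assumption relies on.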

The Model (Log Link)

Poisson regression models the log of the expected count:

$$\ln(λ) = β_{0} + β_{1} X_{1} + \dots$$

$$λ = e^{β_{0} + β_{1} X_{1} + \dots}$$

Rate Ratio Interpretation

Since the link is logarithmic, coefficients are multiplicative on the original scale.

$$RR = e^{β_{j}}$$

"A 1-unit increase in $X_{j}$ multiplies the expected count by $e^{β_{j}}$, holding the other predictors fixed."

Example: If $β = 0.693$, then $RR = e^{0.693} \approx 2.0$. Each unit increase in $X$ doubles the expected count.
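A short sketch of why the rate ratio does not depend on where $X$ starts. The slope 0.693 comes from the example above; the intercept `b0 = 1.0` and the chosen values of `x` are hypothetical.

```python
import math

def expected_count(x: float, b0: float, b1: float) -> float:
    """Expected count under the log link: lambda = exp(b0 + b1 * x)."""
    return math.exp(b0 + b1 * x)

b0 = 1.0    # hypothetical intercept, for illustration only
b1 = 0.693  # slope from the example above

rate_ratio = math.exp(b1)

# The ratio of expected counts for a 1-unit increase is the same at every x.
for x in (0.0, 1.0, 5.0):
    ratio = expected_count(x + 1, b0, b1) / expected_count(x, b0, b1)
    print(f"x = {x}: ratio = {ratio:.4f}, exp(b1) = {rate_ratio:.4f}")
```

The intercept cancels in the ratio, so only the slope determines the multiplicative effect.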

Assumptions

Count Data: Outcome must be non-negative integers.
Poisson Distribution: Events occur independently at a constant average rate.
Equidispersion: Conditional on the predictors, Mean = Variance. (Critical; often violated!)
Independence: Observations are independent.
Log-linearity: The log of the expected count is a linear function of predictors.

Checking Overdispersion

Always Check!

Calculate the Dispersion Statistic:

$$ϕ = \frac{\text{Pearson } χ^{2}}{\text{Residual degrees of freedom}}$$

$ϕ \approx 1$: Equidispersion. Poisson is OK.
$ϕ > 1.5$ (a common rule of thumb): Overdispersion. Use Negative Binomial.
$ϕ$ clearly below 1: Underdispersion. (Rare.)
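To make the statistic concrete, here is a minimal sketch that computes it by hand. The observed counts and fitted means are made-up numbers, not output from a real fit, and `pearson_dispersion` is a hypothetical helper name.

```python
def pearson_dispersion(observed, fitted, n_params):
    """Pearson chi-square divided by the residual degrees of freedom.

    For a Poisson model the variance function is V(mu) = mu, so each
    observation contributes (y - mu)^2 / mu to the chi-square.
    """
    chi2 = sum((y - mu) ** 2 / mu for y, mu in zip(observed, fitted))
    df_resid = len(observed) - n_params
    return chi2 / df_resid

# Hypothetical counts and fitted means, for illustration only
observed = [0, 3, 1, 7, 2, 12, 4, 0]
fitted = [1.5, 2.0, 2.5, 3.5, 3.0, 6.0, 4.5, 1.0]

phi = pearson_dispersion(observed, fitted, n_params=2)
print(f"Dispersion statistic: {phi:.3f}")
```

With these illustrative numbers the statistic is about 2.3, which the rule of thumb above would flag as overdispersion.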

Limitations

Pitfalls

Overdispersion is Common: Real-world count data often have Variance > Mean. Ignoring this leads to underestimated standard errors and an inflated Type I error rate.
Excess Zeros: Many real datasets have more zeros than Poisson predicts (e.g., "never visited customers"). Use Zero-Inflated Models.
Exposure/Offset: If observation periods differ (e.g., some patients observed for 1 year, others for 2), you need an offset term (the log of the exposure, with its coefficient fixed at 1) so that the model describes rates rather than raw counts.
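The effect of an offset can be sketched with plain arithmetic. Adding $\ln(t)$ to the linear predictor gives $λ = t \cdot e^{β_{0} + β_{1} X_{1}}$, so the expected count scales with exposure $t$ while the rate per unit of exposure stays the same. The coefficients below are hypothetical.

```python
import math

def expected_count(exposure: float, x: float, b0: float, b1: float) -> float:
    """Expected count with log(exposure) entering as an offset (coefficient fixed at 1)."""
    return math.exp(math.log(exposure) + b0 + b1 * x)

b0, b1 = -1.0, 0.3  # hypothetical coefficients, for illustration only

one_year = expected_count(1.0, x=2.0, b0=b0, b1=b1)
two_years = expected_count(2.0, x=2.0, b0=b0, b1=b1)

# Dividing by exposure recovers the same underlying rate for both patients.
rate_one = one_year / 1.0
rate_two = two_years / 2.0
print(f"counts: {one_year:.3f} vs {two_years:.3f}; rates: {rate_one:.3f} vs {rate_two:.3f}")
```

In practice the offset is passed to the fitting routine rather than computed by hand, for example via `offset(log(t))` in an R formula or the `offset` argument of a statsmodels GLM.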

Python Implementation

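import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Assumes `df` is a pandas DataFrame with columns num_awards, math_score, and prog.

# 1. Fit Poisson GLM (the log link is the default for the Poisson family)
model = smf.glm("num_awards ~ math_score + prog", data=df,
                family=sm.families.Poisson()).fit()
print(model.summary())

# 2. Rate Ratios (exponentiated coefficients)
print("\n--- Rate Ratios ---")
print(np.exp(model.params))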

# 3. Check Overdispersion
dispersion = model.pearson_chi2 / model.df_resid
print(f"\nDispersion Statistic: {dispersion:.3f}")
if dispersion > 1.5:
    print("Warning: Overdispersion detected. Consider Negative Binomial.")

R Implementation

# 1. Fit Poisson GLM
model <- glm(num_awards ~ math_score + prog, data = df, family = poisson)
summary(model)

# 2. Rate Ratios
exp(coef(model))
exp(confint(model))

# 3. Check Overdispersion
# Pearson chi-square / Residual DF should be ~ 1
dispersion <- sum(residuals(model, type = "pearson")^2) / df.residual(model)
cat("Dispersion:", dispersion, "\n")

if(dispersion > 1.5) {
  cat("Use MASS::glm.nb() for Negative Binomial\n")
}

Worked Numerical Example

Website Traffic Analysis

Scenario: Predict daily page clicks based on Ad Spend ($).
Model: Poisson Regression

Results:

Intercept ($β_{0}$) = 4.6 (Baseline log-count)
$β_{\text{Spend}}$ = 0.002

Calculations:

Baseline Clicks (Spend = 0): $e^{4.6} \approx 100$ clicks.
Effect of $100 Spend:
- Multiplier = $e^{0.002 \times 100} = e^{0.2} \approx 1.22$.
- Expected Clicks = $100 \times 1.22 \approx 122$.

Interpretation: Each additional $100 of ad spend increases expected clicks by about 22%.
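The arithmetic above can be reproduced in a few lines. This sketch uses only the two coefficients from the results; the variable names are illustrative.

```python
import math

b0 = 4.6         # intercept from the results above
b_spend = 0.002  # coefficient per dollar of ad spend

baseline = math.exp(b0)               # expected clicks at Spend = 0
multiplier = math.exp(b_spend * 100)  # effect of $100 of spend
expected = baseline * multiplier      # expected clicks at Spend = 100
pct_increase = (multiplier - 1) * 100

print(f"baseline clicks:  {baseline:.1f}")
print(f"multiplier:       {multiplier:.4f}")
print(f"expected clicks:  {expected:.1f}")
print(f"percent increase: {pct_increase:.1f}%")
```

Without rounding the intermediate steps, the baseline is about 99.5 clicks and the expected count about 121.5, so the 100 and 122 quoted above are rounded figures. The 22% increase is unaffected.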

Interpretation Guide

Output	Example	Interpretation	Edge Case Notes
Coef (X)	0.07	Log of expected count increases by 0.07 per unit increase in X.	Hard to interpret directly; use RR.
RR (X)	1.07	Each unit increase in X increases the expected count by 7%.	If RR < 1, the expected count decreases (e.g., 0.8 = 20% drop).
Dispersion	1.0	Perfect equidispersion (Mean = Variance).	Ideal Poisson case.
Dispersion	2.3	Overdispersion. Variance > Mean.	Standard errors are underestimated. Switch to Negative Binomial Regression.
Dispersion	0.5	Underdispersion. Variance < Mean.	Rare. Can arise from zero-truncated data or a process that constrains the counts.

Common Pitfall Example

Ignoring Overdispersion

Scenario: Modeling number of fish caught by fishermen.
Data: Mean = 5, Variance = 25 (Variance >> Mean).

The Error:

Analyst fits Poisson.
Finds $β_{\text{bait}}$ significant ($p < 0.001$).
Reports results.

The Problem:

Poisson assumes Mean = Variance.
With variance 5 times the mean, the true standard errors are roughly $\sqrt{5} \approx 2.2$ times larger than the Poisson model reports.
The reported p-value is far too optimistic.

Correction:

Use a quasi-Poisson or Negative Binomial model.
The corrected p-value might be 0.06 (no longer significant at the 0.05 level).
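A rough sketch of how the correction changes the test. It assumes the scenario's variance-to-mean ratio of 5 can be used as the dispersion, and it uses a hypothetical z-statistic of 4.2 for the bait coefficient; that value is not from the scenario and is chosen only so the uncorrected p-value falls below 0.001.

```python
import math

def two_sided_p(z: float) -> float:
    """Two-sided p-value for a standard normal (Wald) z-statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

phi = 25 / 5                    # variance-to-mean ratio from the scenario
se_inflation = math.sqrt(phi)   # factor by which standard errors grow

z_naive = 4.2                   # hypothetical Poisson z-statistic (assumption)
z_corrected = z_naive / se_inflation

p_naive = two_sided_p(z_naive)
p_corrected = two_sided_p(z_corrected)

print(f"SE inflation factor: {se_inflation:.2f}")
print(f"naive p = {p_naive:.5f}, corrected p = {p_corrected:.3f}")
```

Under these assumptions the coefficient moves from highly significant to just above the 0.05 threshold, which is the pattern described in the correction.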

Related Topics

Negative Binomial Regression - Handles overdispersion.
Zero-Inflated Models - Handle excess zeros.
Generalized Linear Models (GLM)
Maximum Likelihood Estimation (MLE)