Poisson Regression

Definition

Core Statement

Poisson Regression is a Generalized Linear Model (GLM) for modeling count data: non-negative integers representing the number of times an event occurs (e.g., accidents, website clicks, goals scored). It assumes that the response variable, conditional on the predictors, follows a Poisson distribution.


Purpose

  1. Model counts as a function of predictors.
  2. Interpret coefficients as Rate Ratios (multiplicative changes in expected count).
  3. Serve as a baseline for more complex count models (Negative Binomial Regression, Zero-Inflated Models).

When to Use

Use Poisson Regression When...

  • Outcome is a count (0, 1, 2, ...).
  • Counts represent events occurring in a fixed interval (time, space, etc.).
  • Mean = Variance (Equidispersion).

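Do NOT Use Poisson When...

  • Variance clearly exceeds the mean (overdispersion); consider Negative Binomial Regression.
  • The data contain more zeros than a Poisson distribution predicts; consider Zero-Inflated Models.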


Theoretical Background

The Poisson Distribution

A discrete distribution for the number of events in a fixed interval.

$$P(Y = k) = \frac{\lambda^k e^{-\lambda}}{k!}$$

where λ is the expected count (rate).

Key Property: E[Y]=Var(Y)=λ.
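
The key property can be checked numerically. The sketch below evaluates the Poisson probability mass function, λ^k e^(−λ) / k!, over a truncated support and compares the resulting mean and variance. The rate of 3.0 and the cutoff at 60 are illustrative choices, not values from the text.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """P(Y = k) for a Poisson distribution with rate lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)


lam = 3.0          # illustrative rate (assumption)
ks = range(0, 60)  # truncated support; the tail mass beyond 60 is negligible for lam = 3
probs = [poisson_pmf(k, lam) for k in ks]

mean = sum(k * p for k, p in zip(ks, probs))
variance = sum((k - mean) ** 2 * p for k, p in zip(ks, probs))

print(f"total probability: {sum(probs):.6f}")
print(f"mean: {mean:.6f}, variance: {variance:.6f}")
```

Both the mean and the variance should come out equal to λ up to floating-point error, which is the equidispersion property that the regression model inherits.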

Poisson regression models the log of the expected count:

$$\ln(\lambda) = \beta_0 + \beta_1 X_1 + \dots$$

$$\lambda = e^{\beta_0 + \beta_1 X_1 + \dots}$$

Rate Ratio Interpretation

Since the link is logarithmic, coefficients are multiplicative on the original scale.

$$RR = e^{\beta_j}$$

"A 1-unit increase in X_j multiplies the expected count by e^{β_j}."

Example: If β = 0.693, then RR = e^{0.693} ≈ 2.0. Each unit increase in X doubles the expected count.
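
A short sketch of why the rate ratio does not depend on the starting value of X: under a log link, the ratio of expected counts at X + 1 and X is always e^β. The intercept of 1.0 below is a hypothetical value chosen only for illustration; the slope 0.693 is the one from the example.

```python
import math


def expected_count(x: float, beta0: float, beta1: float) -> float:
    """Expected count under a log link: exp(beta0 + beta1 * x)."""
    return math.exp(beta0 + beta1 * x)


beta0 = 1.0    # hypothetical intercept (assumption)
beta1 = 0.693  # slope from the example above

rate_ratio = math.exp(beta1)

# Ratio of expected counts for a 1-unit increase, at several starting values of x
ratios = [
    expected_count(x + 1, beta0, beta1) / expected_count(x, beta0, beta1)
    for x in (0, 1, 5)
]

print(f"rate ratio: {rate_ratio:.4f}")
print("ratios at x = 0, 1, 5:", [round(r, 4) for r in ratios])
```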


Assumptions


Checking Overdispersion

Always Check!

Calculate the Dispersion Statistic:

$$\phi = \frac{\text{Pearson } \chi^2}{\text{Residual Degrees of Freedom}}$$

  • ϕ ≈ 1: Equidispersion. Poisson is OK.
  • ϕ>1.5: Overdispersion. Use Negative Binomial.
  • ϕ<1: Underdispersion. (Rare).
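
To make the statistic concrete, here is a minimal sketch that computes it by hand from observed counts and fitted means, taking the Pearson χ² to be the sum of (y − μ)² / μ over observations. The counts, fitted values, and parameter count are made up for illustration and do not come from a real model fit.

```python
def pearson_dispersion(observed, fitted, n_params):
    """Pearson chi-square divided by residual degrees of freedom."""
    pearson_chi2 = sum((y - mu) ** 2 / mu for y, mu in zip(observed, fitted))
    df_resid = len(observed) - n_params
    return pearson_chi2 / df_resid


# Hypothetical data: observed counts and the means a fitted model might return
observed = [0, 2, 1, 7, 3, 12, 4, 0]
fitted = [1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]

phi = pearson_dispersion(observed, fitted, n_params=2)  # intercept + one slope
print(f"dispersion: {phi:.3f}")
if phi > 1.5:
    print("Overdispersion suspected")
```

With these made-up numbers the statistic lands well above the 1.5 rule of thumb, so the overdispersion branch is triggered.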

Limitations

Pitfalls

  1. Overdispersion is Common: Real-world count data often has Variance > Mean. Ignoring this leads to underestimated standard errors and inflated Type I error.
  2. Excess Zeros: Many real datasets have more zeros than Poisson predicts (e.g., customers who never visited). Use Zero-Inflated Models.
  3. Exposure/Offset: If observation periods differ (e.g., some patients observed for 1 year, others for 2), you need an offset term to model rates.
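
The offset idea in item 3 can be sketched directly: adding log(exposure) to the linear predictor with a fixed coefficient of 1 makes the expected count scale in proportion to the observation time, so the coefficients describe a rate per unit of exposure. The coefficient values below are hypothetical.

```python
import math


def expected_count_with_offset(exposure: float, x: float,
                               beta0: float, beta1: float) -> float:
    """log(mu) = log(exposure) + beta0 + beta1 * x, i.e. mu = exposure * rate."""
    return math.exp(math.log(exposure) + beta0 + beta1 * x)


beta0, beta1 = -1.0, 0.3  # hypothetical coefficients (assumption)

one_year = expected_count_with_offset(1.0, 2.0, beta0, beta1)
two_years = expected_count_with_offset(2.0, 2.0, beta0, beta1)

print(f"expected count, 1 year:  {one_year:.4f}")
print(f"expected count, 2 years: {two_years:.4f}")
print(f"rate per year is the same: {two_years / 2.0:.4f}")
```

In practice the offset is passed to the fitting routine rather than computed by hand; statsmodels' GLM accepts `offset` and `exposure` arguments, and R formulas accept an `offset(log(t))` term.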


Python Implementation

import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Assumes `df` is a pandas DataFrame with columns num_awards, math_score, prog.
# 1. Fit Poisson GLM
model = smf.glm("num_awards ~ math_score + prog", data=df,
                family=sm.families.Poisson()).fit()
print(model.summary())

# 2. Rate Ratios (exponentiated coefficients)
print("\n--- Rate Ratios ---")
print(np.exp(model.params))

# 3. Check Overdispersion
dispersion = model.pearson_chi2 / model.df_resid
print(f"\nDispersion Statistic: {dispersion:.3f}")
if dispersion > 1.5:
    print("Warning: Overdispersion detected. Consider Negative Binomial.")

R Implementation

# 1. Fit Poisson GLM
model <- glm(num_awards ~ math_score + prog, data = df, family = poisson)
summary(model)

# 2. Rate Ratios
exp(coef(model))
exp(confint(model))

# 3. Check Overdispersion
# Pearson chi-square / Residual DF should be ~ 1
dispersion <- sum(residuals(model, type = "pearson")^2) / df.residual(model)
cat("Dispersion:", dispersion, "\n")

if(dispersion > 1.5) {
  cat("Use MASS::glm.nb() for Negative Binomial\n")
}

Worked Numerical Example

Website Traffic Analysis

Scenario: Predict daily page clicks based on Ad Spend ($).
Model: Poisson Regression

Results:

  • Intercept (β0) = 4.6 (Baseline log-count)
  • βSpend = 0.002

Calculations:

  • Baseline Clicks (Spend = 0): e^{4.6} ≈ 100 clicks.
  • Effect of $100 Spend:
    • Multiplier = e^{0.002 × 100} = e^{0.2} ≈ 1.22.
    • Expected Clicks ≈ 100 × 1.22 = 122.

Interpretation: Spending $100 increases expected traffic by about 22% relative to no ad spend.
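
The arithmetic can be reproduced in a few lines, using only the two coefficients reported in the results. The variable names are illustrative.

```python
import math

beta0 = 4.6         # intercept from the results above
beta_spend = 0.002  # coefficient per dollar of ad spend

baseline = math.exp(beta0)               # expected clicks at zero spend, about 99.5
multiplier = math.exp(beta_spend * 100)  # rate ratio for a $100 increase, about 1.22
expected = baseline * multiplier         # expected clicks at $100 spend

print(f"baseline: {baseline:.1f}")
print(f"multiplier: {multiplier:.3f}")
print(f"expected clicks at $100: {expected:.1f}")
```

Without intermediate rounding the expected count is closer to 121.5; the figure of 122 comes from rounding the baseline to 100 and the multiplier to 1.22 before multiplying.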


Interpretation Guide

| Output | Example | Interpretation | Edge Case Notes |
|---|---|---|---|
| Coef (X) | 0.07 | Log of expected count increases by 0.07 per unit. | Hard to interpret directly; use RR. |
| RR (X) | 1.07 | Each unit increase in X increases count by 7%. | If RR < 1, count decreases (e.g., 0.8 = 20% drop). |
| Dispersion | 1.0 | Perfect equidispersion (Mean = Variance). | Ideal Poisson case. |
| Dispersion | 2.3 | Overdispersion. Variance > Mean. | Standard errors are wrong. Switch to Negative Binomial Regression. |
| Dispersion | 0.5 | Underdispersion. Variance < Mean. | Rare. Could indicate zero-truncated data or a process-specific constraint. |

Common Pitfall Example

Ignoring Overdispersion

Scenario: Modeling number of fish caught by fishermen.
Data: Mean = 5, Variance = 25 (Variance >> Mean).

The Error:

  • Analyst fits Poisson.
  • Finds β_bait significant (p < 0.001).
  • Reports results.

The Problem:

  • Poisson assumes Mean = Variance.
  • With variance 5x the mean, true standard errors should be about √5 ≈ 2.2 times larger.
  • The reported p-value is way too optimistic.

Correction:

  • Use Quasipoisson or Negative Binomial.
  • Corrected p-value might be 0.06 (not significant!).
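
A rough sketch of how much the correction matters: Quasipoisson-style scaling multiplies each standard error by the square root of the dispersion, which divides the z-statistic by the same factor. The Poisson z-statistic of 4.2 is a hypothetical value consistent with p < 0.001, and the variance-to-mean ratio of the raw data is used as a crude stand-in for the model-based dispersion estimate.

```python
import math


def two_sided_p(z: float) -> float:
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


dispersion = 25 / 5                  # variance / mean from the scenario
se_inflation = math.sqrt(dispersion)

z_poisson = 4.2                      # hypothetical z-statistic (assumption)
z_adjusted = z_poisson / se_inflation

print(f"SE inflation factor: {se_inflation:.2f}")
print(f"naive Poisson p-value: {two_sided_p(z_poisson):.5f}")
print(f"dispersion-adjusted p-value: {two_sided_p(z_adjusted):.3f}")
```

Under these assumptions the adjusted p-value moves from well below 0.001 to just above 0.05, the kind of reversal described in the correction.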