Poisson Regression
Definition
Poisson Regression is a Generalized Linear Model (GLM) for count data: non-negative integers that record how many times an event occurs (e.g., accidents, website clicks, goals scored). It assumes that the response variable, conditional on the predictors, follows a Poisson distribution.
Purpose
- Model counts as a function of predictors.
- Interpret coefficients as Rate Ratios (multiplicative changes in expected count).
- Serve as a baseline for more complex count models (Negative Binomial Regression, Zero-Inflated Models).
When to Use
Use when:
- Outcome is a count (0, 1, 2, ...).
- Counts represent events occurring in a fixed interval (time, space, etc.).
- Mean = Variance (Equidispersion).
Do not use when:
- Variance > Mean (Overdispersion). Use Negative Binomial Regression instead.
- The data contain excess zeros. Use Zero-Inflated Models instead.
- Outcome is continuous or binary.
Theoretical Background
The Poisson Distribution
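A discrete distribution for the number of events in a fixed interval.
$$P(Y = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \quad k = 0, 1, 2, \dots$$
where $\lambda > 0$ is the expected number of events in the interval.
Key Property: $E[Y] = \mathrm{Var}(Y) = \lambda$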
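The key property can be checked numerically. The sketch below evaluates the Poisson probability mass function, $P(Y = k) = \lambda^k e^{-\lambda} / k!$, over a truncated support and computes the mean and variance directly from it. The rate `lam = 3.0` and the truncation point are illustrative choices, not values from this note.

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """P(Y = k) for a Poisson variable with rate lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

lam = 3.0  # illustrative rate (an assumption, not from the text)

# Truncate the infinite support at k = 99; for lam = 3 the tail beyond is negligible.
support = range(100)
probs = [poisson_pmf(k, lam) for k in support]

total = sum(probs)
mean = sum(k * p for k, p in zip(support, probs))
variance = sum((k - mean) ** 2 * p for k, p in zip(support, probs))

print(f"total probability: {total:.6f}")
print(f"mean: {mean:.6f}, variance: {variance:.6f}")
```

Both the mean and the variance come out equal to `lam`, which is the equidispersion property the model relies on.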
The Model (Log Link)
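Poisson regression models the log of the expected count:
$$\log(E[Y \mid X]) = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p$$
Equivalently, $E[Y \mid X] = \exp(\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p)$, so the expected count is always positive.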
Rate Ratio Interpretation
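Since the link is logarithmic, coefficients are multiplicative on the original scale: $e^{\beta_j}$ is the rate ratio for predictor $X_j$.
"A 1-unit increase in $X_j$ multiplies the expected count by $e^{\beta_j}$, holding the other predictors constant."
Example: If $\beta_j = 0.07$, then $e^{0.07} \approx 1.07$, so each 1-unit increase in $X_j$ raises the expected count by about 7%.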
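A minimal sketch of why the exponentiated coefficient is a rate ratio: under the log link, the ratio of expected counts at `x + 1` and `x` is the same for every `x`. The coefficients `b0` and `b1` below are hypothetical values for a single-predictor model, chosen only for illustration.

```python
import math

# Hypothetical coefficients for the model log(E[Y]) = b0 + b1 * x
b0, b1 = 1.0, 0.07

def expected_count(x: float) -> float:
    """Invert the log link: E[Y | x] = exp(b0 + b1 * x)."""
    return math.exp(b0 + b1 * x)

rate_ratio = math.exp(b1)

for x in (0.0, 5.0, 20.0):
    ratio = expected_count(x + 1) / expected_count(x)
    # The ratio does not depend on x and equals exp(b1).
    print(f"x = {x:>4}: E[Y | x+1] / E[Y | x] = {ratio:.4f}")

print(f"exp(b1) = {rate_ratio:.4f}")
```

The absolute change in expected count does depend on `x`; only the multiplicative change is constant, which is why rate ratios are the natural reporting scale.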
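Assumptions
- The outcome is a non-negative integer count.
- Observations are independent.
- The log of the expected count is a linear function of the predictors.
- Equidispersion: the conditional variance equals the conditional mean.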
Checking Overdispersion
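Calculate the Dispersion Statistic:
$$\hat{\phi} = \frac{\chi^2_{\text{Pearson}}}{n - p} = \frac{1}{n - p} \sum_{i=1}^{n} \frac{(y_i - \hat{\mu}_i)^2}{\hat{\mu}_i}$$
where $n - p$ is the residual degrees of freedom. Residual deviance divided by residual degrees of freedom is a common alternative.
- $\hat{\phi} \approx 1$: Equidispersion. Poisson is OK.
- $\hat{\phi} > 1$: Overdispersion. Use Negative Binomial.
- $\hat{\phi} < 1$: Underdispersion. (Rare.)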
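The dispersion statistic can also be computed by hand from observed counts and fitted means, as the Pearson chi-square divided by the residual degrees of freedom. The sketch below assumes that definition; the counts, fitted means, and parameter count are made up for illustration, and `pearson_dispersion` is a hypothetical helper rather than a library function.

```python
def pearson_dispersion(y, mu, n_params):
    """Pearson chi-square divided by residual degrees of freedom (n - p)."""
    chi2 = sum((yi - mi) ** 2 / mi for yi, mi in zip(y, mu))
    return chi2 / (len(y) - n_params)

# Made-up observed counts and fitted means, for illustration only.
y = [0, 3, 1, 7, 2, 12, 4, 0]
mu = [1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]

# n_params = 2 stands for an intercept plus one slope.
phi = pearson_dispersion(y, mu, n_params=2)
print(f"dispersion = {phi:.2f}")
```

A value this far above 1 would point toward a Negative Binomial or quasi-Poisson fit rather than plain Poisson.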
Limitations
- Overdispersion is Common: Real-world count data often has Variance > Mean. Ignoring this leads to underestimated standard errors and inflated Type I error.
- Excess Zeros: Many real datasets have more zeros than Poisson predicts (e.g., "never visited customers"). Use Zero-Inflated Models.
- Exposure/Offset: If observation periods differ (e.g., some patients observed for 1 year, others for 2), you need an offset term to model rates.
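To illustrate the offset idea, consider the simplest case: an intercept-only model with an offset, $\log(E[Y_i]) = \beta_0 + \log(t_i)$, where $t_i$ is the exposure. In that special case the maximum likelihood estimate has a closed form, $e^{\hat{\beta}_0} = \sum y_i / \sum t_i$, which is the pooled event rate. The patient data below are made up for illustration.

```python
import math

# Made-up data: event counts and years of follow-up for four patients.
events = [2, 5, 1, 4]
years = [1.0, 2.0, 1.0, 2.0]

# Intercept-only model with offset: log(E[Y_i]) = b0 + log(t_i).
# Its MLE is closed-form: exp(b0) = sum(y) / sum(t), the pooled rate per year.
b0 = math.log(sum(events) / sum(years))
rate = math.exp(b0)
print(f"estimated rate: {rate:.2f} events per year")

# The offset has a fixed coefficient of 1, so expected counts
# scale in direct proportion to exposure.
for t in (1.0, 2.0):
    expected = math.exp(b0 + math.log(t))
    print(f"exposure {t} years -> expected count {expected:.2f}")
```

With predictors in the model there is no closed form, and the offset is passed to the fitting routine instead, for example through the `offset` argument of the statsmodels GLM (with the log of a hypothetical exposure column) or an `offset(log(...))` term in an R formula.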
Python Implementation
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Assumes `df` is a pandas DataFrame with columns num_awards, math_score, and prog.

# 1. Fit Poisson GLM
model = smf.glm("num_awards ~ math_score + prog", data=df,
                family=sm.families.Poisson()).fit()
print(model.summary())

# 2. Rate Ratios (exponentiated coefficients)
print("\n--- Rate Ratios ---")
print(np.exp(model.params))

# 3. Check Overdispersion (Pearson chi-square / residual df; 1.5 is a rule-of-thumb cutoff)
dispersion = model.pearson_chi2 / model.df_resid
print(f"\nDispersion Statistic: {dispersion:.3f}")
if dispersion > 1.5:
    print("Warning: Overdispersion detected. Consider Negative Binomial.")
R Implementation
# 1. Fit Poisson GLM
model <- glm(num_awards ~ math_score + prog, data = df, family = poisson)
summary(model)
# 2. Rate Ratios
exp(coef(model))
exp(confint(model))
# 3. Check Overdispersion
# Residual Deviance / Residual DF should be ~ 1
dispersion <- deviance(model) / df.residual(model)
cat("Dispersion:", dispersion, "\n")
if (dispersion > 1.5) {
  cat("Use MASS::glm.nb() for Negative Binomial\n")
}
Worked Numerical Example
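Scenario: Predict daily page clicks based on Ad Spend (USD).
Model: Poisson Regression
Results:
- Intercept ($\beta_0$) = 4.6 (Baseline log-count)
- Ad Spend coefficient ($\beta_1$) = 0.002
Calculations:
- Baseline Clicks (Spend = 0): $e^{4.6} \approx 99.5$ clicks.
- Effect of a 100-dollar Spend:
  - Multiplier = $e^{0.002 \times 100} = e^{0.2} \approx 1.22$.
  - Expected Clicks = $e^{4.6 + 0.2} = e^{4.8} \approx 121.5$.
Interpretation: Spending 100 dollars multiplies expected clicks by about 1.22, a 22% increase over the baseline.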
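The arithmetic in this example can be reproduced in a few lines. The sketch uses the coefficients stated above (4.6 and 0.002) and a spend of 100 dollars; the variable names are illustrative.

```python
import math

b0, b1 = 4.6, 0.002   # intercept and ad-spend coefficient from the worked example
spend = 100           # dollars

baseline = math.exp(b0)             # expected clicks at zero spend
multiplier = math.exp(b1 * spend)   # rate ratio for a 100-dollar increase
expected = baseline * multiplier    # identical to exp(b0 + b1 * spend)

print(f"baseline clicks: {baseline:.1f}")
print(f"multiplier:      {multiplier:.3f}")
print(f"expected clicks: {expected:.1f}")
```

Note that the multiplier applies to any baseline: going from 100 to 200 dollars of spend scales the expected count by the same factor again.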
Interpretation Guide
| Output | Example | Interpretation | Edge Case Notes |
|---|---|---|---|
| Coef (X) | 0.07 | Log of expected count increases by 0.07 per unit. | Hard to interpret directly; use RR. |
| RR (X) | 1.07 | Each one-unit increase in X multiplies the expected count by 1.07 (a 7% increase). | If RR < 1, the expected count decreases (e.g., 0.8 = 20% drop). |
| Dispersion | 1.0 | Perfect equidispersion (Mean = Variance). | Ideal Poisson case. |
| Dispersion | 2.3 | Overdispersion. Variance > Mean. | Standard errors are underestimated. Switch to Negative Binomial Regression. |
| Dispersion | 0.5 | Underdispersion. Variance < Mean. | Rare. Can arise from zero-truncated data or a constrained process. |
Common Pitfall Example
Scenario: Modeling number of fish caught by fishermen.
Data: Mean = 5, Variance = 25 (Variance >> Mean).
The Error:
- Analyst fits Poisson.
- Finds a coefficient $\beta$ significant ($p < 0.05$).
- Reports results.
The Problem:
- Poisson assumes Mean = Variance.
- With variance 5x the mean, true standard errors should be about $\sqrt{5} \approx 2.24$ times larger.
- The reported p-value is far too optimistic.
Correction:
- Use Quasipoisson or Negative Binomial.
- Corrected p-value might be 0.06 (not significant!).
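The size of the correction can be sketched numerically. A quasi-Poisson style adjustment multiplies each standard error by the square root of the dispersion. The coefficient and naive standard error below are hypothetical, only the mean and variance come from the scenario, and the marginal variance-to-mean ratio is used as a rough stand-in for the model-based dispersion statistic. A normal reference distribution is assumed for the p-values.

```python
import math

def two_sided_p(z: float) -> float:
    """Two-sided p-value for a z statistic under a standard normal."""
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical Poisson output (not from the scenario).
beta, naive_se = 0.42, 0.10

# Scenario values: mean = 5, variance = 25, so variance / mean = 5.
dispersion = 25 / 5

# Inflate the standard error by sqrt(dispersion).
corrected_se = naive_se * math.sqrt(dispersion)

p_naive = two_sided_p(beta / naive_se)
p_corrected = two_sided_p(beta / corrected_se)

print(f"naive:     SE = {naive_se:.3f}, p = {p_naive:.5f}")
print(f"corrected: SE = {corrected_se:.3f}, p = {p_corrected:.3f}")
```

With these made-up inputs, a result that looked overwhelmingly significant under Poisson lands just above the 0.05 threshold once the standard error is inflated, while the coefficient estimate itself is unchanged.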
Related Concepts
- Negative Binomial Regression - Handles overdispersion.
- Zero-Inflated Models - Handles excess zeros.
- Generalized Linear Models (GLM)
- Maximum Likelihood Estimation (MLE)