Binary Logistic Regression

Definition

Core Statement

Binary Logistic Regression is the standard model for a binary (dichotomous) outcome variable (e.g., Yes/No, Success/Failure, 0/1). It estimates the probability P(Y=1|X) using the logistic (sigmoid) function.


Purpose

  1. Predict the probability of a binary event.
  2. Understand which factors increase or decrease the likelihood of the event via Odds Ratios.
  3. Classify observations into one of two groups based on a probability threshold.

When to Use

Use Binary Logistic Regression When...

  • Outcome is binary (two mutually exclusive categories).
  • Predictors can be continuous, categorical, or a mix.
  • You need interpretable effects in terms of Odds Ratios.

Do NOT Use When...

  • Outcome has more than two categories (use Multinomial or Ordinal Logistic Regression instead).
  • Observations are not independent (e.g., repeated measures or clustered data; use mixed-effects or GEE models).
  • Events are extremely sparse relative to predictors (see EPV below); consider penalized or exact methods.

Theoretical Background

The Model

P(Y=1|X) = 1 / (1 + e^-(β0 + β1X1 + ...)) = e^(β0 + β1X1 + ...) / (1 + e^(β0 + β1X1 + ...))

This is the Sigmoid Function, which squashes the linear predictor into the range (0, 1).

logit(P) = ln(P / (1 − P)) = β0 + β1X1 + ...
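A minimal numerical check (using NumPy) that the sigmoid and the logit are inverses of each other:

```python
import numpy as np

def sigmoid(z):
    # Squash the linear predictor into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    # Log-odds: the inverse of the sigmoid
    return np.log(p / (1.0 - p))

z = 1.5                    # example linear predictor beta0 + beta1*x1 + ...
p = sigmoid(z)             # probability P(Y=1|X)
print(round(p, 4))         # 0.8176
print(round(logit(p), 4))  # recovers z = 1.5
```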

Coefficients and Odds Ratios

| Term | Meaning |
| ---- | ------- |
| βj | Change in log-odds for a 1-unit increase in Xj |
| e^βj | Odds Ratio (OR): multiplicative change in odds for a 1-unit increase in Xj |

Example:
If βage = 0.05, then OR = e^0.05 ≈ 1.05.
"For every additional year of age, the odds of the event increase by about 5%."
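That conversion is just exponentiation; a quick numerical check with NumPy:

```python
import numpy as np

beta_age = 0.05
odds_ratio = np.exp(beta_age)           # e^0.05
print(round(odds_ratio, 4))             # 1.0513 -> odds rise ~5% per year

# A k-unit increase multiplies the odds by e^(k * beta)
print(round(np.exp(10 * beta_age), 4))  # 10 extra years: odds x ~1.65
```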


Assumptions

Events Per Variable (EPV)

A common rule of thumb requires EPV ≥ 10. If you have only 50 events (Y=1) and 10 predictors, you have EPV = 50/10 = 5, which is dangerously low. Coefficients will be unstable and p-values unreliable. Reduce predictors or gather more data.
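The EPV check is a one-liner; a sketch assuming a DataFrame `df` with a binary outcome column (names and counts are illustrative):

```python
import pandas as pd

# Hypothetical data: 500 observations, 50 events, 10 candidate predictors
df = pd.DataFrame({'Purchased': [1] * 50 + [0] * 450})
n_predictors = 10

n_events = int(df['Purchased'].sum())  # count of Y = 1
epv = n_events / n_predictors
print(epv)                             # 5.0
if epv < 10:
    print("Warning: EPV < 10 -- coefficient estimates may be unstable")
```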


Limitations

Pitfalls

  1. Odds Ratio ≠ Relative Risk: The OR overstates the Relative Risk when the outcome is common (>10%). For rare outcomes, OR ≈ RR.
  2. Perfect Separation: If a predictor perfectly predicts the outcome (e.g., all Y=1 when X>5), maximum likelihood estimation fails (coefficients diverge to infinity).
  3. Threshold Selection: Classification accuracy depends on the chosen probability threshold (default 0.5), which is often suboptimal for imbalanced classes; AUC itself is threshold-independent, so tune the threshold against sensitivity and specificity instead.
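Pitfall 3 can be illustrated by scanning thresholds for Youden's J (sensitivity + specificity − 1) rather than defaulting to 0.5; the synthetic scores below are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, imbalanced setup: events (y=1) occur ~10% of the time
# and receive higher predicted probabilities on average.
y = (rng.random(2000) < 0.10).astype(int)
scores = np.clip(0.08 + 0.40 * y + 0.15 * rng.standard_normal(2000), 0.001, 0.999)

def youden(t):
    # Youden's J = sensitivity + specificity - 1 at threshold t
    pred = scores >= t
    sensitivity = pred[y == 1].mean()
    specificity = (~pred)[y == 0].mean()
    return sensitivity + specificity - 1

thresholds = np.linspace(0.01, 0.99, 99)
best = thresholds[np.argmax([youden(t) for t in thresholds])]
print(round(float(best), 2))  # lands well below the default 0.5 here
```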


Model Evaluation

Unlike OLS regression with its R², Logistic Regression uses different metrics:

| Metric | Purpose |
| ------ | ------- |
| Pseudo R² (McFadden's) | Explained-variance analog; 0.2–0.4 indicates excellent fit. |
| ROC & AUC | Discrimination ability (area under the curve); 0.7–0.8 acceptable, >0.8 excellent. |
| Confusion Matrix | Accuracy, Precision, Recall at a given threshold. |
| Hosmer-Lemeshow Test | Calibration: do predicted probabilities match observed rates? |

Python Implementation

import statsmodels.api as sm
import numpy as np
import pandas as pd

# df: DataFrame with predictor columns 'Age', 'Income' and binary outcome 'Purchased'

# 1. Fit Model
X = sm.add_constant(df[['Age', 'Income']])
y = df['Purchased']

model = sm.Logit(y, X).fit()
print(model.summary())

# 2. Odds Ratios with Confidence Intervals (exponentiate log-odds and CI bounds)
conf = model.conf_int()
conf.columns = ['2.5%', '97.5%']
odds = pd.DataFrame({
    'Odds Ratio': np.exp(model.params),
    'OR Lower': np.exp(conf['2.5%']),
    'OR Upper': np.exp(conf['97.5%']),
})
print(odds)

# 3. Predict Probabilities
df['pred_prob'] = model.predict(X)

R Implementation

# 1. Fit Model (GLM with Binomial Family)
model <- glm(Purchased ~ Age + Income, data = df, family = "binomial")
summary(model)

# 2. Odds Ratios with CI
exp(cbind(OR = coef(model), confint(model)))

# 3. Predict Probabilities
df$pred_prob <- predict(model, type = "response")

# 4. Hosmer-Lemeshow Test
library(ResourceSelection)
hoslem.test(df$Purchased, fitted(model), g = 10)

Worked Numerical Example

Heart Disease Prediction

Scenario: Predict 10-year risk of heart disease (1=Yes, 0=No).
Predictor: Smoker (1=Yes, 0=No)

Results:

  • βsmoker = 0.85
  • p-value < 0.001

Calculations:

  • Log-Odds: Being a smoker increases the log-odds of heart disease by 0.85.
  • Odds Ratio (OR): e^0.85 ≈ 2.34.
    • "Smokers have 2.34 times higher odds of heart disease compared to non-smokers."

Probability Prediction (Intercept β0 = −2.0):

  • Non-Smoker (X=0): z = −2.0, so P = 1 / (1 + e^2.0) ≈ 0.12 (12% risk)
  • Smoker (X=1): z = −2.0 + 0.85 = −1.15
    • P = 1 / (1 + e^1.15) ≈ 0.24 (24% risk)

Key Insight: OR stays constant (2.34), but absolute risk depends on the baseline.
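The worked example's arithmetic can be double-checked in a few lines, using β0 = -2.0 and βsmoker = 0.85 (consistent with the 12%/24% risks above):

```python
import numpy as np

b0, b_smoker = -2.0, 0.85

def prob(smoker):
    # P(Y=1|X) via the sigmoid of the linear predictor
    z = b0 + b_smoker * smoker
    return 1 / (1 + np.exp(-z))

p0, p1 = prob(0), prob(1)
print(round(p0, 2), round(p1, 2))  # 0.12 0.24

# The odds ratio is constant at e^0.85...
odds0, odds1 = p0 / (1 - p0), p1 / (1 - p1)
print(round(odds1 / odds0, 2))     # 2.34

# ...but the risk ratio depends on the baseline risk
print(round(p1 / p0, 2))           # 2.02
```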


Interpretation Guide

| Output | Example | Interpretation | Edge Case Notes |
| ------ | ------- | -------------- | --------------- |
| Coef (Age) | 0.03 | Each additional year increases log-odds by 0.03. | Positive β implies OR > 1. |
| OR (Age) | 1.03 | Each additional year increases the odds of the event by 3%. | If the CI for the OR contains 1, the effect is not significant. |
| OR | 0.50 | Exposure halves the odds of the event (protective). | 0.50 is equivalent to 2.0 in the opposite direction (1/0.5 = 2). |
| OR | 25.0 | Extremely large effect, or perfect separation. | Check whether the predictor perfectly predicts the outcome (e.g., all X>10 have Y=1). |
| Predicted Prob | 0.72 | Model predicts a 72% chance of the event. | With threshold 0.5, classify as Y=1. |

Common Pitfall Example

The Relative Risk Trap

Mistake: Interpreting Odds Ratios (OR) as Relative Risk (RR).

Scenario:

  • Study on a common outcome (e.g., "Recovery", rate = 50%).
  • Treatment OR = 2.0.

Incorrect Interpretation: "Treatment doubles the probability of recovery." (Implies Risk Ratio = 2.0).

Reality:

  • Baseline Risk = 50%, so baseline Odds = 0.50/0.50 = 1.0.
  • Treatment Odds = 1.0 × 2.0 = 2.0, so treated Risk = 2.0 / (1 + 2.0) ≈ 67%.
  • Actual Risk Ratio: 67% / 50% = 1.34.
  • The OR (2.0) massively exaggerates the RR (1.34) because the outcome is common.

Rule: OR ≈ RR only when the outcome is rare (<10%). Otherwise, report predicted probabilities.
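The OR-to-RR relationship can be made explicit: given baseline risk p0, exact algebra gives RR = OR / (1 − p0 + p0·OR). A quick check against the scenario above:

```python
def or_to_rr(odds_ratio, baseline_risk):
    # Exact conversion given the baseline risk p0:
    # RR = OR / (1 - p0 + p0 * OR)
    return odds_ratio / (1 - baseline_risk + baseline_risk * odds_ratio)

# Common outcome (50% baseline): OR = 2.0 exaggerates the RR
print(round(or_to_rr(2.0, 0.50), 2))  # 1.33, not 2.0

# Rare outcome (1% baseline): OR approximates RR well
print(round(or_to_rr(2.0, 0.01), 2))  # 1.98
```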