Binary Logistic Regression
Definition
Binary Logistic Regression is the standard model for a binary (dichotomous) outcome variable (e.g., Yes/No, Success/Failure, 0/1). It estimates the probability that the outcome equals 1 as a function of one or more predictor variables.
Purpose
- Predict the probability of a binary event.
- Understand which factors increase or decrease the likelihood of the event via Odds Ratios.
- Classify observations into one of two groups based on a probability threshold.
When to Use
- Outcome is binary (two mutually exclusive categories).
- Predictors can be continuous, categorical, or a mix.
- You need interpretable effects in terms of Odds Ratios.
When Not to Use
- Outcome has more than 2 categories (use Multinomial Logistic Regression (MNLogit) or Ordinal Logistic Regression).
- Outcome is a count (use Poisson Regression).
Theoretical Background
The Model
The probability of the event is modeled as:

$$P(Y = 1 \mid X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \dots + \beta_k X_k)}}$$

This is the Sigmoid Function, which squashes the linear predictor into the range $(0, 1)$.
Logit Form (Link Function)
Equivalently, the model is linear on the log-odds scale:

$$\text{logit}(p) = \ln\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 X_1 + \dots + \beta_k X_k$$
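A minimal sketch of these two forms in Python (the coefficient values below are hypothetical), showing that the logit is the inverse of the sigmoid:

```python
import numpy as np

def sigmoid(z):
    """Map a linear predictor (log-odds) to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    """Map a probability back to the log-odds scale."""
    return np.log(p / (1.0 - p))

# Hypothetical coefficients: intercept -2.0, slope 0.05 for a single predictor x
beta0, beta1 = -2.0, 0.05
x = 30.0
z = beta0 + beta1 * x            # linear predictor (log-odds)
p = sigmoid(z)                   # P(Y = 1 | x), about 0.38
print(p)
print(np.isclose(logit(p), z))   # True: the logit undoes the sigmoid
```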
Coefficients and Odds Ratios
| Term | Meaning |
|---|---|
| $\beta_j$ | Change in log-odds for a 1-unit increase in $X_j$. |
| $e^{\beta_j}$ | Odds Ratio (OR): multiplicative change in odds for a 1-unit increase in $X_j$. |
Example:
If $\beta_{\text{Age}} = 0.05$, then $OR = e^{0.05} \approx 1.05$:
"For every additional year of age, the odds of the event increase by 5%."
Assumptions
- The outcome is binary and the observations are independent.
- The log-odds (logit) is a linear function of continuous predictors.
- No severe multicollinearity among predictors.
- Adequate sample size: a common rule of thumb is at least ~10 events per predictor variable (EPV).

Example: if you have only 50 events (Y=1) and 10 predictors, EPV = 5, which is dangerously low. Coefficients will be unstable and p-values unreliable. Reduce the number of predictors or gather more data.
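A one-line version of the EPV check from that example (numbers taken from the text):

```python
# Events-per-variable (EPV) check; numbers come from the example above
n_events = 50         # count of Y = 1 cases
n_predictors = 10     # candidate predictors in the model
print(n_events / n_predictors)   # 5.0, well below the ~10 EPV rule of thumb
```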
Limitations
- Odds Ratio ≠ Relative Risk: the OR overestimates the Relative Risk when the outcome is common (>10%); for rare outcomes, OR ≈ RR.
- Perfect Separation: if a predictor perfectly predicts the outcome (e.g., all Y=1 when X>5), MLE fails and coefficients diverge toward infinity (see the sketch after this list).
- Threshold Selection: classification accuracy depends on the chosen probability threshold (default 0.5), which may not be the best cut-off for balancing sensitivity and specificity, especially with imbalanced classes.
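A small illustration of perfect separation on toy data where X > 5 perfectly predicts Y = 1; depending on the statsmodels version, the fit either errors out or reports extreme, unstable coefficients:

```python
import numpy as np
import statsmodels.api as sm

# Toy data: Y = 1 exactly when X > 5, so X separates the classes perfectly
X = np.arange(1, 11, dtype=float)
y = (X > 5).astype(int)
X_design = sm.add_constant(X)

try:
    res = sm.Logit(y, X_design).fit()
    print(res.params)   # if a fit is returned at all, the slope is huge/unstable
except Exception as err:
    print("MLE failed due to perfect separation:", err)
```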
Model Evaluation
Unlike OLS ($R^2$), there is no single agreed-upon goodness-of-fit statistic for logistic regression; use several complementary metrics, illustrated in the sketch after the table:
| Metric | Purpose |
|---|---|
| Pseudo $R^2$ | Explained-variance analog; values of 0.2-0.4 are considered excellent. |
| ROC & AUC | Discrimination ability (area under the ROC curve); 0.7-0.8 acceptable, >0.8 excellent. |
| Confusion Matrix | Accuracy, Precision, Recall at a given threshold. |
| Hosmer-Lemeshow Test | Calibration: do predicted probabilities match observed rates? |
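A minimal, self-contained sketch of these metrics using scikit-learn; the labels and probabilities below are simulated for illustration, not taken from the text:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, confusion_matrix, accuracy_score

rng = np.random.default_rng(0)

# Simulated true labels and predicted probabilities standing in for model output
y_true = rng.integers(0, 2, size=200)
y_prob = np.clip(0.5 * y_true + rng.uniform(0, 0.7, size=200), 0, 1)

print("AUC:", roc_auc_score(y_true, y_prob))            # threshold-free discrimination

y_pred = (y_prob >= 0.5).astype(int)                    # classify at the 0.5 threshold
print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
print("Accuracy:", accuracy_score(y_true, y_pred))
```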
Python Implementation
import statsmodels.api as sm
import numpy as np
import pandas as pd
# 1. Fit Model (assumes df contains Age, Income, and a 0/1 Purchased column)
X = sm.add_constant(df[['Age', 'Income']])   # add intercept column
y = df['Purchased']
model = sm.Logit(y, X).fit()
print(model.summary())
# 2. Odds Ratios with Confidence Intervals
conf = model.conf_int()                      # 95% CI on the log-odds scale
conf['Log-Odds'] = model.params
conf.columns = ['2.5%', '97.5%', 'Log-Odds']
conf['Odds Ratio'] = np.exp(conf['Log-Odds'])
conf['OR Lower'] = np.exp(conf['2.5%'])
conf['OR Upper'] = np.exp(conf['97.5%'])
print(conf[['Odds Ratio', 'OR Lower', 'OR Upper']])
# 3. Predict Probabilities
df['pred_prob'] = model.predict(X)
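The snippet above assumes a DataFrame df already exists; here is a fully self-contained variant on simulated data (the column names mirror the example, but the data-generating coefficients are made up):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 500
df = pd.DataFrame({
    'Age': rng.uniform(18, 70, n),
    'Income': rng.normal(50, 15, n),
})
# Simulate a 0/1 outcome whose log-odds depend on Age and Income (made-up values)
log_odds = -4 + 0.05 * df['Age'] + 0.03 * df['Income']
df['Purchased'] = rng.binomial(1, 1 / (1 + np.exp(-log_odds)))

X = sm.add_constant(df[['Age', 'Income']])
model = sm.Logit(df['Purchased'], X).fit(disp=0)
df['pred_prob'] = model.predict(X)
df['pred_class'] = (df['pred_prob'] >= 0.5).astype(int)   # default 0.5 threshold
print(np.exp(model.params))                               # odds ratios
```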
R Implementation
# 1. Fit Model (GLM with Binomial Family)
model <- glm(Purchased ~ Age + Income, data = df, family = "binomial")
summary(model)
# 2. Odds Ratios with CI
exp(cbind(OR = coef(model), confint(model)))
# 3. Predict Probabilities
df$pred_prob <- predict(model, type = "response")
# 4. Hosmer-Lemeshow Test
library(ResourceSelection)
hoslem.test(df$Purchased, fitted(model), g = 10)
Worked Numerical Example
Scenario: Predict 10-year risk of heart disease (1=Yes, 0=No).
Predictor: Smoker (1=Yes, 0=No)
Results:
- $\beta_{\text{Smoker}} = 0.85$, p-value < 0.001
Calculations:
- Log-Odds: being a smoker increases the log-odds of heart disease by 0.85.
- Odds Ratio (OR): $e^{0.85} \approx 2.34$. "Smokers have 2.34 times higher odds of heart disease compared to non-smokers."
Probability Prediction (Intercept $\beta_0 = -2.0$):
- Non-Smoker ($X = 0$): $p = \frac{1}{1 + e^{2.0}} \approx 0.12$ (12% risk)
- Smoker ($X = 1$): $p = \frac{1}{1 + e^{-(-2.0 + 0.85)}} = \frac{1}{1 + e^{1.15}} \approx 0.24$ (24% risk)
Key Insight: the OR stays constant (2.34), but the absolute risk depends on the baseline risk (verified in the sketch below).
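A quick check of this arithmetic:

```python
import numpy as np

beta0, beta_smoker = -2.0, 0.85                        # intercept and smoker coefficient
print(np.exp(beta_smoker))                             # OR ~2.34

p_nonsmoker = 1 / (1 + np.exp(-beta0))                 # ~0.12
p_smoker = 1 / (1 + np.exp(-(beta0 + beta_smoker)))    # ~0.24
print(p_nonsmoker, p_smoker)
print(p_smoker / p_nonsmoker)                          # relative risk ~2.0, not 2.34
```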
Interpretation Guide
| Output | Example | Interpretation | Edge Case Notes |
|---|---|---|---|
| Coef (Age) | 0.03 | Each additional year increases the log-odds by 0.03. | Positive coefficient: the event becomes more likely as Age increases. |
| OR (Age) | 1.03 | Each additional year increases the odds of the event by 3%. | If the 95% CI for the OR includes 1, the effect is not statistically significant (see the sketch after this table). |
| OR | 0.50 | Exposure halves the odds of the event (Protection). | |
| OR | 25.0 | Extremely large effect or Perfect Separation. | Check if predictor perfectly predicts outcome (e.g., all X>10 have Y=1). |
| Predicted Prob | 0.72 | Model predicts 72% chance of event. | If using threshold 0.5, classify as Y=1. |
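To make the confidence-interval rule in the table concrete, a small sketch with a hypothetical coefficient and standard error for Age:

```python
import numpy as np

# Hypothetical log-odds coefficient and standard error for Age
beta, se = 0.03, 0.02
or_point = np.exp(beta)
or_low, or_high = np.exp(beta - 1.96 * se), np.exp(beta + 1.96 * se)
print(or_point, (or_low, or_high))   # ~1.03 with 95% CI ~(0.99, 1.07)
print(or_low <= 1 <= or_high)        # True -> CI includes 1, effect not significant
```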
Common Pitfall Example
Mistake: Interpreting Odds Ratios (OR) as Relative Risk (RR).
Scenario:
- Study on a common outcome (e.g., "Recovery", rate = 50%).
- Treatment OR = 2.0.
Incorrect Interpretation: "Treatment doubles the probability of recovery." (Implies Risk Ratio = 2.0).
Reality:
- Baseline Risk = 50%, so baseline odds = 0.5 / 0.5 = 1.0.
- Treatment odds = 1.0 × 2.0 = 2.0, so treatment risk = 2.0 / (1 + 2.0) ≈ 67%.
- Actual Risk Ratio: 0.67 / 0.50 ≈ 1.34.
- The OR (2.0) greatly exaggerates the RR (1.34) because the outcome is common; the sketch below reproduces this calculation.
Rule: OR ≈ RR only when the outcome is rare (<10%).
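A minimal numeric sketch of this OR-vs-RR gap, using the numbers from the scenario above:

```python
baseline_risk = 0.50
odds_ratio = 2.0

baseline_odds = baseline_risk / (1 - baseline_risk)     # 1.0
treatment_odds = baseline_odds * odds_ratio             # 2.0
treatment_risk = treatment_odds / (1 + treatment_odds)  # ~0.667
risk_ratio = treatment_risk / baseline_risk             # ~1.33
print(treatment_risk, risk_ratio)   # the OR (2.0) overstates the RR (~1.33)
```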
Related Concepts
- Logistic Regression - Family overview.
- Maximum Likelihood Estimation (MLE) - Estimation method used to fit the coefficients.
- ROC & AUC - Performance metric.
- Confusion Matrix - Classification metrics.
- Probit Regression - Alternative using normal CDF.