Simple Linear Regression
Definition
Simple Linear Regression (SLR) models the linear relationship between a single continuous independent variable ($X$) and a continuous dependent variable ($Y$) by fitting a straight line to the observed data.
Purpose
- Prediction: Estimate $Y$ for a given $X$.
- Explanation: Quantify the strength and direction of the association between $X$ and $Y$.
- Testing: Determine if the relationship is statistically significant ($\beta_1 \neq 0$).
When to Use
- You have one continuous predictor and one continuous outcome.
- You believe the relationship is linear.
- You want to understand/test the bivariate association.
When Not to Use
- You have multiple predictors (use Multiple Linear Regression).
- The outcome is categorical (use Logistic Regression).
- The relationship is clearly non-linear (consider polynomial regression or a GAM).
Theoretical Background
The Model Equation

$$Y = \beta_0 + \beta_1 X + \varepsilon$$

| Term | Name | Interpretation |
|---|---|---|
| $Y$ | Dependent Variable | The outcome we are predicting. |
| $X$ | Independent Variable | The predictor. |
| $\beta_0$ | Intercept | Expected value of $Y$ when $X = 0$. |
| $\beta_1$ | Slope | Change in $Y$ for a one-unit increase in $X$. |
| $\varepsilon$ | Error Term | Random noise; captures unexplained variation. |
OLS Estimation

The Ordinary Least Squares (OLS) method finds the estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ that minimize the sum of squared residuals, $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$.

Closed-Form Solutions:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$
Coefficient of Determination ($R^2$)

$$R^2 = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2}$$

Interpretation: The proportion of variance in $Y$ explained by $X$. For example, $R^2 = 0.70$ means 70% of the variability in $Y$ is explained by $X$.
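The closed-form estimates and $R^2$ above can be verified directly with a few lines of NumPy (the numbers here are toy values chosen only for illustration):

```python
import numpy as np

# Toy data: a nearly linear relationship (illustrative only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

x_bar, y_bar = x.mean(), y.mean()

# Closed-form OLS estimates
beta1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0 = y_bar - beta1 * x_bar

# R^2 = 1 - SS_residual / SS_total
y_hat = beta0 + beta1 * x
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y_bar) ** 2)
r_squared = 1 - ss_res / ss_tot

print(beta0, beta1, r_squared)  # intercept ≈ 0.09, slope ≈ 1.97, R^2 close to 1
```

The same estimates fall out of `np.polyfit(x, y, 1)`, which is a quick way to cross-check a hand calculation.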
Assumptions (LINE)
Limitations
- Correlation $\neq$ Causation: SLR identifies association, not causation. Confounders may exist.
- Sensitive to Outliers: Extreme points can disproportionately influence the slope and $R^2$. Check Cook's Distance.
- Extrapolation is Dangerous: Predicting $Y$ for $X$ values outside the observed range is unreliable.
- Omitted Variable Bias: If a relevant variable is missing, $\hat{\beta}_1$ may be biased.
Python Implementation
```python
import statsmodels.api as sm
import pandas as pd
import matplotlib.pyplot as plt

# Illustrative placeholder data; replace with your own DataFrame
df = pd.DataFrame({
    'Ad_Spend': [100, 150, 200, 250, 300, 350, 400, 450, 500, 550],
    'Sales':    [1260, 1370, 1490, 1640, 1740, 1880, 2010, 2120, 2260, 2380],
})

# 1. Prepare Data
X = df['Ad_Spend']
y = df['Sales']
X = sm.add_constant(X)  # Add intercept column

# 2. Fit Model
model = sm.OLS(y, X).fit()

# 3. Results
print(model.summary())

# 4. Visualization
plt.scatter(df['Ad_Spend'], df['Sales'], alpha=0.6)
plt.plot(df['Ad_Spend'], model.predict(X), color='red', linewidth=2)
plt.xlabel("Ad Spend ($)")
plt.ylabel("Sales ($)")
plt.title(f"SLR: R-squared = {model.rsquared:.3f}")
plt.show()

# 5. Diagnostics (plot_regress_exog draws its own 2x2 grid of diagnostic plots)
fig = plt.figure(figsize=(12, 8))
sm.graphics.plot_regress_exog(model, 'Ad_Spend', fig=fig)
plt.show()
```
R Implementation
```r
# df is assumed to be a data frame with Sales and Ad_Spend columns

# 1. Fit Model
model <- lm(Sales ~ Ad_Spend, data = df)

# 2. Results
summary(model)

# 3. Confidence Intervals for Coefficients
confint(model)

# 4. Diagnostics (Standard R Plots)
par(mfrow = c(2, 2))
plot(model)
# Plot 1: Residuals vs Fitted (Linearity, Homoscedasticity)
# Plot 2: Normal Q-Q (Normality of Residuals)
# Plot 3: Scale-Location (Homoscedasticity)
# Plot 4: Residuals vs Leverage (Influential Points)

# 5. Prediction
new_data <- data.frame(Ad_Spend = c(100, 200, 300))
predict(model, newdata = new_data, interval = "confidence")
```
Worked Numerical Example
Data: 10 months of paired Ad Spend and Sales observations.
Result: Sales = 1000 + 2.5(Ad_Spend)
Interpretation:
- Intercept (1000): If Ad Spend is $0, we expect $1000 in baseline sales (brand awareness, organic).
- Slope (2.5): For every additional $1 spent on ads, Sales increase by $2.50 on average.
Prediction:
- If Ad Spend = $500:
- Sales = 1000 + 2.5(500) = 1000 + 1250 = $2250.
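The arithmetic above can be checked with a one-line helper (the coefficients are taken from the worked example, not from a real fit):

```python
def predict_sales(ad_spend, intercept=1000.0, slope=2.5):
    """Fitted line from the worked example: Sales = 1000 + 2.5 * Ad_Spend."""
    return intercept + slope * ad_spend

print(predict_sales(500))  # 1000 + 2.5 * 500 = 2250.0
```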
Interpretation Guide
| Output | Example Value | Interpretation | Edge Case Notes |
|---|---|---|---|
| Slope ($\hat{\beta}_1$) | 2.5 | Positive association; $Y$ increases with $X$. | If 0, no linear relationship. |
| Intercept ($\hat{\beta}_0$) | -50 | Value of $Y$ when $X = 0$. | Negative sales? Impossible. The intercept may have no physical meaning if $X = 0$ is outside the data range. |
| p-value (slope) | 0.001 | Slope is significantly different from 0. | Does not mean the relationship is strong, just reliable. |
| R-squared | 0.65 | 65% of the variance in Sales is explained. | 35% is noise/error. |
| R-squared | 0.01 | X explains almost nothing about Y. | Check for a non-linear U-shape! |
Common Pitfall Example
Scenario: Modeling child height vs. age (ages 2-10).
Model: Height = 80 + 6(Age), in cm. (Fits the observed data well.)
Prediction Error:
- Predict height for a 30-year-old:
- Height = 80 + 6(30) = 260 cm (8.5 feet!).
Lesson: Linear trends rarely continue forever. Never extrapolate far outside the range of your training data.
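The failure mode is easy to demonstrate in code (the coefficients come from the toy example above):

```python
def predict_height_cm(age_years):
    """Toy model from the example above: Height = 80 + 6 * Age, fit on ages 2-10."""
    return 80 + 6 * age_years

print(predict_height_cm(8))   # inside the training range: 128 cm, plausible
print(predict_height_cm(30))  # far outside the range: 260 cm, clearly absurd
```

A production version of such a model would typically warn or refuse when the input falls outside the range the model was trained on.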
Related Concepts
- Multiple Linear Regression - More than one predictor.
- Pearson Correlation - Related but different (correlation $\neq$ slope).
- Residual Analysis - Checking assumptions.
- Cook's Distance - Identifying influential observations.