Gradient Descent
Definition
Core Statement
Gradient Descent is an iterative first-order optimization algorithm used to find a local minimum of a differentiable function. It takes steps proportional to the negative of the gradient (slope) of the function at the current point.
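In symbols, one update of the parameters θ with learning rate α is:

θ_new = θ_old - α ∇J(θ_old)

where ∇J(θ) is the gradient of the loss J with respect to θ.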
Purpose
- Minimize Loss: Used to train ML models (Linear Regression, Neural Networks) by minimizing Mean Squared Error or Cross-Entropy (see the sketch after this list).
- Parameter Tuning: Adjusts weights and biases until the model fits the data.
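As a concrete example of the loss-minimization bullet above, a minimal sketch of gradient descent minimizing cross-entropy for a toy logistic-regression model (the data, learning rate, and iteration count are made up for illustration):

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)   # toy binary labels

w, b = np.zeros(2), 0.0
lr = 0.5
for _ in range(500):
    p = 1 / (1 + np.exp(-(X @ w + b)))      # sigmoid predictions
    grad_w = X.T @ (p - y) / len(y)         # gradient of mean cross-entropy w.r.t. w
    grad_b = np.mean(p - y)                 # ... and w.r.t. b
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)   # w points roughly along [1, 1]; b stays near 0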
Intuition: The Hiker in the Fog
Imagine you are on a mountain at night (foggy). You want to reach the lowest valley (minimum loss).
- Check Slope: You feel the ground with your foot to see which way is "down". (Calculate Gradient).
- Take a Step: You take a step in the downhill direction. (Update weights).
- Step Size: If you take tiny steps, you'll never get there. If you jump, you might fall off a cliff. (Learning Rate).
- Repeat: Keep doing this until the ground is flat (Convergence).
Key Parameters
1. Learning Rate (α)
The size of the step.
- Too Small: Convergence takes forever.
- Too Large: You overshoot the minimum and diverge (explode).
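Both failure modes can be reproduced on the same y = x^2 toy problem used in the implementation below (the learning-rate values are chosen purely for illustration):

def gradient(x):
    return 2 * x   # derivative of y = x^2

def run_gd(lr, x=10.0, steps=20):
    for _ in range(steps):
        x = x - lr * gradient(x)
    return x

print(run_gd(lr=0.001))  # too small: x barely moves from 10 toward 0
print(run_gd(lr=0.1))    # reasonable: x shrinks toward 0
print(run_gd(lr=1.1))    # too large: |x| grows every step and diverges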
2. The Gradient (∇)
The vector of partial derivatives. It points "uphill". We subtract it to go "downhill".
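For a function of two parameters, say f(w, b) = w^2 + 3b^2, the gradient stacks the partial derivatives, and subtracting a scaled copy of it lowers f (a minimal sketch; the function is made up for illustration):

import numpy as np

def f(params):
    w, b = params
    return w**2 + 3 * b**2

def grad_f(params):
    w, b = params
    return np.array([2 * w, 6 * b])   # [df/dw, df/db]

p = np.array([1.0, 2.0])
print(grad_f(p))               # [ 2. 12.], the uphill direction
p_new = p - 0.1 * grad_f(p)
print(f(p), f(p_new))          # 13.0 vs ~2.56: the step went downhill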
Variants
| Variant | Description | Pros | Cons |
|---|---|---|---|
| Batch GD | Uses all data for one step. | Stable convergence. | Slow; memory intensive. |
| Stochastic GD (SGD) | Uses one random sample per step. | Fast; noise can help escape shallow local minima. | Noisy/jittery path. |
| Mini-Batch GD | Uses a small batch (e.g., 32 samples). | Balances the speed of SGD with the stability of Batch GD. | Requires tuning the batch size. |
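In code, the three variants differ only in how many rows feed each gradient step. A sketch using Mini-Batch GD on synthetic linear-regression data (the data, learning rate, and batch size are illustrative):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=1000)

def mse_grad(w, X_batch, y_batch):
    err = X_batch @ w - y_batch
    return 2 * X_batch.T @ err / len(y_batch)   # gradient of MSE for a linear model

w = np.zeros(3)
lr, batch_size = 0.1, 32
for epoch in range(20):
    idx = rng.permutation(len(y))               # reshuffle each epoch
    for start in range(0, len(y), batch_size):
        batch = idx[start:start + batch_size]   # Mini-Batch GD: 32 rows per update
        w -= lr * mse_grad(w, X[batch], y[batch])
# Batch GD would use every row per update; SGD would set batch_size = 1.
print(w)   # should land close to true_w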
Limitations & Pitfalls
Pitfalls
- Local Minima: On non-convex loss surfaces (Neural Nets), descent can settle in a shallow valley instead of the deepest one (the Global Minimum); this is part of the motivation for momentum and Adam (see the sketch after this list).
- Saddle Points: Points where the gradient is zero but which are not minima (the surface curves up in some directions and down in others); progress can also stall on nearby flat plateaus.
- Scaling: If features are on very different scales (Age vs Income), the loss surface becomes a narrow ravine, the updates zig-zag, and descent is slow. Always scale features first.
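A tiny illustration of the local-minima pitfall, using a made-up one-dimensional non-convex curve: f(x) = (x^2 - 1)^2 + 0.3x has its global minimum near x ≈ -1.03 and a shallower local minimum near x ≈ 0.96. Started on the right-hand slope, plain gradient descent settles in the shallow valley:

def df(x):
    return 4 * x**3 - 4 * x + 0.3   # derivative of (x^2 - 1)^2 + 0.3x

x = 2.0                  # start to the right of both valleys
for _ in range(200):
    x -= 0.01 * df(x)
print(x)   # ends near 0.96 (the local minimum), never reaching -1.03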
Python Implementation
import numpy as np
import matplotlib.pyplot as plt

# Example: manual GD for y = x^2 (minimum is at x = 0)
def cost_func(x):
    return x**2

def gradient(x):
    return 2*x  # derivative of x^2

x_start = 10
learning_rate = 0.1
n_iterations = 20

x_path = [x_start]
x = x_start
for i in range(n_iterations):
    grad = gradient(x)
    x = x - learning_rate * grad  # step against the gradient
    x_path.append(x)

print(f"Final x: {x:.4f}")  # should be near 0

plt.plot(x_path, 'o-')
plt.title("Path to Minimum")
plt.xlabel("Iteration")
plt.ylabel("Value of x")
plt.show()
Related Concepts
- Backpropagation - Using the chain rule to calculate gradients in Neural Nets.
- Loss Function - The function we are minimizing.
- Neural Networks - Heavy users of GD.
- Feature Scaling - Critical prerequisite.