Understanding Gradient Descent Visually

An interactive visual guide to gradient descent — the optimization algorithm that powers modern machine learning.

Tags: machine-learning, optimization, deep-learning, visualization


Every neural network you have ever used was trained by gradient descent. GPT, Stable Diffusion, AlphaFold, your company's fraud detector. All of them. The algorithm is old (Cauchy proposed it in 1847) and the core idea fits in one sentence: evaluate the slope of your loss function, then step downhill. Repeat until the loss stops decreasing.

That sentence hides a surprising amount of subtlety. The step size matters enormously. The landscape is rarely a clean bowl. And a naive implementation will crawl through flat regions while oscillating wildly through narrow valleys. This post builds intuition for all three problems with interactive charts you can explore yourself.

The Loss Landscape

Start with the simplest case. A single parameter $\theta$, a quadratic loss $J(\theta) = \theta^2$, and the minimum sitting at $\theta = 0$. The gradient is $2\theta$, so the update rule becomes:

$$\theta_{\text{new}} = \theta_{\text{old}} - \alpha \cdot 2\theta_{\text{old}}$$

where $\alpha$ is the learning rate. That is it. One multiplication, one subtraction, and $\theta$ moves closer to zero.

The interesting question is how it moves. With $\alpha = 0.3$ and a starting point of $\theta = 4$, the first step is massive. The gradient at $\theta = 4$ is 8, so $\theta$ jumps to 1.6 in a single update. The second step lands at 0.64. By step four, the parameter is already within 0.3 of the optimum.
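The whole loop fits in a few lines. Here is a minimal Python sketch (the function and variable names are mine, not from the post's charts) that reproduces the trajectory above:

```python
def gradient_descent(theta, lr, steps):
    """Minimize J(theta) = theta^2 by repeated downhill steps."""
    path = [theta]
    for _ in range(steps):
        grad = 2 * theta           # dJ/dtheta for J(theta) = theta^2
        theta = theta - lr * grad  # step downhill, scaled by the learning rate
        path.append(theta)
    return path

path = gradient_descent(theta=4.0, lr=0.3, steps=4)
# path is approximately [4.0, 1.6, 0.64, 0.256, 0.1024]
```

Notice how each step covers less ground than the last: the shrinking gradient does the braking automatically.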

This is the property that makes gradient descent work at all: step size scales naturally with steepness. Far from the minimum, gradients are large, so the optimizer covers ground fast. Near the minimum, gradients shrink, so the optimizer settles gently rather than overshooting. On a smooth convex surface like this one, convergence is guaranteed for any sufficiently small learning rate.

The real world is not this clean. Loss landscapes in deep learning have saddle points, flat plateaus, and sharp ravines. But the fundamental mechanism stays the same.

The Learning Rate Dilemma

$\alpha$ is the single most consequential hyperparameter in gradient descent. Get it wrong by a factor of two and you either waste a week of GPU time or blow up your training run entirely.

The failure modes are asymmetric. A learning rate that is too small will still converge, eventually. You burn compute, but you get an answer. A learning rate that is too large can diverge permanently. Each update overshoots the minimum, landing on a steeper part of the curve, which produces an even larger gradient, which causes an even bigger overshoot. The loss climbs exponentially until it hits infinity or NaN.

The chart below shows three training runs on the same quadratic loss, all starting at $\theta = 4$. Same function, same initialization, three very different outcomes.

Look at the too-slow run ($\alpha = 0.01$). After 20 full iterations it has only dropped from 16 to about 7. At that rate, reaching a loss below 0.01 would take nearly 200 iterations. The well-tuned run ($\alpha = 0.1$) gets there in under 20. The diverging run ($\alpha = 1.1$) never gets anywhere useful: every step multiplies the loss by 1.44, so it has roughly tripled by the third iteration.
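You can reproduce all three regimes numerically without any plotting. A small sketch, assuming the same quadratic loss and starting point as the charts:

```python
def loss_curve(lr, theta=4.0, steps=20):
    """Record the loss J(theta) = theta^2 over a run of gradient descent."""
    curve = []
    for _ in range(steps + 1):
        curve.append(theta ** 2)
        theta -= lr * 2 * theta  # diverges whenever |1 - 2 * lr| > 1
    return curve

too_slow   = loss_curve(0.01)  # still above 7 after 20 iterations
well_tuned = loss_curve(0.1)   # well below 0.01 after 20 iterations
diverging  = loss_curve(1.1)   # grows by 44% every step
```

On this quadratic, each update multiplies $\theta$ by $1 - 2\alpha$, which is where the divergence threshold in the comment comes from.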

This is why learning rate schedules exist. Start with a moderately large $\alpha$ to make fast initial progress, then decay it as you approach convergence. Cosine annealing, step decay, and warmup-then-decay are all strategies for navigating this tradeoff. Adaptive optimizers like Adam take a different approach: they maintain per-parameter learning rates that adjust automatically based on gradient history.
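As one concrete example, cosine annealing sweeps the learning rate from a maximum down to a minimum along a half cosine. A minimal sketch (parameter names and default values here are my own illustration, not a specific library's API):

```python
import math

def cosine_annealed_lr(step, total_steps, lr_max=0.1, lr_min=0.001):
    """Decay the learning rate from lr_max to lr_min over total_steps."""
    progress = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

cosine_annealed_lr(0, 100)    # lr_max: fast initial progress
cosine_annealed_lr(100, 100)  # lr_min: gentle settling near convergence
```

The decay is slow at the start, fastest in the middle, and slow again at the end, which tends to be gentler than an abrupt step decay.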

Adding Momentum

Vanilla gradient descent treats each step in isolation. The update at step t depends only on the gradient at step t, with no memory of what came before. This creates a problem in loss landscapes with long, narrow valleys. The optimizer oscillates back and forth across the narrow dimension while making slow progress along the long one.

Momentum fixes this by giving the optimizer a memory. Instead of using the raw gradient directly, you maintain a velocity vector that accumulates gradients over time:

$$v_t = \beta \cdot v_{t-1} + \alpha \cdot \nabla J(\theta)$$

$$\theta = \theta - v_t$$

$\beta$ controls the decay rate, typically between 0.5 and 0.9. When successive gradients point in the same direction, velocity builds up and the optimizer accelerates along that axis. When gradients alternate in sign, their contributions cancel and the oscillation dampens. The physics analogy is literal: this is a ball rolling downhill with friction.

The chart below compares vanilla gradient descent against momentum ($\beta = 0.5$) on our quadratic loss, both using $\alpha = 0.05$ and starting at $\theta = 4$.

By iteration 15, the momentum variant has reached $\theta = 0.04$. Vanilla GD is still at 0.74. That is roughly an 18x difference in remaining error, from adding one line of code to the update rule.
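In code, momentum really is a one-line change to the update loop. A sketch with the same settings as the chart ($\alpha = 0.05$, $\beta = 0.5$, $\theta_0 = 4$); exact endpoint values depend on how iterations are counted:

```python
def vanilla(theta=4.0, lr=0.05, steps=15):
    for _ in range(steps):
        theta -= lr * 2 * theta        # raw gradient step
    return theta

def with_momentum(theta=4.0, lr=0.05, beta=0.5, steps=15):
    v = 0.0
    for _ in range(steps):
        v = beta * v + lr * 2 * theta  # velocity accumulates gradients
        theta -= v                     # the only change to the update
    return theta

# with_momentum() lands far closer to the optimum than vanilla()
```

The extra state is just `v`: one float here, one tensor per parameter in a real model.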

The momentum line also reveals the characteristic tradeoff: a slight overshoot around iteration 19 where $\theta$ dips below zero. The accumulated velocity carries the parameter past the minimum before the gradient pulls it back. In practice this is harmless. The correction is fast, and the convergence speedup more than compensates.

Momentum becomes even more valuable in higher dimensions. Real loss surfaces have directions of high curvature (where gradients are large and updates oscillate) and directions of low curvature (where gradients are small and progress stalls). Momentum suppresses the oscillation and amplifies the slow progress simultaneously. This is why every serious optimizer, from SGD with momentum to Adam to AdaGrad, incorporates some memory of past gradients.

Key Takeaways

  • Gradient descent is iterative slope-following. The gradient gives direction, the learning rate gives step size. On a convex surface with a well-chosen $\alpha$, convergence is guaranteed.
  • The learning rate is the most dangerous hyperparameter. Too small wastes compute. Too large destroys the training run entirely. Adaptive optimizers and learning rate schedules exist because picking a single fixed $\alpha$ that works for an entire training run is nearly impossible on real problems.
  • Momentum gives the optimizer memory. It accelerates convergence in low-curvature directions and dampens oscillation in high-curvature directions. The cost is a single extra hyperparameter ($\beta$) and one extra vector of state.
  • These 1D examples transfer directly to models with millions of parameters. The geometry is harder to visualize, but the dynamics are identical. If you understand why $\alpha = 1.1$ diverges on a parabola, you understand why a bad learning rate will blow up a transformer.