The Chain Rule, unfolded
The premise
Backpropagation is not a special trick. It is the chain rule applied with bookkeeping — each derivative borrowing the work of the one in front of it.
Suppose you wrap functions like Russian dolls: L = f₅(f₄(f₃(f₂(f₁(w))))).
To know how much a tiny nudge in w changes L, the chain rule gives you a single, neat product of local derivatives:
dL/dw = f₅' · f₄' · f₃' · f₂' · f₁', five local slopes, each evaluated at the value flowing into its function.
A neural network is exactly that: a stack of nested functions. The walk
below uses a 2-weight network (one hidden unit, one output unit, sigmoid
activations, squared loss) and steps you through the forward pass and
the backward pass — landing on ∂L/∂w₁, the gradient that
actually updates the deepest weight.
First, a refresher.
If your calculus is rusty, this section gets you back on your feet in five minutes. Skip it if you remember derivatives and the chain rule.
01 What's a derivative, again?
A derivative is the slope of a function — how
much its output changes when you nudge its input. If
f'(x) = 3, then a tiny step Δx changes
f by about 3·Δx. Geometrically, it's the slope
of the line that just kisses the curve at that point — the tangent
line. Drag the slider:
Notice the slope is positive when the curve is going up, negative when it's going down, and zero at the bottom of the bowl. That last fact is what training a network is after — find weights where the loss has zero slope (a minimum).
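If it helps to see that numerically, here's a minimal sketch: nudge x by a tiny h and measure the rise over the run. The bowl f(x) = x² is my stand-in for the curve above, not necessarily the one in the demo.

```python
def f(x):
    return x ** 2          # a simple bowl, standing in for the curve above

def slope(f, x, h=1e-6):
    # "rise over a tiny run": nudge x both ways and take the centered difference
    return (f(x + h) - f(x - h)) / (2 * h)

for x in (-2.0, 0.0, 1.5):
    print(x, round(slope(f, x), 4))
# -2.0 -4.0   going downhill
#  0.0  0.0   bottom of the bowl
#  1.5  3.0   going uphill
```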
02 Two flavors of the letter "d"
You'll see both of these. They mean almost the same thing:
df/dx: used when f depends on a single variable.
"How does f change as x changes?"
∂f/∂x: used when f depends on several variables. The ∂ ("partial d")
is a reminder: treat all the other variables as constants
while differentiating with respect to x. That's the only
difference.
Why this matters for us: a neural network's loss depends on
many weights. When we ask "how does the loss change if I nudge
w₁?" we hold all the other weights still — that's a partial
derivative. The next section makes that idea visual.
03 Partial derivatives, visualized
Until now, our functions had one input. The functions in a neural network
have many: a loss like L(w₁, w₂, b₁, b₂) depends on
every weight at once. So when we write ∂L/∂w₁, we need a
precise way to answer "how does L change?" — and partial derivatives are
that answer.
Here's a function of two variables: f(x, y) = x² + 2y².
A function of one variable draws a curve; a function of two draws a
surface — picture a landscape with hills and
valleys. The heatmap below shows that surface from above (warmer color
= higher value, like a topographic map):
To find the slope of a surface at a point, you must specify which direction. There are two natural choices, one per axis:
∂f/∂x — change x, freeze y. Walk east–west.
∂f/∂y — change y, freeze x. Walk north–south.
Each choice flattens the surface into a 2D cross-section — and a partial derivative is just the ordinary derivative of that cross-section curve. The two panels below are exactly those cross-sections through the current point. Drag the heatmap and watch them update:
Notice the two cross-sections are different curves with different
slopes — that's the whole point. ∂f/∂x and
∂f/∂y are independent measurements: each tells you the
slope along just one axis.
The mechanical recipe for computing a partial derivative is exactly
what the symbol promises: treat the other variables
as constants and apply ordinary differentiation rules. For
f(x, y) = x² + 2y²: ∂f/∂x = 2x (the 2y² term is a constant, so it drops out) and ∂f/∂y = 4y (now x² is the constant).
Sanity check at (x, y) = (0.5, 0.3):
∂f/∂x = 2 · 0.5 = 1.0 and
∂f/∂y = 4 · 0.3 = 1.2. Match the readouts above? Good.
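The same sanity check in code, a small sketch using finite differences instead of the formulas (the function and the point (0.5, 0.3) are the ones above):

```python
def f(x, y):
    return x ** 2 + 2 * y ** 2

x0, y0, h = 0.5, 0.3, 1e-6

# ∂f/∂x: freeze y, nudge x
df_dx = (f(x0 + h, y0) - f(x0 - h, y0)) / (2 * h)
# ∂f/∂y: freeze x, nudge y
df_dy = (f(x0, y0 + h) - f(x0, y0 - h)) / (2 * h)

print(round(df_dx, 4), round(df_dy, 4))   # 1.0 1.2, matching 2x and 4y
```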
Back to the network: the loss L depends on every weight.
∂L/∂w₁ means: freeze every weight except w₁, then ask
how L changes as w₁ wiggles. That is precisely the slope we'll
multiply along the chain. Everything you saw on a curve still works —
the ∂ is just bookkeeping for "I'm holding the others still."
04 The chain rule, in plain English
When functions are nested — the output of one
becomes the input to the next — their derivatives multiply.
If y = f(g(x)), then dy/dx = f'(g(x)) · g'(x).
Read it as: "to get the slope of y in terms of x,
take the slope of f at the inner value, then multiply by
the slope of g." Each function in the chain hands its slope
down to the next. With n nested functions, you get n
slopes multiplied together. That's it. That's the entire mathematical
idea behind backpropagation.
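To make "n slopes multiplied together" concrete, here's a small Python sketch. The three functions in the chain are made-up examples; the point is the choreography: a forward pass that records the value flowing into each function, then a backward sweep that multiplies the local slopes from the outside in.

```python
import math

# A hypothetical chain L = f3(f2(f1(w))): each entry is (function, its derivative).
chain = [
    (lambda w: 3 * w + 1,   lambda w: 3.0),            # f1, innermost
    (lambda u: u ** 2,      lambda u: 2 * u),           # f2
    (lambda v: math.sin(v), lambda v: math.cos(v)),     # f3, outermost
]

def value_and_gradient(w):
    # Forward pass: remember what each function received.
    inputs, value = [], w
    for fn, _ in chain:
        inputs.append(value)
        value = fn(value)
    # Backward pass: multiply local slopes, outermost first.
    grad = 1.0
    for (_, dfn), v in zip(reversed(chain), reversed(inputs)):
        grad *= dfn(v)
    return value, grad        # L and dL/dw

L, dL_dw = value_and_gradient(0.5)
print(L, dL_dw)
```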
05 A tiny worked example
Let's compute dy/dx for y = (3x + 1)² at
x = 2. We could expand the square first, but let's use the
chain rule — same scaffolding we'll use on the network. The outer function is u² with u = 3x + 1, so its slope is 2u; the inner function's slope is 3. Multiply: dy/dx = 2(3x + 1) · 3 = 6(3x + 1), which at x = 2 is 6 · 7 = 42.
Sanity check: y(2) = 49. The chain rule says a tiny step
Δx = 0.01 should change y by about
42 · 0.01 = 0.42. The actual y(2.01) = 7.03² = 49.4209,
a change of 0.4209. The chain rule was right.
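Scripted, the same check looks like this (the numbers mirror the ones above):

```python
y = lambda x: (3 * x + 1) ** 2

x, dx = 2.0, 0.01
analytic = 6 * (3 * x + 1)               # the chain-rule slope: 42
predicted_change = analytic * dx         # 42 * 0.01 = 0.42
actual_change = y(x + dx) - y(x)         # 49.4209 - 49 = 0.4209
print(analytic, round(predicted_change, 2), round(actual_change, 4))
```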
06 Three rules you'll need below
That's almost everything. Three small derivative rules, plus one fact about the sigmoid, cover the rest:
Constant rule: the derivative of any number that doesn't depend on the variable is 0. This is why y drops out below — it's the target, just a number.
Scaling rule: d/dx (a·x) = a. If you scale x by a constant a, the slope is just a. We'll use this for the weighted-sum parts of the network.
Power rule: d/dx (uⁿ) = n·uⁿ⁻¹ · du/dx. Bring the exponent down, drop the exponent by one, then multiply by the derivative of the inside. We'll use this on the squared loss.
Sigmoid: σ(z) = 1 / (1 + e⁻ᶻ) has a beautiful derivative: σ'(z) = σ(z)(1 − σ(z)). We'll derive this when we hit it — but the punchline is that we can compute it from the value we already have.
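If you want that punchline verified before the derivation, here's a quick numerical check (z = 0.7 is just a sample point):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

z, h = 0.7, 1e-6
s = sigmoid(z)                 # the value we already have from the forward pass

from_value = s * (1 - s)                                  # σ(z)(1 − σ(z))
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)     # finite-difference slope

print(round(from_value, 6), round(numeric, 6))            # both ≈ 0.2217
```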
Backpropagation is exactly this — slope of the loss with respect to a weight — computed efficiently for every weight in the network at once.
The interactive demo shows the sigmoid σ(z) curve alongside the network's current state. Press “Step ▸” to begin.
We will compute one prediction with the network, measure how wrong it is, and then walk the gradient back to the very first weight.
The chain, written out
Each box is a local derivative. The product is ∂L/∂w₁.
Boxes light up as the backward pass computes them.
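Here's the whole chain as a short script. It's a sketch, not the demo's exact code: the input, target, and starting weights are placeholder values, and I'm assuming no bias terms and a plain (a₂ − y)² loss. Each backward-pass line is one of the boxes; the last line is their product, with a finite-difference check for good measure.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Placeholder values -- the interactive demo supplies its own.
x, y_target = 1.0, 0.0      # input and target
w1, w2 = 0.5, -0.3          # the two weights (no bias terms assumed)

# Forward pass: compute and cache every intermediate value.
z1 = w1 * x                 # hidden pre-activation
a1 = sigmoid(z1)            # hidden activation
z2 = w2 * a1                # output pre-activation
a2 = sigmoid(z2)            # prediction
L = (a2 - y_target) ** 2    # squared loss

# Backward pass: one local derivative per box, multiplied right to left.
dL_da2  = 2 * (a2 - y_target)   # power rule on the squared loss
da2_dz2 = a2 * (1 - a2)         # sigmoid derivative, from the cached value
dz2_da1 = w2                    # scaling rule
da1_dz1 = a1 * (1 - a1)         # sigmoid derivative again
dz1_dw1 = x                     # scaling rule

dL_dw1 = dL_da2 * da2_dz2 * dz2_da1 * da1_dz1 * dz1_dw1
print("chain product:", dL_dw1)

# Finite-difference check: nudge w1 and watch the loss move.
eps = 1e-6
a2_eps = sigmoid(w2 * sigmoid((w1 + eps) * x))
print("numeric check:", ((a2_eps - y_target) ** 2 - L) / eps)
```

Notice the grouping: dL_da2 * da2_dz2 is ∂L/∂z₂, and multiplying by dz2_da1 gives ∂L/∂a₁, exactly the reuse the next section describes.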
Why it scales
Notice the trick: once you've computed ∂L/∂z₂, you reuse it
to get ∂L/∂a₁. Then ∂L/∂a₁ gets reused for
∂L/∂z₁. Each layer's gradient is just the next layer's
gradient times a local derivative. That is the entire idea of
backpropagation — a single right-to-left sweep, no matter how deep.
What's not shown
Real networks have many neurons per layer, so the local derivatives
become Jacobians and the products become matrix multiplications. But
the choreography is identical: forward to compute activations, backward
to compose gradients, then a single update step
w ← w − η · ∂L/∂w.
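For completeness, here's roughly what that matrix version looks like; a NumPy sketch under my own assumptions (layer sizes, random data, a summed squared loss), not the article's demo.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up sizes and data, just to show the shape of the computation.
x = rng.normal(size=3)            # input vector
y = rng.normal(size=2)            # target vector
W1 = rng.normal(size=(4, 3))      # hidden-layer weights
W2 = rng.normal(size=(2, 4))      # output-layer weights

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Forward: same choreography, now with matrix-vector products.
z1 = W1 @ x
a1 = sigmoid(z1)
z2 = W2 @ a1
a2 = sigmoid(z2)
L = np.sum((a2 - y) ** 2)

# Backward: each local derivative is a Jacobian; composing them is matrix math.
dL_da2 = 2 * (a2 - y)
dL_dz2 = dL_da2 * a2 * (1 - a2)   # sigmoid's Jacobian is diagonal, so elementwise
dL_dW2 = np.outer(dL_dz2, a1)     # gradient for the output weights
dL_da1 = W2.T @ dL_dz2            # reuse dL_dz2, just as in the scalar case
dL_dz1 = dL_da1 * a1 * (1 - a1)
dL_dW1 = np.outer(dL_dz1, x)      # gradient for the deepest weights

# One update step: w <- w - eta * dL/dw.
eta = 0.1
W1 -= eta * dL_dW1
W2 -= eta * dL_dW2
```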