Numerical Notebooks · No. 03 · Backpropagation

The Chain Rule, unfolded

13 steps forward · backward
one tiny network

The premise

Backpropagation is not a special trick. It is the chain rule applied with bookkeeping — each derivative borrowing the work of the one in front of it.

Suppose you wrap functions like Russian dolls: L = f₅(f₄(f₃(f₂(f₁(w))))). To know how much a tiny nudge in w changes L, the chain rule gives you a single, neat product of local derivatives:

Chain rule, in one line ∂L/∂w  =  ∂L/∂f₅ · ∂f₅/∂f₄ · ∂f₄/∂f₃ · ∂f₃/∂f₂ · ∂f₂/∂f₁ · ∂f₁/∂w

A neural network is exactly that: a stack of nested functions. The walk below uses a 2-weight network (one hidden unit, one output unit, sigmoid activations, squared loss) and steps you through the forward pass and the backward pass — landing on ∂L/∂w₁, the gradient that actually updates the deepest weight.

First, a refresher.

If your calculus is rusty, this section gets you back on your feet in five minutes. Skip it if you remember derivatives and the chain rule.

01 · What's a derivative, again?

A derivative is the slope of a function — how much its output changes when you nudge its input. If f'(x) = 3, then a tiny step Δx changes f by about 3·Δx. Geometrically, it's the slope of the line that just kisses the curve at that point — the tangent line. Drag the slider:

[interactive: drag x along the curve]

At x = 2.00: f(x) = ½x² − x + 1.5 reads 1.500, and the slope there is f′(x) = 1.000, so predict f(2.10) ≈ 1.500 + 1.000·0.10 = 1.600.

Notice the slope is positive when the curve is going up, negative when it's going down, and zero at the bottom of the bowl. That last fact is what training a network is after — find weights where the loss has zero slope (a minimum).
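The readout above is easy to reproduce. Here's a small sketch using a central finite difference to estimate the slope of the demo function f(x) = ½x² − x + 1.5 at x = 2, then making the same tangent-line prediction:

```python
# the bowl from the demo above
def f(x):
    return 0.5 * x**2 - x + 1.5

def slope(f, x, h=1e-5):
    # central finite difference: approximate f'(x) numerically
    return (f(x + h) - f(x - h)) / (2 * h)

x = 2.0
print(f(x))          # 1.5
print(slope(f, x))   # ≈ 1.0 (the analytic derivative is x − 1)

# tangent-line prediction vs the true value at x = 2.10
predicted = f(x) + slope(f, x) * 0.10
print(predicted, f(2.10))   # ≈ 1.6 vs 1.605
```

The small gap between 1.600 and 1.605 is the curvature the tangent line ignores; it shrinks as the step does.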

02 · Two flavors of the letter "d"

You'll see both of these. They mean almost the same thing:

df/dx
Ordinary derivative

Used when f depends on a single variable. "How does f change as x changes?"

∂f/∂x
Partial derivative

Used when f depends on several variables. The ∂ ("partial d") is a reminder: treat all the other variables as constants while differentiating with respect to x. That's the only difference.

Why this matters for us: a neural network's loss depends on many weights. When we ask "how does the loss change if I nudge w₁?" we hold all the other weights still — that's a partial derivative. The next section makes that idea visual.

03 · Partial derivatives, visualized

Until now, our functions had one input. The functions in a neural network have many: a loss like L(w₁, w₂, b₁, b₂) depends on every weight at once. So when we write ∂L/∂w₁, we need a precise way to answer "how does L change?" — and partial derivatives are that answer.

Here's a function of two variables: f(x, y) = x² + 2y². A function of one variable draws a curve; a function of two draws a surface — picture a landscape with hills and valleys. The heatmap below shows that surface from above (warmer color = higher value, like a topographic map):

[interactive: click anywhere on the heatmap, or use the sliders]

At (x, y) = (0.50, 0.30): f(x, y) = 0.43, with ∂f/∂x = +1.00 and ∂f/∂y = +1.20.

To find the slope of a surface at a point, you must specify which direction. There are two natural choices, one per axis:

∂f/∂x  — change x, freeze y. Walk east–west.
∂f/∂y  — change y, freeze x. Walk north–south.

Each choice flattens the surface into a 2D cross-section — and a partial derivative is just the ordinary derivative of that cross-section curve. The two panels below are exactly those cross-sections through the current point. Drag the heatmap and watch them update:

slice 1 · vary x, y held at 0.30: the slope of this curve at x is ∂f/∂x = +1.00
slice 2 · vary y, x held at 0.50: the slope of this curve at y is ∂f/∂y = +1.20

Notice the two cross-sections are different curves with different slopes — that's the whole point. ∂f/∂x and ∂f/∂y are independent measurements: each tells you the slope along just one axis.

The mechanical recipe for computing a partial derivative is exactly what the symbol promises: treat the other variables as constants and apply ordinary differentiation rules. For f(x, y) = x² + 2y²:

∂f/∂x = 2x — differentiate x² (gives 2x); 2y² is a constant w.r.t. x, so its derivative is 0
∂f/∂y = 4y — differentiate 2y² (gives 4y); x² is a constant w.r.t. y, so its derivative is 0

Sanity check at (x, y) = (0.5, 0.3): ∂f/∂x = 2 · 0.5 = 1.0 and ∂f/∂y = 4 · 0.3 = 1.2. Match the readouts above? Good.
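The "freeze the others, nudge one" recipe is also easy to check numerically. A sketch, using one central finite difference per axis on the demo function f(x, y) = x² + 2y²:

```python
def f(x, y):
    return x**2 + 2 * y**2

def partial_x(f, x, y, h=1e-5):
    # freeze y, nudge x: a central difference along the x axis
    return (f(x + h, y) - f(x - h, y)) / (2 * h)

def partial_y(f, x, y, h=1e-5):
    # freeze x, nudge y: a central difference along the y axis
    return (f(x, y + h) - f(x, y - h)) / (2 * h)

x, y = 0.5, 0.3
print(partial_x(f, x, y))   # ≈ 1.0  (analytic: 2x)
print(partial_y(f, x, y))   # ≈ 1.2  (analytic: 4y)
```

Each helper only ever varies one argument, which is the entire content of the ∂ symbol.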

Back to the network: the loss L depends on every weight. ∂L/∂w₁ means: freeze every weight except w₁, then ask how L changes as w₁ wiggles. That is precisely the slope we'll multiply along the chain. Everything you saw on a curve still works — the ∂ is just bookkeeping for "I'm holding the others still."

04 · The chain rule, in plain English

When functions are nested — the output of one becomes the input to the next — their derivatives multiply. If y = f(g(x)):

Chain rule dy/dx  =  (df/dg)  ·  (dg/dx)

Read it as: "to get the slope of y in terms of x, take the slope of f at the inner value, then multiply by the slope of g." Each function in the chain hands its slope down to the next. With n nested functions, you get n slopes multiplied together. That's it. That's the entire mathematical idea behind backpropagation.

05 · A tiny worked example

Let's compute dy/dx for y = (3x + 1)² at x = 2. We could expand the square first, but let's use the chain rule — same scaffolding we'll use on the network.

Name the inner piece. Let u = 3x + 1. Then y = u².
dy/du = 2u  (power rule: d/du of u² is 2u)
du/dx = 3  (d/dx of 3x is 3; d/dx of the constant 1 is 0)
Apply the chain. dy/dx = (dy/du) · (du/dx) = 2u · 3 = 6u = 6(3x + 1)
Plug in x = 2.  dy/dx = 6 · 7 = 42

Sanity check: y(2) = 49. The chain rule says a tiny step Δx = 0.01 should change y by about 42 · 0.01 = 0.42. The actual y(2.01) = 7.03² = 49.4209, a change of 0.4209. The chain rule was right.
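The same sanity check can be done in code. A sketch comparing the chain-rule answer 6(3x + 1) against finite differences with shrinking step sizes:

```python
def y(x):
    return (3 * x + 1) ** 2

x = 2.0
chain_rule = 6 * (3 * x + 1)   # dy/dx = 2u · 3 with u = 3x + 1
print(chain_rule)              # 42.0

# forward-difference estimates converge to the same slope
for h in (0.01, 0.001, 0.0001):
    print((y(x + h) - y(x)) / h)   # 42.09, 42.009, 42.0009
```

The estimates overshoot by exactly 9h here (expand the square to see why), so they walk down to 42 as h shrinks.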

06 · Three rules you'll need below

That's almost everything. Three small derivative rules cover the rest:

(c)' = 0
Constant rule

The derivative of any number that doesn't depend on the variable is 0. This is why y drops out below — it's the target, just a number.

(a·x)' = a
Linear rule

If you scale x by a constant a, the slope is just a. We'll use this for the weighted-sum parts of the network.

(uⁿ)' = n·uⁿ⁻¹·u'
Power rule (with chain)

Bring the exponent down, drop the exponent by one, then multiply by the derivative of the inside. We'll use this on the squared loss.

σ(z) = 1/(1+e⁻ᶻ)
Sigmoid identity

Has a beautiful derivative: σ'(z) = σ(z)(1 − σ(z)). We'll derive this when we hit it — but the punchline is that we can compute it from the value we already have.

Backpropagation is exactly this: the slope of the loss with respect to each weight, computed efficiently for every weight in the network at once.
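The sigmoid identity is worth verifying before we lean on it. A quick sketch checking σ′(z) = σ(z)(1 − σ(z)) against a central finite difference at a few points:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

for z in (-2.0, 0.0, 0.5, 3.0):
    identity = sigmoid(z) * (1 - sigmoid(z))   # computed from the value we already have
    numeric = (sigmoid(z + 1e-5) - sigmoid(z - 1e-5)) / 2e-5
    print(z, identity, numeric)   # the two columns agree
```

That "computed from the value we already have" comment is the practical payoff: the forward pass stores σ(z), so the backward pass gets σ′(z) almost for free.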

[interactive stepper: a sigmoid σ(z) plot and the network state, advanced with "Step ▸"]

SETUP

We will compute one prediction with the network, measure how wrong it is, and then walk the gradient back to the very first weight.

x = 1.0  |  w₁ = 0.5  |  w₂ = 0.8  |  b₁ = b₂ = 0  |  y = 1.0
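The setup above runs as a forward pass in a few lines. One assumption the text leaves open: the squared loss is taken here as L = ½(a₂ − y)², where the ½ is a common convention that cleans up the derivative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# the setup values from above
x, y = 1.0, 1.0
w1, w2 = 0.5, 0.8
b1 = b2 = 0.0

# forward pass: one prediction, inside out
z1 = w1 * x + b1          # weighted input to the hidden unit
a1 = sigmoid(z1)          # hidden activation   ≈ 0.6225
z2 = w2 * a1 + b2         # weighted input to the output unit
a2 = sigmoid(z2)          # prediction          ≈ 0.6220
L = 0.5 * (a2 - y) ** 2   # squared loss (½ factor assumed) ≈ 0.0714
print(a2, L)
```

Every intermediate value (z₁, a₁, z₂, a₂) is kept; the backward pass will reuse all of them.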

The chain, written out

Each box is a local derivative. The product is ∂L/∂w₁. Boxes light up as the backward pass computes them.

∂L/∂w₁  =  ∂L/∂a₂ · ∂a₂/∂z₂ · ∂z₂/∂a₁ · ∂a₁/∂z₁ · ∂z₁/∂w₁

Why it scales

Notice the trick: once you've computed ∂L/∂z₂, you reuse it to get ∂L/∂a₁. Then ∂L/∂a₁ gets reused for ∂L/∂z₁. Each layer's gradient is just the next layer's gradient times a local derivative. That is the entire idea of backpropagation — a single right-to-left sweep, no matter how deep.
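The five boxes can be computed and multiplied directly. A sketch (same assumption as before: L = ½(a₂ − y)², which makes ∂L/∂a₂ = a₂ − y), with a finite-difference check on the final product:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x, y = 1.0, 1.0
w2 = 0.8
b1 = b2 = 0.0

def forward(w1):
    # recompute the whole network for a given w1 (everything else fixed)
    a1 = sigmoid(w1 * x + b1)
    a2 = sigmoid(w2 * a1 + b2)
    return a1, a2, 0.5 * (a2 - y) ** 2

w1 = 0.5
a1, a2, L = forward(w1)

# the five local derivatives, left to right
dL_da2  = a2 - y            # ∂L/∂a₂   (from L = ½(a₂ − y)²)
da2_dz2 = a2 * (1 - a2)     # ∂a₂/∂z₂  (sigmoid identity)
dz2_da1 = w2                # ∂z₂/∂a₁
da1_dz1 = a1 * (1 - a1)     # ∂a₁/∂z₁  (sigmoid identity again)
dz1_dw1 = x                 # ∂z₁/∂w₁

grad_w1 = dL_da2 * da2_dz2 * dz2_da1 * da1_dz1 * dz1_dw1

# finite-difference check: nudge w1 and watch the loss move
h = 1e-6
numeric = (forward(w1 + h)[2] - forward(w1 - h)[2]) / (2 * h)
print(grad_w1, numeric)   # both ≈ −0.0167, and they agree
```

The negative sign is meaningful: increasing w₁ would decrease the loss here, which is exactly why the update step subtracts the gradient.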

What's not shown

Real networks have many neurons per layer, so the local derivatives become Jacobians and the products become matrix multiplications. But the choreography is identical: forward to compute activations, backward to compose gradients, then a single update step w ← w − η · ∂L/∂w.
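To close the loop, the update step w ← w − η · ∂L/∂w can be iterated. A sketch of plain gradient descent on both weights of the tiny network (same assumed loss L = ½(a₂ − y)²; the learning rate η = 1.0 is an arbitrary choice for illustration):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x, y = 1.0, 1.0
w1, w2 = 0.5, 0.8
eta = 1.0   # learning rate (arbitrary, for illustration)

for step in range(200):
    # forward: compute activations
    a1 = sigmoid(w1 * x)
    a2 = sigmoid(w2 * a1)
    L = 0.5 * (a2 - y) ** 2

    # backward: one right-to-left sweep, reusing upstream gradients
    dL_dz2 = (a2 - y) * a2 * (1 - a2)      # reused for both weights below
    dL_dw2 = dL_dz2 * a1
    dL_dz1 = dL_dz2 * w2 * a1 * (1 - a1)   # the reuse trick from above
    dL_dw1 = dL_dz1 * x

    # update step: w ← w − η · ∂L/∂w
    w1 -= eta * dL_dw1
    w2 -= eta * dL_dw2

print(L)   # well below the starting loss of ≈ 0.071
```

With many neurons per layer, `dL_dz2` becomes a vector and the products become matrix multiplications, but these lines are the whole choreography.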