Automatic Differentiation - An Introduction
The paper establishes automatic differentiation (AD) as a differentiation method distinct from manual, numerical, and symbolic differentiation.
AD computes derivatives by systematically applying the chain rule to the elementary operations that make up a computer program, and it achieves machine-level precision without the approximation error of numerical methods.
It can be put like this:
- Symbolic differentiation can get messy - the expressions can blow up in size (expression swell).
- Numerical differentiation (finite differences) is only approximate and suffers from truncation and round-off error.
- Automatic differentiation gives exact derivatives up to machine precision, efficiently.
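To make the contrast concrete, here is a tiny sketch in plain Python (the function f and the step size h are just illustrative choices, not from the paper):

```python
# Compare a finite-difference estimate with the exact derivative of f(x) = x**2.
def f(x):
    return x * x

def finite_difference(f, x, h=1e-5):
    # Forward-difference estimate: approximate, and sensitive to the step h.
    return (f(x + h) - f(x)) / h

x = 3.0
print(finite_difference(f, x))  # roughly 6.00001 (truncation + round-off error)
print(2 * x)                    # 6.0 exactly, which is what AD delivers
```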
There are two primary modes of AD - forward mode and reverse (backward) mode.
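The story below walks through reverse mode. Forward mode can be sketched very compactly with dual numbers - here is a minimal, illustrative version (the class name Dual and its methods are our own invention, and it happens to use the same function the story is about to build):

```python
# Forward-mode AD in miniature: a dual number carries a value and a derivative,
# and every operation updates both at once.
class Dual:
    def __init__(self, value, deriv):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        return Dual(self.value + other.value, self.deriv + other.deriv)

    def __mul__(self, other):
        # Product rule applied automatically.
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

# d/dx of (x + y) * (x * y) at x = 2, y = 3:
# seed x's derivative with 1 and y's with 0.
x = Dual(2.0, 1.0)
y = Dual(3.0, 0.0)
f = (x + y) * (x * y)
print(f.value, f.deriv)   # 30.0 21.0
```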
Since we are explaining this in the most basic terms, let us use a story!
Let us say we have a machine that takes in two inputs, x and y. The machine has to compute the following:
f(x, y) = (x + y) * (x * y)
Now, looking at the above, let us assume that our machine is made up of three parts:
- Gear A: responsible for adding the two inputs to produce u1 = x + y
- Gear B: responsible for multiplying the two inputs to produce u2 = x * y
- Gear C: responsible for multiplying the two intermediate results (u1 and u2) and giving out the final result f = u1 * u2
For simplicity, let x = 2 and y = 3.
Forward Pass:
- Gear A comes into the picture: u1 = x + y = 2 + 3 = 5
- Gear B comes into the picture: u2 = x * y = 2 * 3 = 6
- Finally, Gear C comes into play to produce the final output: f = u1 * u2 = 5 * 6 = 30
The machine in its forward pass has produced 30 as its final output!
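If we wanted to write the machine down as code, a minimal sketch could look like this (the names gear_A, gear_B, and gear_C are just our labels for the story):

```python
# The gear machine's forward pass as plain Python.
def gear_A(x, y):    # u1 = x + y
    return x + y

def gear_B(x, y):    # u2 = x * y
    return x * y

def gear_C(u1, u2):  # f = u1 * u2
    return u1 * u2

x, y = 2.0, 3.0
u1 = gear_A(x, y)     # 5.0
u2 = gear_B(x, y)     # 6.0
f = gear_C(u1, u2)    # 30.0
print(u1, u2, f)
```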
Backward Pass (Backpropagation):
Think of it like this: the final output f asks its subjects, the inputs x and y, "how much did each of you contribute to me becoming 30?"
Now this message will travel backwards from Gear C to Gears A and B.
For Gear C the equation was: f = u1 * u2.
C knows that if u1 changes, then f changes by u2 = 6 (basic chain rule): ∂f/∂u1 = u2 = 6.
Similarly, if u2 changes, then f changes by u1 = 5: ∂f/∂u2 = u1 = 5.
So Gear C will send this message back to the gears before it, A and B:
- Message to A: "Your importance is 6."
- Message to B: "Your importance is 5."
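In code, Gear C's step of the backward pass might look like this (a sketch; the values come straight from the forward pass above, and the variable names are illustrative):

```python
# Gear C's backward step for f = u1 * u2.
u1, u2 = 5.0, 6.0
df_df = 1.0                  # the output's sensitivity to itself
message_to_A = df_df * u2    # df/du1 = u2 = 6  ("your importance is 6")
message_to_B = df_df * u1    # df/du2 = u1 = 5  ("your importance is 5")
print(message_to_A, message_to_B)
```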
Now it's time for Gear A!
Gear A knows that u1 = x + y.
It also knows that if x changes by 1 then u1 changes by 1, and the same for y: ∂u1/∂x = 1 and ∂u1/∂y = 1.
So the contributions through Gear A are as follows:
- Contribution of x via A = 6 * 1 = 6
- Contribution of y via A = 6 * 1 = 6
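The same step for Gear A, as a small sketch (names are illustrative):

```python
# Gear A's backward step for u1 = x + y: both local derivatives are 1.
message_from_C = 6.0
x_contribution_via_A = message_from_C * 1.0   # 6
y_contribution_via_A = message_from_C * 1.0   # 6
```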
Now coming to Gear B:
Gear B knows that u2 = x * y.
If x changes by 1 then u2 changes by y = 3, and if y changes by 1 then u2 changes by x = 2: ∂u2/∂x = y = 3 and ∂u2/∂y = x = 2.
So the contributions through Gear B are as follows:
- Contribution of x via B = 5 * 3 = 15
- Contribution of y via B = 5 * 2 = 10
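And Gear B's step, as a sketch:

```python
# Gear B's backward step for u2 = x * y: du2/dx = y and du2/dy = x.
x, y = 2.0, 3.0
message_from_C = 5.0
x_contribution_via_B = message_from_C * y   # 15
y_contribution_via_B = message_from_C * x   # 10
```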
Now for the final contributions, we sum the paths through A and B:
- ∂f/∂x = 6 + 15 = 21
- ∂f/∂y = 6 + 10 = 16
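Putting the whole forward and backward pass together in one self-contained sketch (again, just illustrative Python following the story, not library code):

```python
# Forward pass.
x, y = 2.0, 3.0
u1, u2 = x + y, x * y        # 5, 6
f = u1 * u2                  # 30

# Backward pass.
df_du1, df_du2 = u2, u1             # Gear C: 6 and 5
df_dx = df_du1 * 1.0 + df_du2 * y   # via A (6) + via B (15) = 21
df_dy = df_du1 * 1.0 + df_du2 * x   # via A (6) + via B (10) = 16
print(df_dx, df_dy)                 # 21.0 16.0
```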
Conclusions:
By going through all of this, we came to know that each function can be broken down into small elementary operations. The forward pass computes the intermediate values of all the steps in between, and the backward pass computes the gradients by propagating values backwards using the chain rule. The chain rule is the core of everything. Each input variable contributes to the output along every path that connects them, and that is why backpropagation sums the contributions from the different paths.
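To tie the conclusion back to code, here is a minimal reverse-mode sketch in which every elementary operation records its inputs and local derivatives, and the backward pass sums contributions along every path. The Var class and its methods are our own illustrative invention, not the paper's notation:

```python
# A toy reverse-mode AD node: value, accumulated gradient, and the parents it
# was built from, each paired with the local derivative toward that parent.
class Var:
    def __init__(self, value, parents=()):
        self.value = value
        self.grad = 0.0
        self.parents = parents   # pairs of (parent Var, local derivative)

    def __add__(self, other):
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

    def backward(self, upstream=1.0):
        # Accumulate the contribution arriving along this path (this is the
        # summing over paths), then pass chain-rule products further back.
        # Real implementations record a tape and traverse it once in reverse;
        # this naive recursion is just for illustration.
        self.grad += upstream
        for parent, local_deriv in self.parents:
            parent.backward(upstream * local_deriv)

x, y = Var(2.0), Var(3.0)
f = (x + y) * (x * y)    # forward pass builds the graph; f.value == 30.0
f.backward()             # backward pass applies the chain rule
print(x.grad, y.grad)    # 21.0 16.0
```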