
Automatic Differentiation - An Introduction

The paper establishes AD as a differentiation method distinct from manual, numerical, and symbolic differentiation.

AD computes derivatives by systematically applying the chain rule to the elementary operations in a computer program, and it achieves machine precision without the approximation error of numerical methods.
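To see what "no approximation error" means in practice, here is a quick sketch comparing a finite-difference estimate against the exact derivative (the helper name finite_difference and the step size are illustrative choices, not from the paper):

```python
import math

# Numerical differentiation approximates the derivative with a small
# step h, which introduces truncation and round-off error.
def finite_difference(g, x, h=1e-6):
    return (g(x + h) - g(x)) / h

x = 1.0
approx = finite_difference(math.sin, x)    # numerical estimate
exact = math.cos(x)                        # true derivative of sin
print(approx, exact, abs(approx - exact))  # error on the order of h
```

AD avoids this entirely: as the examples below show, it produces the same derivative a hand calculation would, down to floating-point precision.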

There are two primary modes of AD: forward mode and reverse (backward) mode.
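As a quick preview of forward mode, here is a minimal dual-number sketch in plain Python (the class name Dual and this tiny operator set are my own illustrative choices, not a library API): each value carries its derivative along through every elementary operation.

```python
class Dual:
    """A value paired with its derivative w.r.t. one chosen input."""
    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        # d(a + b) = da + db
        return Dual(self.value + other.value, self.deriv + other.deriv)

    def __mul__(self, other):
        # d(a * b) = a*db + b*da  (product rule)
        return Dual(self.value * other.value,
                    self.value * other.deriv + other.value * self.deriv)

def f(x, y):
    return (x + y) * (x * y)

# Derivative with respect to x: seed x with deriv=1, y with deriv=0.
out = f(Dual(2.0, 1.0), Dual(3.0, 0.0))
print(out.value, out.deriv)  # 30.0 21.0 -- exact, no approximation error
```

Forward mode needs one such pass per input. Reverse mode, which the story below walks through, instead propagates one gradient backwards from the output and recovers the derivatives with respect to all inputs in a single sweep.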

Since we are explaining it in the most basic terms, let us use a story!

Let us say we have a machine that takes in two inputs, x and y. The machine has to do the following:

f(x, y) = (x + y)·(x·y)

Now, looking at the above, let us assume that our machine is made up of three parts:

  1. Gear A, which computes u1 = x + y
  2. Gear B, which computes u2 = x·y
  3. Gear C, which computes the final output f = u1·u2

For simplicity, let x = 2 and y = 3.

Forward Pass:

  1. Gear A comes into the picture:
u1 = x + y = 2 + 3 = 5
  2. Gear B comes into the picture:
u2 = x·y = 2·3 = 6
  3. Finally, Gear C comes into play to produce the final output:
f = u1·u2 = 5·6 = 30

The machine in its forward pass has produced 30 as its final output!
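In plain Python, the forward pass is just the three gears evaluated in order (variable names mirror the story):

```python
x, y = 2.0, 3.0

u1 = x + y   # Gear A: 5.0
u2 = x * y   # Gear B: 6.0
f = u1 * u2  # Gear C: 30.0

print(u1, u2, f)  # 5.0 6.0 30.0
```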

Backward Pass (Backpropagation):

Think of it like this: the final output f asks its subjects, the inputs x and y, "How much did each of you contribute to me becoming 30?"

Now this message will travel backwards from Gear C to Gears A and B.

For Gear C the equation was:

f = u1·u2

Gear C knows that if u1 changes, then f changes by u2 = 6 for each unit of change (this is just the local partial derivative):

∂f/∂u1 = 6

Similarly, if u2 changes, then f changes by u1 = 5:

∂f/∂u2 = 5

So Gear C will send these messages back to the gears before it, A and B:

Message to A: "Your importance is 6."
Message to B: "Your importance is 5."
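Continuing the forward-pass snippet, Gear C's two messages are just its local partial derivatives scaled by the incoming gradient of f, which starts at 1:

```python
df = 1.0       # df/df = 1: the backward pass starts at the output

# Gear C: f = u1 * u2
du1 = df * u2  # message to Gear A: 1 * 6 = 6
du2 = df * u1  # message to Gear B: 1 * 5 = 5
```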

Now it's time for Gear A!

Gear A knows that

u1 = x + y

Gear A knows that if x changes by 1, then u1 also changes by 1, and the same goes for y: if it changes by 1, then u1 changes by 1.

So, multiplying the incoming message of 6 by these local derivatives, the total contributions so far are:

∂f/∂x += 6·1 = 6
∂f/∂y += 6·1 = 6

Now coming to Gear B:

Gear B knows that:

u2 = x·y

If x changes by 1, then u2 changes by y = 3, and if y changes by 1, then u2 changes by x = 2.

So the total contributions are as follows:

∂f/∂x += 5·3 = 15
∂f/∂y += 5·2 = 10

Now, adding up the contributions from both paths, the final gradients are:

∂f/∂x = 6 + 15 = 21
∂f/∂y = 6 + 10 = 16
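In code, continuing from Gear C's messages, Gears A and B apply their local derivatives and the contributions from the two paths accumulate:

```python
# Gear A: u1 = x + y -> both local partials are 1
dx = du1 * 1.0  # 6.0
dy = du1 * 1.0  # 6.0

# Gear B: u2 = x * y -> local partials are y and x
dx += du2 * y   # 6 + 5*3 = 21.0
dy += du2 * x   # 6 + 5*2 = 16.0

print(dx, dy)   # 21.0 16.0
```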

Conclusions:

By going through all of this, we came to know that any function can be broken down into small elementary operations. The forward pass computes the intermediate values of all the steps that come in between, and the backward pass computes the gradients by propagating values backwards using the chain rule. The chain rule is the core of everything. Each input variable contributes to the output along every path it appears on, and that is why backpropagation sums the contributions from the different paths.
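To tie the whole story together, here is a minimal tape-style reverse-mode sketch in plain Python. It is only an illustration under this post's assumptions: the class name Var, its methods, and the graph bookkeeping are my own choices, not any particular library's API. Each Var records which values produced it and their local partial derivatives, and backward() replays the graph in reverse, summing contributions exactly as the gears did:

```python
class Var:
    """A value that remembers how it was made, so gradients can flow back."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # list of (parent Var, local partial)
        self.grad = 0.0

    def __add__(self, other):
        # u = a + b: local partials are du/da = 1 and du/db = 1 (Gear A)
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        # u = a * b: local partials are du/da = b and du/db = a (Gear B)
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

    def backward(self):
        # Topologically sort the graph, then send each node's message to
        # its parents only after the node's own gradient is complete.
        order, seen = [], set()
        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent, _ in node.parents:
                    visit(parent)
                order.append(node)
        visit(self)
        self.grad = 1.0  # df/df = 1: the backward pass starts here
        for node in reversed(order):
            for parent, local in node.parents:
                parent.grad += node.grad * local  # sum over all paths

x, y = Var(2.0), Var(3.0)
f = (x + y) * (x * y)  # Gears A, B, and C in one line
f.backward()
print(f.value, x.grad, y.grad)  # 30.0 21.0 16.0
```

Run on our example, this prints 30.0 21.0 16.0, matching the gears' arithmetic above.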