habib's rabbit hole

CNN Build - Weight Init Rabbit Hole

Context — What I was trying to do

midway through implementing a cnn from scratch, got to weight init and hit the question: which one to use, He or Xavier? also, why use them in the first place, and how is ReLU related to all of this? might sound naive, but it was a new thing for me

Finding 1 - ReLU halves your variance

let us say the input signal z is a random variable with a symmetric distribution centered at zero, z ~ 𝒩(0, σ²). since ReLU(x) = max(0, x), the output keeps only the positive side of the input: ReLU acts as a binary gate that kills the negative inputs, and because the distribution is symmetric, about 50% of the input signal is gone from the output. (strictly speaking, what gets halved is the second moment, E[a²] = (1/2) · Var(z) for a = ReLU(z), which is the quantity the next layer's variance depends on. these notes loosely call it "the variance".)

so if nothing compensates, the variance drops by half at every layer and the activations shrink toward zero. once the activations shrink, the gradients calculated during backprop become very small too -> almost no training at all.

to counter this halving effect, He initialization was proposed: instead of initializing weights with a variance of 1/n (where n is the number of input nodes), He initialization uses Var(W) = 2/n

the "2" in the numerator is specifically designed to cancel out the "1/2" introduced by the ReLU, keeping the variance stable (near 1.0) across hundreds of layers.
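a quick numerical check of the halving claim, as a minimal sketch. the sample size, seed and sigma are arbitrary choices of mine, not something from a reference. one subtlety it makes visible: the quantity that is halved is the second moment E[a²]. the centered variance Var[a] drops a bit more, because the ReLU output no longer has zero mean.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.5  # arbitrary; any positive value gives the same ratios
z = rng.normal(0.0, sigma, size=1_000_000)  # symmetric, zero-centered input
a = np.maximum(z, 0.0)                      # ReLU

frac_zeroed = np.mean(z <= 0)                           # share of inputs the gate kills
second_moment_ratio = np.mean(a ** 2) / np.mean(z ** 2)  # E[a^2] / E[z^2]
variance_ratio = a.var() / z.var()                      # Var[a] / Var[z]

print(f"zeroed:          {frac_zeroed:.3f}")          # close to 0.5
print(f"E[a^2] / E[z^2]: {second_moment_ratio:.3f}")  # close to 0.5
print(f"Var[a] / Var[z]: {variance_ratio:.3f}")       # close to 1/2 - 1/(2*pi), about 0.34
```

the next layer computes a weighted sum of these activations with zero-mean weights, so its variance depends on E[a²], which is why the factor that matters for initialization is the 1/2.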

why not Xavier

before He initialization came along, people used Xavier (Glorot) initialization, but ReLU networks still kept dying in the deeper layers. this was because:

Xavier assumes that the activation function is just a pass-through as far as variance is concerned

look at the tanh function for instance:

[figure: plot of tanh(x)]

near (0, 0) it looks linear (a straight line with slope 1), and because it is straight there, a small signal passing through it keeps its variance. but when you swap tanh for ReLU, which looks like a hinge, 50% of the distribution gets thrown away and that assumption breaks.

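if you use Xavier (Var[W] = 1/n), the weights on their own preserve the variance: with a unit-variance input, the signal before the first ReLU (the z value) has a variance of about 1.0. but every ReLU then halves it, and nothing in the weights doubles it back.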

Mathematically, the variance at layer L is: Var(L) = Var(0) · (1/2)^L

why "He Initialization" wins:

He simply looked at that (1/2) and said

if the activation halves the variance, we will make the weights double it.

Xavier weight variance: 1/n

He weight variance: 2/n
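for a conv layer, n is the fan-in of one output unit: in_channels × kernel_height × kernel_width. here is a minimal NumPy sketch of both rules. the function names, the (out, in, kh, kw) weight layout and the 64×32×3×3 example shape are my own illustrative choices. also note that "xavier" here is the simplified fan-in form 1/n used in these notes; Glorot's original rule averages fan-in and fan-out, 2 / (fan_in + fan_out).

```python
import numpy as np

def conv_fan_in(in_channels, kernel_h, kernel_w):
    """Number of inputs feeding one output unit of a conv layer (the n above)."""
    return in_channels * kernel_h * kernel_w

def init_conv_weights(out_channels, in_channels, kernel_h, kernel_w,
                      scheme="he", rng=None):
    """Draw zero-mean Gaussian conv weights with variance numerator / n.

    scheme="he"     -> variance 2/n
    scheme="xavier" -> variance 1/n (simplified fan-in form, as in these notes)
    """
    rng = np.random.default_rng() if rng is None else rng
    n = conv_fan_in(in_channels, kernel_h, kernel_w)
    numerator = {"he": 2.0, "xavier": 1.0}[scheme]
    std = np.sqrt(numerator / n)
    return rng.normal(0.0, std, size=(out_channels, in_channels, kernel_h, kernel_w))

# hypothetical layer: 32 input channels, 64 filters, 3x3 kernels -> n = 288
rng = np.random.default_rng(0)
w_he = init_conv_weights(64, 32, 3, 3, scheme="he", rng=rng)
w_xavier = init_conv_weights(64, 32, 3, 3, scheme="xavier", rng=rng)

n = conv_fan_in(32, 3, 3)
print(n, w_he.var() * n, w_xavier.var() * n)  # variance times n: near 2 and near 1
```

biases are left out on purpose; they are usually just set to zero.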

by adding that 2 in the numerator, the forward pass math becomes:

Var[z] = n · (2/n) · Var[a_prev] = 2 · Var[a_prev]

Var[a] = (1/2) · Var[z] = (1/2) · (2 · Var[a_prev]) = Var[a_prev]

the 2 and the 1/2 cancel out perfectly.

so in expectation the signal's variance stays where it started (1.0 for a unit-variance input), no matter how deep the network is.
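to see both behaviours side by side, here is a small simulation sketch: a stack of fully connected ReLU layers (dense instead of conv, just to keep it short), fed unit-variance Gaussian input. depth, width, batch size and seed are arbitrary values I picked. it tracks the mean squared activation per layer, which is the quantity the formulas above follow.

```python
import numpy as np

def second_moment_per_layer(numerator, depth=20, width=512, batch=256, seed=0):
    """Push random input through `depth` ReLU layers whose weights have
    variance numerator / width; return mean(a**2) after each layer."""
    rng = np.random.default_rng(seed)
    a = rng.normal(0.0, 1.0, size=(batch, width))  # unit-variance input
    history = []
    for _ in range(depth):
        W = rng.normal(0.0, np.sqrt(numerator / width), size=(width, width))
        a = np.maximum(a @ W, 0.0)                 # linear layer, then ReLU
        history.append(float(np.mean(a ** 2)))
    return history

xavier = second_moment_per_layer(numerator=1.0)  # Var[W] = 1/n
he = second_moment_per_layer(numerator=2.0)      # Var[W] = 2/n

for layer in (1, 5, 10, 20):
    print(f"layer {layer:2d}  xavier {xavier[layer - 1]:.2e}  he {he[layer - 1]:.2e}")
```

with the 1/n rule the numbers should fall by roughly half per layer, landing around 1e-6 by layer 20, while with the 2/n rule they should hover around 1. at finite width they wobble a bit from layer to layer instead of sitting at exactly 1.0, since the cancellation only holds in expectation.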