Day 13 - Glorot & He Initialization
-
To alleviate the problem of unstable gradients, the signal needs to
flow properly in both directions:
- in the forward direction when making predictions.
- in the reverse direction when backpropagating gradients.
-
For the signal to flow properly:
- The variance of the outputs of each layer should be equal to the variance of its inputs.
- Gradients should have equal variance before and after flowing through each layer in the reverse direction.
- It is not possible to guarantee both of the above, unless a layer has an equal number of inputs (fan-in) and neurons (fan-out).
- In practice, using one of the following strategies to initialize the weights of each layer helps:
Weight initialization strategies
Activation function used | Name | Strategy |
---|---|---|
Logistic, Softmax, Tanh, No activation function (in case of regression) | Glorot | A normal distribution with $\sigma^2 = \frac{1}{fan_{avg}}$, or a uniform distribution between $-r$ and $+r$, with $r = \sqrt{\frac{3}{fan_{avg}}}$, where $fan_{avg} = \frac{fan_{in} + fan_{out}}{2}$ |
ReLU and its variants | He | A normal distribution with $\sigma^2 = \frac{2}{fan_{in}}$ |
SELU | LeCun | A normal distribution with $\sigma^2 = \frac{1}{fan_{in}}$ |
- In all cases above, to use a uniform distribution instead of a normal one, calculate the limit as $r = \sqrt{3\sigma^2}$ (see the sketch below).
- Keras uses Glorot initialization with uniform distribution by default.
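As a quick sanity check (my own sketch, not from the source; the fan-in and fan-out values are arbitrary), the normal and uniform Glorot variants can be drawn by hand with NumPy, and both end up with variance of roughly $1/fan_{avg}$:
```python
import numpy as np

fan_in, fan_out = 100, 50               # arbitrary example layer dimensions
fan_avg = (fan_in + fan_out) / 2

# Glorot, normal variant: zero mean, variance 1 / fan_avg
W_normal = np.random.randn(fan_in, fan_out) * np.sqrt(1.0 / fan_avg)

# Glorot, uniform variant: limit r = sqrt(3 * sigma^2) = sqrt(3 / fan_avg)
r = np.sqrt(3.0 / fan_avg)
W_uniform = np.random.uniform(-r, r, size=(fan_in, fan_out))

print(W_normal.var(), W_uniform.var())  # both ≈ 1 / 75 ≈ 0.0133
```
The Keras default can also be checked directly: inspecting a fresh layer with `keras.layers.Dense(10).kernel_initializer` should show a `GlorotUniform` instance.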
-
To use He initialization with normal distribution:
keras.layers.Dense(30, activation="relu", kernel_initializer="he_normal")
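For context, a minimal sketch of how this is typically wired into a whole model (the layer sizes and 28×28 input are arbitrary examples, not from the source):
```python
from tensorflow import keras

# Sketch: every hidden ReLU layer uses He (normal) initialization;
# the softmax output layer keeps Keras's Glorot-uniform default.
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(300, activation="relu", kernel_initializer="he_normal"),
    keras.layers.Dense(100, activation="relu", kernel_initializer="he_normal"),
    keras.layers.Dense(10, activation="softmax"),
])
```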
-
To use He initialization with a uniform distribution, but based on $fan_{avg}$ instead of $fan_{in}$:
he_avg_init = keras.initializers.VarianceScaling(scale=2., mode="fan_avg", distribution="uniform")
keras.layers.Dense(..., kernel_initializer=he_avg_init)
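A quick way to convince yourself this matches the intended formula (again a sketch of my own, with arbitrary shapes): weights sampled from this initializer should have a variance of roughly $2/fan_{avg}$:
```python
import tensorflow as tf
from tensorflow import keras

he_avg_init = keras.initializers.VarianceScaling(scale=2., mode="fan_avg",
                                                 distribution="uniform")
w = he_avg_init(shape=(100, 50))            # fan_in=100, fan_out=50 (arbitrary)
print(float(tf.math.reduce_variance(w)))    # ≈ 2 / 75 ≈ 0.0267
```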