Day 13 - Glorot & He Initialization
-
To alleviate the problem of unstable gradients, the signal needs to
flow properly in both directions:
- in the forward direction when making predictions.
- in the reverse direction when backpropagating gradients.
-
For the signal to flow properly:
- The variance of the outputs of each layer should be equal to the variance of its inputs.
- Gradients should have equal variance before and after flowing through each layer in the reverse direction.
- It is not possible to guarantee both of the above, unless a layer has an equal number of inputs (fan-in) and neurons (fan-out).
- In practice, using one of the following strategies to initialize the weights of each layer helps:
Weight initialization strategies
Activation function used | Name | Strategy |
---|---|---|
Logistic, Softmax, Tanh, No activation function (in case of regression) | Glorot | A normal distribution with $\sigma^2 = \frac{1}{fan_{avg}}$, or a uniform distribution between $-r$ and $+r$, with $r = \sqrt{\frac{3}{fan_{avg}}}$, where $fan_{avg} = \frac{fan_{in} + fan_{out}}{2}$ |
ReLU and its variants | He | A normal distribution with $\sigma^2 = \frac{2}{fan_{in}}$ |
SELU | LeCun | A normal distribution with $\sigma^2 = \frac{1}{fan_{in}}$ |
- In all cases above, to use a uniform distribution instead of a normal one, calculate the limit as $r = \sqrt{3\sigma^2}$ (see the sketch below).
- Keras uses Glorot initialization with uniform distribution by default.
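As a quick sanity check (my own sketch, not from the source; the fan-in and fan-out values are arbitrary), the normal and uniform Glorot variants can be drawn by hand with NumPy, and both end up with variance of roughly $1/fan_{avg}$:
```python
import numpy as np

fan_in, fan_out = 100, 50               # arbitrary example layer dimensions
fan_avg = (fan_in + fan_out) / 2

# Glorot, normal variant: zero mean, variance 1 / fan_avg
W_normal = np.random.randn(fan_in, fan_out) * np.sqrt(1.0 / fan_avg)

# Glorot, uniform variant: limit r = sqrt(3 * sigma^2) = sqrt(3 / fan_avg)
r = np.sqrt(3.0 / fan_avg)
W_uniform = np.random.uniform(-r, r, size=(fan_in, fan_out))

print(W_normal.var(), W_uniform.var())  # both ≈ 1 / 75 ≈ 0.0133
```
The Keras default can also be checked directly: inspecting a fresh layer with `keras.layers.Dense(10).kernel_initializer` should show a `GlorotUniform` instance.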
-
To use He initialization with normal distribution:
keras.layers.Dense(30, activation="relu", kernel_initializer="he_normal")
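For context, a minimal sketch of how this is typically wired into a whole model (the layer sizes and 28×28 input are arbitrary examples, not from the source):
```python
from tensorflow import keras

# Sketch: every hidden ReLU layer uses He (normal) initialization;
# the softmax output layer keeps Keras's Glorot-uniform default.
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(300, activation="relu", kernel_initializer="he_normal"),
    keras.layers.Dense(100, activation="relu", kernel_initializer="he_normal"),
    keras.layers.Dense(10, activation="softmax"),
])
```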
-
To use He initialization with a uniform distribution, but based on $fan_{avg}$ instead of $fan_{in}$:
he_avg_init = keras.initializers.VarianceScaling(scale=2., mode="fan_avg", distribution="uniform")
keras.layers.Dense(..., kernel_initializer=he_avg_init)
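A quick way to convince yourself this matches the intended formula (again a sketch of my own, with arbitrary shapes): weights sampled from this initializer should have a variance of roughly $2/fan_{avg}$:
```python
import tensorflow as tf
from tensorflow import keras

he_avg_init = keras.initializers.VarianceScaling(scale=2., mode="fan_avg",
                                                 distribution="uniform")
w = he_avg_init(shape=(100, 50))            # fan_in=100, fan_out=50 (arbitrary)
print(float(tf.math.reduce_variance(w)))    # ≈ 2 / 75 ≈ 0.0267
```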