Day 12 - Vanishing and Exploding Gradients
- Deep neural networks suffer from unstable gradients: different layers learn at widely different speeds because gradients can vanish or explode as they are propagated backward through the network.
The Vanishing Gradients Problem
- In deep feedforward networks, the gradients often get smaller and smaller as backpropagation progresses down to the lower layers.
- As a result, the connection weights of lower layers are left virtually unchanged and training never converges to a good solution.
- Glorot & Bengio [1] showed that using the sigmoid activation function together with weights initialized from a normal distribution (mean 0, standard deviation 1) makes the variance of each layer's outputs greater than the variance of its inputs.
- The variance keeps increasing through the layers until the activation function saturates in the top layer.
- With the sigmoid function, the output saturates at 0 (for large negative inputs) and at 1 (for large positive inputs), where the derivative is very close to zero. As a result, only a tiny gradient propagates back through the network, and it keeps getting diluted at each layer (illustrated in the sketch below).
- The fact that the sigmoid function has a mean of 0.5 rather than 0 makes things worse; the hyperbolic tangent function, which has a mean of 0, behaves slightly better in deep networks.
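A minimal NumPy sketch of this effect, under assumed toy dimensions (10 layers of width 100, not from the paper): with N(0, 1) weights the pre-activation variance blows up within a layer or two, the sigmoid outputs pile up near 0 and 1, and the per-layer derivative factor that backpropagation multiplies by collapses toward zero.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_deriv(z):
    s = sigmoid(z)
    return s * (1.0 - s)

rng = np.random.default_rng(0)
n_layers, width = 10, 100  # arbitrary toy sizes for illustration

# Forward pass through a toy network with sigmoid activations and
# N(0, 1) weight initialization, mirroring the setup analyzed in [1].
x = rng.normal(size=width)
activations = [x]
pre_activations = []
for _ in range(n_layers):
    W = rng.normal(0.0, 1.0, size=(width, width))
    z = W @ activations[-1]
    pre_activations.append(z)
    activations.append(sigmoid(z))

# Pre-activation variance explodes, outputs saturate near 0/1, and the
# derivative factor that backprop multiplies by shrinks toward zero.
for i, (z, a) in enumerate(zip(pre_activations, activations[1:]), start=1):
    print(f"layer {i:2d}: var(z) = {z.var():8.1f}, "
          f"var(output) = {a.var():.3f}, "
          f"mean sigmoid'(z) = {sigmoid_deriv(z).mean():.2e}")
```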
The Exploding Gradients Problem
- In recurrent neural networks, the gradients can instead grow bigger and bigger as they are propagated back through the time steps.
- As a result, the layers receive very large weight updates and the algorithm diverges (see the sketch below).
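A simplified sketch of why this happens, under assumed values (hidden size 50, 30 time steps, a recurrent weight matrix with spectral radius around 1.5, and the activation's derivative ignored): backpropagation through time multiplies the gradient by the same recurrent weight matrix once per step, so the gradient norm grows roughly geometrically.

```python
import numpy as np

rng = np.random.default_rng(42)
hidden, timesteps = 50, 30  # arbitrary toy sizes

# Hypothetical recurrent weight matrix; the scale is chosen so its spectral
# radius is roughly 1.5, i.e. slightly "too large".
W_rec = rng.normal(0.0, 1.5 / np.sqrt(hidden), size=(hidden, hidden))

# Backprop through time applies W_rec^T once per time step, so the gradient
# norm is amplified by roughly the spectral radius at every step.
grad = rng.normal(size=hidden)
for t in range(1, timesteps + 1):
    grad = W_rec.T @ grad
    if t % 5 == 0:
        print(f"after {t:2d} time steps: ||grad|| = {np.linalg.norm(grad):.3e}")
```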
Methods to Deal With Unstable Gradients
- Glorot & He initialization
- Use non-saturating activation functions (e.g., ReLU and its variants)
- Batch normalization
- Gradient clipping
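A minimal tf.keras sketch combining these techniques; the architecture, layer sizes, and hyperparameters are arbitrary choices for illustration. Glorot initialization is the Keras default for Dense layers, so the example shows He initialization explicitly, pairs it with ReLU, adds batch normalization layers, and clips the gradient norm in the optimizer.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28)),
    tf.keras.layers.Flatten(),
    # He initialization paired with a non-saturating activation (ReLU).
    tf.keras.layers.Dense(300, kernel_initializer="he_normal", activation="relu"),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(100, kernel_initializer="he_normal", activation="relu"),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# Gradient clipping: rescale any gradient whose norm exceeds 1.0 before the update.
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, clipnorm=1.0)
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=optimizer,
              metrics=["accuracy"])
```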
References
[1] Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. Journal of Machine Learning Research - Proceedings Track, 9, 249-256.