Day 12 - Vanishing and Exploding Gradients
- Deep neural networks suffer from unstable gradients: different layers learn at widely different speeds because gradients can vanish or explode as they are propagated backward through the network.
The Vanishing Gradients Problem
- In deep feedforward networks, the gradients often get smaller and smaller as backpropagation progresses down to the lower layers.
- As a result, the connection weights of lower layers are left virtually unchanged and training never converges to a good solution.
- Glorot & Bengio [1] showed that using the sigmoid activation function together with weights initialized from a normal distribution (mean 0, standard deviation 1) makes the variance of each layer's outputs greater than the variance of its inputs.
- The variance keeps increasing through the layers until the activation function saturates in the top layer.
- With the sigmoid function, the output saturates at 0 (for large negative inputs) and at 1 (for large positive inputs), where the derivative is very close to zero. As a result, only a tiny gradient propagates back through the network, and it keeps getting diluted at each layer (illustrated in the sketch below).
- The fact that the sigmoid function has a mean of 0.5 rather than 0 makes things worse; the hyperbolic tangent function, which has a mean of 0, behaves slightly better in deep networks.
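A minimal NumPy sketch of this effect, under assumed toy dimensions (10 layers of width 100, not from the paper): with N(0, 1) weights the pre-activation variance blows up within a layer or two, the sigmoid outputs pile up near 0 and 1, and the per-layer derivative factor that backpropagation multiplies by collapses toward zero.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_deriv(z):
    s = sigmoid(z)
    return s * (1.0 - s)

rng = np.random.default_rng(0)
n_layers, width = 10, 100  # arbitrary toy sizes for illustration

# Forward pass through a toy network with sigmoid activations and
# N(0, 1) weight initialization, mirroring the setup analyzed in [1].
x = rng.normal(size=width)
activations = [x]
pre_activations = []
for _ in range(n_layers):
    W = rng.normal(0.0, 1.0, size=(width, width))
    z = W @ activations[-1]
    pre_activations.append(z)
    activations.append(sigmoid(z))

# Pre-activation variance explodes, outputs saturate near 0/1, and the
# derivative factor that backprop multiplies by shrinks toward zero.
for i, (z, a) in enumerate(zip(pre_activations, activations[1:]), start=1):
    print(f"layer {i:2d}: var(z) = {z.var():8.1f}, "
          f"var(output) = {a.var():.3f}, "
          f"mean sigmoid'(z) = {sigmoid_deriv(z).mean():.2e}")
```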
The Exploding Gradients Problem
- In recurrent neural networks, the gradients can instead grow bigger and bigger as they are propagated back through the time steps.
- As a result, the layers receive very large weight updates and the algorithm diverges (see the sketch below).
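A simplified sketch of why this happens, under assumed values (hidden size 50, 30 time steps, a recurrent weight matrix with spectral radius around 1.5, and the activation's derivative ignored): backpropagation through time multiplies the gradient by the same recurrent weight matrix once per step, so the gradient norm grows roughly geometrically.

```python
import numpy as np

rng = np.random.default_rng(42)
hidden, timesteps = 50, 30  # arbitrary toy sizes

# Hypothetical recurrent weight matrix; the scale is chosen so its spectral
# radius is roughly 1.5, i.e. slightly "too large".
W_rec = rng.normal(0.0, 1.5 / np.sqrt(hidden), size=(hidden, hidden))

# Backprop through time applies W_rec^T once per time step, so the gradient
# norm is amplified by roughly the spectral radius at every step.
grad = rng.normal(size=hidden)
for t in range(1, timesteps + 1):
    grad = W_rec.T @ grad
    if t % 5 == 0:
        print(f"after {t:2d} time steps: ||grad|| = {np.linalg.norm(grad):.3e}")
```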
Methods to Deal With Unstable Gradients
- Glorot & He initialization
- Use non-saturating activation functions (e.g., ReLU and its variants)
- Batch normalization
- Gradient clipping
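A minimal tf.keras sketch combining these techniques; the architecture, layer sizes, and hyperparameters are arbitrary choices for illustration. Glorot initialization is the Keras default for Dense layers, so the example shows He initialization explicitly, pairs it with ReLU, adds batch normalization layers, and clips the gradient norm in the optimizer.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28)),
    tf.keras.layers.Flatten(),
    # He initialization paired with a non-saturating activation (ReLU).
    tf.keras.layers.Dense(300, kernel_initializer="he_normal", activation="relu"),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(100, kernel_initializer="he_normal", activation="relu"),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# Gradient clipping: rescale any gradient whose norm exceeds 1.0 before the update.
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, clipnorm=1.0)
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=optimizer,
              metrics=["accuracy"])
```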
References
[1] Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. Journal of Machine Learning Research - Proceedings Track, 9, 249-256.