- A popular technique to mitigate the exploding gradients problem.
- During backpropagation, gradients are clipped so that they never exceed a given threshold (see the sketch after this list).
- The threshold is a hyperparameter that can be tuned.
- Used mostly in RNNs, where batch normalization is tricky to apply; for most other types of networks, batch normalization is usually sufficient.
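A minimal sketch of what clipping means inside a custom training loop, assuming a small TensorFlow/Keras model; the `train_step` helper and the threshold value are illustrative. The Keras optimizer arguments shown in the next section do the same thing for you.

```python
import tensorflow as tf
from tensorflow import keras

# Assumed setup: a tiny model, loss, and optimizer purely for illustration.
model = keras.Sequential([keras.layers.Dense(1)])
loss_fn = keras.losses.MeanSquaredError()
optimizer = keras.optimizers.SGD(learning_rate=1e-3)

@tf.function
def train_step(x, y, clip_threshold=1.0):
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    # Clip every gradient component into [-clip_threshold, clip_threshold]
    # before the optimizer applies the update.
    clipped = [tf.clip_by_value(g, -clip_threshold, clip_threshold) for g in grads]
    optimizer.apply_gradients(zip(clipped, model.trainable_variables))
    return loss
```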
Using Gradient Clipping in Keras
from tensorflow import keras

optimizer = keras.optimizers.SGD(clipvalue=1.0)  # clip each gradient component to [-1.0, 1.0]
model.compile(loss="mse", optimizer=optimizer)
- Every component of the gradient vector will be clipped to the range [-1.0, 1.0].
- This can change the orientation of the gradient vector; nevertheless, this approach works well in practice.
- To retain the direction of the gradient vector, use `clipnorm` instead of `clipvalue`.
- When using `clipnorm`, the whole gradient vector is clipped if its ℓ2 norm is greater than the specified threshold (see the comparison after this list).
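A small comparison, assuming TensorFlow is available, of why clipping by value can change the gradient's direction while clipping by norm only rescales it; the gradient values here are made up for illustration.

```python
import tensorflow as tf

grad = tf.constant([0.9, 100.0])

# Clip by value: each component is forced into [-1.0, 1.0].
by_value = tf.clip_by_value(grad, -1.0, 1.0)  # -> [0.9, 1.0]

# Clip by norm: the whole vector is rescaled so its L2 norm is at most 1.0.
by_norm = tf.clip_by_norm(grad, 1.0)          # -> roughly [0.009, 1.0]

print(by_value.numpy())  # points in a very different direction than grad
print(by_norm.numpy())   # same direction as grad, just much shorter
```

In Keras, the corresponding optimizer arguments are `clipvalue=1.0` and `clipnorm=1.0`.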
Gradient Clipping in Practice
- Track the size of the gradients during training, for example with TensorBoard (see the logging sketch after this list).
- If the gradients are exploding, try clipping by value and clipping by norm.
- Try both approaches with different thresholds and see which one performs best on the validation set (see the comparison sketch after this list).
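One way to track gradient sizes, sketched below assuming a custom training loop with `tf.summary`; the log directory and step counter are illustrative choices, not part of the original notes.

```python
import tensorflow as tf
from tensorflow import keras

model = keras.Sequential([keras.layers.Dense(1)])
loss_fn = keras.losses.MeanSquaredError()
optimizer = keras.optimizers.SGD(learning_rate=1e-3)

# Hypothetical log directory; point TensorBoard at it with `tensorboard --logdir logs`.
writer = tf.summary.create_file_writer("logs/gradients")

def train_step(x, y, step):
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    with writer.as_default():
        # Log the global L2 norm of all gradients; exploding gradients show up
        # as a growing curve in TensorBoard.
        tf.summary.scalar("grad_global_norm", tf.linalg.global_norm(grads), step=step)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```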
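A rough sketch of comparing the two clipping flavours on a validation set; the thresholds, the dummy data, and the `build_model` helper are assumptions made for illustration only.

```python
import numpy as np
from tensorflow import keras

# Dummy data purely for illustration; substitute your own dataset.
X_train, y_train = np.random.rand(100, 8), np.random.rand(100, 1)
X_valid, y_valid = np.random.rand(20, 8), np.random.rand(20, 1)

def build_model():
    # Placeholder architecture; substitute your own network.
    return keras.Sequential([keras.layers.Dense(30, activation="relu"),
                             keras.layers.Dense(1)])

candidates = [
    {"clipvalue": 0.5}, {"clipvalue": 1.0},  # clip by value at different thresholds
    {"clipnorm": 0.5},  {"clipnorm": 1.0},   # clip by norm at different thresholds
]

results = {}
for kwargs in candidates:
    model = build_model()
    model.compile(loss="mse", optimizer=keras.optimizers.SGD(**kwargs))
    history = model.fit(X_train, y_train, epochs=10,
                        validation_data=(X_valid, y_valid), verbose=0)
    results[str(kwargs)] = history.history["val_loss"][-1]

print(results)  # pick the configuration with the lowest validation loss
```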