Day 22 - Preventing Overfitting using Regularization
Early stopping and batch normalization layers act as regularizers already.
L1 and L2 Regularization
- L1 regularization helps obtain a sparse model (most weights equal to 0).
- L2 regularization helps constrain network weights.
- Regularizers can be set in Keras using the kernel_regularizer property (see the sketch after this list).
- L1 regularizer - keras.regularizers.l1()
- L2 regularizer - keras.regularizers.l2()
- L1 and L2 regularizer - keras.regularizers.l1_l2()
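A minimal Keras sketch of attaching a regularizer to a layer; the layer size and the 0.01 regularization factor are placeholder values, not from the notes:

```python
import tensorflow as tf
from tensorflow import keras

# A hidden layer whose weights are penalized with L2 regularization.
layer = keras.layers.Dense(
    100,
    activation="relu",
    kernel_initializer="he_normal",
    kernel_regularizer=keras.regularizers.l2(0.01),
)

# l1() and l1_l2() are used the same way:
# keras.regularizers.l1(0.01)
# keras.regularizers.l1_l2(l1=0.01, l2=0.01)
```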
Dropout
- At every training step, every neuron apart from the output neurons has a probability of being ignored during that step.
- The hyperparameter is called the dropout rate and is typically set to 10% - 50%.
- Neurons trained with dropout cannot co-adapt with their neighbor neurons and have to be as useful as possible on their own.
- Since each of the N neurons in a network can be either present or absent, a total of 2^N different networks are possible. The resulting network can be thought of as an averaging ensemble of all these networks.
- In practice, dropout is only applied to the top one to three layers (excluding the output layer).
- After training, each input connection weight needs to be multiplied by the keep probability (1 - dropout rate) to compensate for the input signal being larger (due to the absence of dropout) at test time.
- Alternatively, each neuron's output can be divided by the keep probability during training.
- To regularize self-normalizing networks based on the SELU activation function, use alpha dropout which preserves the mean and standard deviation of its inputs.
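A minimal Keras sketch of the above; the dropout rate of 0.2, layer sizes, and input shape are placeholder values. Note that Keras's Dropout layer already rescales the surviving activations during training, so no manual weight scaling is needed afterwards:

```python
import tensorflow as tf
from tensorflow import keras

# Illustrative model with a Dropout layer before each Dense layer.
model = keras.Sequential([
    keras.layers.Input(shape=[28, 28]),      # placeholder input shape
    keras.layers.Flatten(),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(300, activation="relu"),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(100, activation="relu"),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(10, activation="softmax"),
])

# For SELU-based self-normalizing nets, swap Dropout for
# keras.layers.AlphaDropout(rate=0.2), which preserves the
# mean and standard deviation of its inputs.
```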
Monte Carlo (MC) Dropout
- Boosts the performance of any trained dropout model without having to retrain from scratch.
- Gives a much better measure of the model's uncertainty.
- Make predictions over the test set with training mode active (so that the Dropout layer is active) and average all the predictions.
- Averaging over multiple predictions gives an estimate that is generally more reliable than a single prediction with dropout turned off.
- The number of MC samples used, N, is a hyperparameter. Larger N ⇒ more accurate predictions but higher inference time, and vice versa.
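A minimal sketch of MC Dropout, assuming a trained Keras model named model that contains Dropout layers and a test set X_test (both hypothetical names); 100 is a placeholder value for N:

```python
import numpy as np

# Calling the model with training=True keeps the Dropout layers active,
# so each forward pass samples a different "thinned" network.
y_probas = np.stack([model(X_test, training=True) for _ in range(100)])

y_proba = y_probas.mean(axis=0)   # averaged class probabilities
y_std = y_probas.std(axis=0)      # rough per-class uncertainty estimate
y_pred = y_proba.argmax(axis=1)   # final predictions
```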
Max-Norm Regularization
- For each neuron, constrain the weights of its incoming connections such that their L2 norm is at most the max-norm hyperparameter, r.
- Does not add a regularization loss term to the overall loss function. Instead, it is used after each training step to rescale the weights.
- Can be applied to weights using the kernel_constraint parameter, and to biases using the bias_constraint parameter (see the sketch below).
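A minimal Keras sketch of a max-norm constraint on a layer; the layer size and the value 1.0 for r are placeholder choices:

```python
import tensorflow as tf
from tensorflow import keras

# Constrain each neuron's incoming weight vector to an L2 norm of at most 1.0.
# The weights are rescaled after each training step if they exceed this limit.
layer = keras.layers.Dense(
    100,
    activation="relu",
    kernel_initializer="he_normal",
    kernel_constraint=keras.constraints.max_norm(1.0),
)
```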