Day 22 - Preventing Overfitting using Regularization
Early stopping and batch normalization layers act as regularizers already.
L1 and L2 Regularization
- L1 regularization helps obtain a sparse model (most weights equal to 0).
- L2 regularization helps constrain network weights.
- Regularizers can be set in Keras using the kernel_regularizer property (see the sketch after this list).
- L1 regularizer - keras.regularizers.l1()
- L2 regularizer - keras.regularizers.l2()
- L1 and L2 regularizer - keras.regularizers.l1_l2()
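A minimal Keras sketch of attaching a regularizer to a layer; the layer size and the 0.01 regularization factor are placeholder values, not from the notes:

```python
import tensorflow as tf
from tensorflow import keras

# A hidden layer whose weights are penalized with L2 regularization.
layer = keras.layers.Dense(
    100,
    activation="relu",
    kernel_initializer="he_normal",
    kernel_regularizer=keras.regularizers.l2(0.01),
)

# l1() and l1_l2() are used the same way:
# keras.regularizers.l1(0.01)
# keras.regularizers.l1_l2(l1=0.01, l2=0.01)
```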
Dropout
- At every training step, every neuron apart from the output neurons has a probability of being ignored during that step.
- The hyperparameter is called the dropout rate and is typically set to 10% - 50%.
- Neurons trained with dropout cannot co-adapt with their neighbor neurons and have to be as useful as possible on their own.
- Since each of the N neurons in a network can be either present or absent, a total of 2^N different networks are possible. The resulting network can be thought of as an averaging ensemble of all these networks.
- In practice, dropout is only applied to the top one to three layers (excluding the output layer).
- After training, each input connection weight needs to be multiplied by the keep probability (1 - dropout rate) to compensate for the input signal being larger (due to the absence of dropout) at test time.
- Alternatively, each neuron's output can be divided by the keep probability during training.
- To regularize self-normalizing networks based on the SELU activation function, use alpha dropout which preserves the mean and standard deviation of its inputs.
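A minimal Keras sketch of the above; the dropout rate of 0.2, layer sizes, and input shape are placeholder values. Note that Keras's Dropout layer already rescales the surviving activations during training, so no manual weight scaling is needed afterwards:

```python
import tensorflow as tf
from tensorflow import keras

# Illustrative model with a Dropout layer before each Dense layer.
model = keras.Sequential([
    keras.layers.Input(shape=[28, 28]),      # placeholder input shape
    keras.layers.Flatten(),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(300, activation="relu"),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(100, activation="relu"),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(10, activation="softmax"),
])

# For SELU-based self-normalizing nets, swap Dropout for
# keras.layers.AlphaDropout(rate=0.2), which preserves the
# mean and standard deviation of its inputs.
```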
Monte Carlo (MC) Dropout
- Boosts the performance of any trained dropout model without having to retrain from scratch.
- Gives a much better measure of the model's uncertainty.
- Make predictions over the test set with training mode active (so that the Dropout layer is active) and average all the predictions.
- Averaging over multiple predictions gives an estimate that is generally more reliable than a single prediction with dropout turned off.
- The number of MC samples used, N, is a hyperparameter. Larger N ⇒ more accurate predictions but higher inference time, and vice versa.
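A minimal sketch of MC Dropout, assuming a trained Keras model named model that contains Dropout layers and a test set X_test (both hypothetical names); 100 is a placeholder value for N:

```python
import numpy as np

# Calling the model with training=True keeps the Dropout layers active,
# so each forward pass samples a different "thinned" network.
y_probas = np.stack([model(X_test, training=True) for _ in range(100)])

y_proba = y_probas.mean(axis=0)   # averaged class probabilities
y_std = y_probas.std(axis=0)      # rough per-class uncertainty estimate
y_pred = y_proba.argmax(axis=1)   # final predictions
```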
Max-Norm Regularization
- For each neuron, constrain the weights of its incoming connections such that their L2 norm is at most the max-norm hyperparameter, r.
- Does not add a regularization loss term to the overall loss function. Instead, it is used after each training step to rescale the weights.
- Can be applied to weights using the kernel_constraint parameter, and to biases using the bias_constraint parameter (see the sketch below).
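A minimal Keras sketch of a max-norm constraint on a layer; the layer size and the value 1.0 for r are placeholder choices:

```python
import tensorflow as tf
from tensorflow import keras

# Constrain each neuron's incoming weight vector to an L2 norm of at most 1.0.
# The weights are rescaled after each training step if they exceed this limit.
layer = keras.layers.Dense(
    100,
    activation="relu",
    kernel_initializer="he_normal",
    kernel_constraint=keras.constraints.max_norm(1.0),
)
```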