Day 14 - Nonsaturating Activation Functions
- A poor choice of activation function is one of the causes of unstable gradients. Several other activation functions perform much better than the sigmoid.
Rectified Linear Unit (ReLU)
- ReLU(z) = max(0, z)
- Advantages: fast to compute, does not saturate for positive values.
- Disadvantage: suffers from dying ReLUs.
Dying ReLUs
- Neurons that always output 0.
- Can happen when the learning rate is too large.
- Occurs when a neuron's weights are tweaked such that the weighted sum of its inputs is negative for every instance in the training set. As a result, the ReLU's gradient is zero and gradient descent no longer affects the neuron (see the sketch below).
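A minimal sketch of the effect (the weights and the single input instance are made up for illustration): once the weighted sum is negative, the ReLU outputs 0 and its gradient with respect to the weights is 0, so gradient descent cannot revive the neuron.

import tensorflow as tf

w = tf.Variable([[-2.0], [-3.0]])   # hypothetical weights of one neuron
x = tf.constant([[1.0, 1.0]])       # one training instance
with tf.GradientTape() as tape:
    z = x @ w                       # weighted sum of inputs: -5.0 (negative)
    output = tf.nn.relu(z)          # ReLU output: 0.0
print(tape.gradient(output, w))     # [[0.], [0.]] -> no update, the neuron stays dead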
Using ReLU in Keras
keras.layers.Dense(50, activation="relu", kernel_initializer="he_normal")
Leaky ReLU
- LeakyReLU_α(z) = max(αz, z)
- There is a small slope (α) when z < 0, so neurons never die. Training can slow down while the weighted sum of inputs is less than 0, but it never completely stops.
- In practice, a higher value of α (e.g. 0.2 instead of 0.01) seems to result in better performance. Keras uses a default value of α = 0.3.
Using Leaky ReLU in Keras
...
keras.layers.Dense(10, kernel_initializer="he_normal"),
keras.layers.LeakyReLU(alpha=0.2),
...
Randomized Leaky ReLU (RReLU)
- α is picked randomly from a given range during training and fixed to an average value during testing.
- RReLU seems to act as a regularizer, reducing the risk of overfitting the training set.
Using RReLU in Keras
- There is no official implementation of RReLU in Keras.
- A custom layer can be created to implement it; a sketch follows below.
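One possible sketch of such a custom layer (the class name, argument names, and the default sampling range [1/8, 1/3] are assumptions, not an official Keras API):

import tensorflow as tf
from tensorflow import keras

class RReLU(keras.layers.Layer):
    def __init__(self, lower=1/8, upper=1/3, **kwargs):
        super().__init__(**kwargs)
        self.lower = lower
        self.upper = upper

    def call(self, inputs, training=None):
        if training:
            # Training: sample a random slope for the negative part.
            alpha = tf.random.uniform(tf.shape(inputs), self.lower, self.upper)
        else:
            # Testing: use the fixed average slope.
            alpha = (self.lower + self.upper) / 2
        return tf.where(inputs >= 0, inputs, alpha * inputs)

It can then be placed after a Dense layer, just like the LeakyReLU layer above.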
Parameterized Leaky ReLU (PReLU)
- α is learned via backpropagation during training instead of being a hyperparameter.
- Outperforms ReLU on large image datasets.
- Runs risk of overfitting on smaller datasets.
Using PReLU in Keras
...
keras.layers.Dense(10, kernel_initializer="he_normal"),
keras.layers.PReLU(),
...
Exponential Linear Unit (ELU)
- Takes on negative values when z < 0, so the average output is closer to 0. This helps alleviate the vanishing gradient problem.
- There is also a non-zero gradient when z < 0, which prevents dead neurons.
- α defines the value the function approaches when z is a large negative number (the ELU approaches −α; see the definition sketched below).
- α is usually set to 1, but it can be tuned like any other hyperparameter.
- When α = 1, the function is smooth everywhere (including around z = 0). This helps speed up gradient descent, since it does not bounce as much to the left and right of z = 0.
- Advantages: reduced training time and better performance on the test set.
- Disadvantage: slower to compute than ReLU. Faster convergence during training compensates for this, but an ELU network will still be slower than a ReLU network at test time.
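For reference, a short NumPy sketch of the ELU definition the bullets above refer to (with α defaulting to 1):

import numpy as np

def elu(z, alpha=1.0):
    # z if z >= 0, alpha * (exp(z) - 1) if z < 0
    return np.where(z >= 0, z, alpha * (np.exp(z) - 1))

print(elu(np.array([-10.0, -1.0, 0.0, 2.0])))  # output approaches -alpha as z becomes very negative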
Using ELU in Keras
...
keras.layers.Dense(10, kernel_initializer="he_normal"),
keras.layers.ELU(),
...
Scaled Exponential Linear Unit (SELU)
- Makes the network self-normalize (i.e. every layer's output maintains a mean of 0 and a standard deviation of 1) and therefore not suffer from unstable gradients, IF:
- Input features are standardized (mean 0, standard deviation 1).
- Network architecture is sequential (no recurrent or skip connections).
- All layers are dense and initialized with LeCun normal initialization.
- Research has shown SELU can improve performance in CNNs as well.
Using SELU in Keras
keras.layers.Dense(10, activation="selu", kernel_initializer="lecun_normal")
Rules of thumb
- Order of preference of activation functions:
  - SELU (if conditions for self-normalization are met)
  - ELU
  - Leaky ReLU (and its variants)
  - ReLU
  - Hyperbolic tangent
  - Logistic
- For low runtime latency, use Leaky ReLU.
- If extra time and computing power are available, use cross-validation to evaluate:
  - RReLU if the network is overfitting
  - PReLU if a large training set is available