Day 14 - Nonsaturating Activation Functions
- One of the reasons for unstable gradients is a poor choice of activation function. There are other activation functions that perform much better than the Sigmoid.
Rectified Linear Unit (ReLU)
- ReLU(z) = max(0, z)
- Advantages: fast to compute, does not saturate for positive values.
- Disadvantages: suffers from dying ReLUs.
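- As a quick illustration (a plain NumPy sketch, not how Keras computes it):
import numpy as np

def relu(z):
    # max(0, z): passes positive values through, zeroes out negative ones
    return np.maximum(0, z)

relu(np.array([-2.0, 0.0, 3.0]))  # -> array([0., 0., 3.])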
Dying ReLUs
- ReLUs that always output 0.
- Can happen with a large learning rate.
- Occurs when weights are tweaked so that the weighted sum of inputs is negative for all instances in the training set. As a result, the gradient is zero and gradient descent has no effect.
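- A toy sketch of why a dead unit stops learning (assuming TensorFlow; the values are made up): wherever the pre-activation z is negative, ReLU outputs 0 and its gradient is 0, so no learning signal reaches the unit's weights.
import tensorflow as tf

z = tf.Variable([-3.0, -1.0, 2.0])   # hypothetical pre-activations of one unit
with tf.GradientTape() as tape:
    out = tf.nn.relu(z)              # [0., 0., 2.]
grads = tape.gradient(out, z)        # [0., 0., 1.] -> zero gradient where z < 0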
Using ReLU in Keras
keras.layers.Dense(50, activation="relu", kernel_initializer="he_normal")
Leaky ReLU
- LeakyReLU_α(z) = max(αz, z)
- There is a small slope α when z < 0, so neurons never die. Training can slow down if the sum of inputs is less than 0, but it never completely stops.
- In practice, a higher value of α results in better performance. Keras uses a default value of α = 0.3.
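- The function itself, as a rough NumPy sketch (default α chosen to match Keras):
import numpy as np

def leaky_relu(z, alpha=0.3):
    # keeps a small slope alpha for z < 0 instead of flattening to 0
    return np.where(z < 0, alpha * z, z)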
Using Leaky ReLU in Keras
...
keras.layers.Dense(10, kernel_initializer="he_normal"),
keras.layers.LeakyReLU(alpha=0.2),
...
Randomized Leaky ReLU (RReLU)
- α is picked randomly from a given range during training, and fixed to an average value during testing.
- RReLU seems to act like a regularizer, reducing the risk of overfitting on the training set.
Using RReLU in Keras
- There is no official implementation in Keras for RReLU.
- A custom layer can be created to implement it.
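- One possible sketch of such a custom layer (not an official Keras API; the sampling range [1/8, 1/3] is an assumption and could be tuned). It would be used after a Dense layer, just like keras.layers.LeakyReLU above.
import tensorflow as tf
from tensorflow import keras

class RReLU(keras.layers.Layer):
    def __init__(self, lower=0.125, upper=1/3, **kwargs):
        super().__init__(**kwargs)
        self.lower, self.upper = lower, upper   # assumed sampling range for alpha

    def call(self, inputs, training=None):
        if training:
            # training: alpha drawn uniformly at random for each activation
            alpha = tf.random.uniform(tf.shape(inputs), self.lower, self.upper)
        else:
            # testing: alpha fixed to the average of the range
            alpha = (self.lower + self.upper) / 2
        return tf.where(inputs < 0, alpha * inputs, inputs)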
Parameterized Leaky ReLU (PReLU)
- α is learned via backpropagation during training instead of being a hyperparameter.
- Outperforms ReLU on large image datasets.
- Runs risk of overfitting on smaller datasets.
Using PReLU in Keras
...
keras.layers.Dense(10, kernel_initializer="he_normal"),
keras.layers.PReLU(),
...
Exponential Linear Unit (ELU)
- Takes on negative values when z < 0, so the average output is closer to 0. This helps alleviate the vanishing gradient problem.
- There is also a non-zero gradient when z < 0 to prevent dead neurons.
- α defines the value the function approaches when z is a large negative number.
- Usually α is set to 1, but it can be tuned like any other hyperparameter.
- When α = 1, the function is smooth everywhere (including around z = 0). This helps speed up gradient descent since the function doesn't bounce much to the left and right of z = 0.
- Advantages: reduced training time and better performance on test set.
- Disadvantage: slower to compute than ReLU. Faster convergence during training compensates for this, but it will still be slower than ReLU at test time.
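- The formula as a rough NumPy sketch (α scales how far the negative branch can go):
import numpy as np

def elu(z, alpha=1.0):
    # linear for z >= 0; smoothly approaches -alpha as z becomes very negative
    return np.where(z < 0, alpha * (np.exp(z) - 1), z)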
Using ELU in Keras
...
keras.layers.Dense(10, kernel_initializer="he_normal"),
keras.layers.ELU(),
...
Scaled Exponential Linear Unit (SELU)
- Makes the network self-normalize (i.e. every layer's output maintains a mean of 0 and a standard deviation of 1) and therefore not suffer from unstable gradients IF:
    - Input features are standardized (mean 0, standard deviation 1).
    - Network architecture is sequential (no recurrent or skip connections).
    - All layers are dense and initialized with LeCun normal initialization.
- Research has shown SELU can improve performance in CNNs as well.
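- A minimal sketch of a model that meets those conditions (the 28×28 input shape and layer sizes are just assumptions; inputs are presumed standardized beforehand):
from tensorflow import keras

# plain sequential stack: Dense layers only, SELU activation, LeCun normal init
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(300, activation="selu", kernel_initializer="lecun_normal"),
    keras.layers.Dense(100, activation="selu", kernel_initializer="lecun_normal"),
    keras.layers.Dense(10, activation="softmax"),
])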
Using SELU in Keras
keras.layers.Dense(10, activation="selu", kernel_initializer="lecun_normal")
Rules of thumb
- Order of preference of activation functions:
    - SELU (if conditions for self-normalization are met)
    - ELU
    - Leaky ReLU (and its variants)
    - ReLU
    - Hyperbolic Tangent
    - Logistic
- For low runtime latency, use Leaky ReLU
- If extra time and computing power are available, use cross-validation to evaluate:
    - RReLU if network is overfitting
    - PReLU if a large training set is available