Day 14 - Nonsaturating Activation Functions
- A poor choice of activation function is one of the causes of unstable gradients. Several other activation functions perform much better than the sigmoid.
Rectified Linear Unit (ReLU)
- ReLU(z) = max(0, z)
- Advantages: fast to compute, does not saturate for positive values.
- Disadvantage: suffers from dying ReLUs.
Dying ReLUs
- Neurons that always output 0.
- Can happen when the learning rate is too large.
- Occurs when a neuron's weights are tweaked such that the weighted sum of its inputs is negative for every instance in the training set. As a result, the ReLU's gradient is zero and gradient descent no longer affects the neuron (see the sketch below).
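A minimal sketch of the effect (the weights and the single input instance are made up for illustration): once the weighted sum is negative, the ReLU outputs 0 and its gradient with respect to the weights is 0, so gradient descent cannot revive the neuron.

import tensorflow as tf

w = tf.Variable([[-2.0], [-3.0]])   # hypothetical weights of one neuron
x = tf.constant([[1.0, 1.0]])       # one training instance
with tf.GradientTape() as tape:
    z = x @ w                       # weighted sum of inputs: -5.0 (negative)
    output = tf.nn.relu(z)          # ReLU output: 0.0
print(tape.gradient(output, w))     # [[0.], [0.]] -> no update, the neuron stays dead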
Using ReLU in Keras
keras.layers.Dense(50, activation="relu", kernel_initializer="he_normal")
Leaky ReLU
- LeakyReLU_α(z) = max(αz, z)
- There is a small slope (α) when z < 0, so neurons never die. Training can slow down while the weighted sum of inputs is less than 0, but it never completely stops.
- In practice, a higher value of α (e.g. 0.2 instead of 0.01) seems to result in better performance. Keras uses a default value of α = 0.3.
Using Leaky ReLU in Keras
...
keras.layers.Dense(10, kernel_initializer="he_normal"),
keras.layers.LeakyReLU(alpha=0.2),
...
Randomized Leaky ReLU (RReLU)
- α is picked randomly from a given range during training and fixed to an average value during testing.
- RReLU seems to act as a regularizer, reducing the risk of overfitting the training set.
Using RReLU in Keras
- There is no official implementation of RReLU in Keras.
- A custom layer can be created to implement it; a sketch follows below.
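One possible sketch of such a custom layer (the class name, argument names, and the default sampling range [1/8, 1/3] are assumptions, not an official Keras API):

import tensorflow as tf
from tensorflow import keras

class RReLU(keras.layers.Layer):
    def __init__(self, lower=1/8, upper=1/3, **kwargs):
        super().__init__(**kwargs)
        self.lower = lower
        self.upper = upper

    def call(self, inputs, training=None):
        if training:
            # Training: sample a random slope for the negative part.
            alpha = tf.random.uniform(tf.shape(inputs), self.lower, self.upper)
        else:
            # Testing: use the fixed average slope.
            alpha = (self.lower + self.upper) / 2
        return tf.where(inputs >= 0, inputs, alpha * inputs)

It can then be placed after a Dense layer, just like the LeakyReLU layer above.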
Parameterized Leaky ReLU (PReLU)
- α is learned via backpropagation during training instead of being a hyperparameter.
- Outperforms ReLU on large image datasets.
- Runs risk of overfitting on smaller datasets.
Using PReLU in Keras
...
keras.layers.Dense(10, kernel_initializer="he_normal"),
keras.layers.PReLU(),
...
Exponential Linear Unit (ELU)
- Takes on negative values when z < 0, so the average output is closer to 0. This helps alleviate the vanishing gradient problem.
- There is also a non-zero gradient when z < 0, which prevents dead neurons.
- α defines the value the function approaches when z is a large negative number (the ELU approaches −α; see the definition sketched below).
- α is usually set to 1, but it can be tuned like any other hyperparameter.
- When α = 1, the function is smooth everywhere (including around z = 0). This helps speed up gradient descent, since it does not bounce as much to the left and right of z = 0.
- Advantages: reduced training time and better performance on the test set.
- Disadvantage: slower to compute than ReLU. Faster convergence during training compensates for this, but an ELU network will still be slower than a ReLU network at test time.
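For reference, a short NumPy sketch of the ELU definition the bullets above refer to (with α defaulting to 1):

import numpy as np

def elu(z, alpha=1.0):
    # z if z >= 0, alpha * (exp(z) - 1) if z < 0
    return np.where(z >= 0, z, alpha * (np.exp(z) - 1))

print(elu(np.array([-10.0, -1.0, 0.0, 2.0])))  # output approaches -alpha as z becomes very negative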
Using ELU in Keras
...
keras.layers.Dense(10, kernel_initializer="he_normal"),
keras.layers.ELU(),
...
Scaled Exponential Linear Unit (SELU)
- Makes the network self-normalize (i.e. every layer's output maintains a mean of 0 and a standard deviation of 1) and therefore not suffer from unstable gradients, IF:
- Input features are standardized (mean 0, standard deviation 1).
- Network architecture is sequential (no recurrent or skip connections).
- All layers are dense and initialized with LeCun normal initialization.
- Research has shown SELU can improve performance in CNNs as well.
Using SELU in Keras
keras.layers.Dense(10, activation="selu", kernel_initializer="lecun_normal")
Rules of thumb
- Order of preference of activation functions:
  - SELU (if conditions for self-normalization are met)
  - ELU
  - Leaky ReLU (and its variants)
  - ReLU
  - Hyperbolic tangent
  - Logistic
- For low runtime latency, use Leaky ReLU.
- If extra time and computing power are available, use cross-validation to evaluate:
  - RReLU if the network is overfitting
  - PReLU if a large training set is available