Day 20 - Adaptive Learning Rates
- Adaptive Learning Rate: the effective learning rate is adjusted automatically as training progresses.
AdaGrad
- The learning rate is decayed faster for steeper dimensions than for dimensions with gentler slopes (see the sketch after this list).
- Advantages
- Points the resulting updates more directly toward the global optimum.
- Requires less tuning of learning rate hyperparameter.
- Disadvantages
- Performs well only on simple quadratic problems but stops too early when training neural networks.
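A minimal NumPy sketch of the AdaGrad update, to make the per-dimension decay concrete (the toy gradient function and all values here are illustrative, not from the notes):

```python
import numpy as np

eta = 0.01       # global learning rate
eps = 1e-10      # smoothing term to avoid division by zero
theta = np.array([1.0, 1.0])   # toy parameter vector
s = np.zeros_like(theta)       # running sum of squared gradients

def grad(theta):
    # Hypothetical quadratic loss, much steeper along dimension 0.
    return np.array([10.0 * theta[0], 1.0 * theta[1]])

for step in range(100):
    g = grad(theta)
    s += g ** 2                           # accumulates forever, so steps keep shrinking
    theta -= eta * g / np.sqrt(s + eps)   # steeper dimensions get a smaller effective learning rate
```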
RMSProp
- Fixes AdaGrad's problem of stopping too early by using an exponentially decaying average of the squared gradients, so only the most recent gradients matter (see the NumPy sketch below).
- ρ (rho) is the decay rate, a hyperparameter that is typically set to 0.9.
- Using RMSProp in Keras
optimizer = keras.optimizers.RMSprop(lr=0.001, rho=0.9)
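The RMSProp accumulator as a minimal NumPy sketch, assuming the same toy setup as the AdaGrad example (the rho and eps values are illustrative):

```python
import numpy as np

eta, rho, eps = 0.001, 0.9, 1e-10
theta = np.array([1.0, 1.0])
s = np.zeros_like(theta)

def grad(theta):
    # Hypothetical gradients, steeper along dimension 0.
    return np.array([10.0 * theta[0], 1.0 * theta[1]])

for step in range(100):
    g = grad(theta)
    s = rho * s + (1 - rho) * g ** 2      # exponentially decaying average of squared gradients
    theta -= eta * g / np.sqrt(s + eps)   # unlike AdaGrad, s no longer grows without bound
```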
Adam
- Adam = Adaptive Moment Estimation.
- Combines the ideas of momentum optimization and RMSProp (see the sketch after the Keras example below).
- Like momentum optimization, keeps track of exponentially decaying average of past gradients.
- Like RMSProp, keeps track of exponentially decaying average of past squared gradients.
- The momentum decay hyperparameter β₁ is typically initialized to 0.9.
- The scaling decay hyperparameter β₂ is often initialized to 0.999.
- Using Adam in Keras
optimizer = keras.optimizers.Adam(lr=0.001, beta_1=0.9, beta_2=0.999)
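One Adam step as a minimal NumPy sketch, showing how the two decaying averages are combined, with bias correction (the toy gradient and values are illustrative):

```python
import numpy as np

eta, beta_1, beta_2, eps = 0.001, 0.9, 0.999, 1e-7
theta = np.array([1.0, 1.0])
m = np.zeros_like(theta)   # decaying average of gradients (momentum-like)
v = np.zeros_like(theta)   # decaying average of squared gradients (RMSProp-like)

def grad(theta):
    # Hypothetical gradients, steeper along dimension 0.
    return np.array([10.0 * theta[0], 1.0 * theta[1]])

for t in range(1, 101):
    g = grad(theta)
    m = beta_1 * m + (1 - beta_1) * g
    v = beta_2 * v + (1 - beta_2) * g ** 2
    m_hat = m / (1 - beta_1 ** t)         # bias correction for the zero-initialized averages
    v_hat = v / (1 - beta_2 ** t)
    theta -= eta * m_hat / (np.sqrt(v_hat) + eps)
```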
AdaMax
- Adam scales down the parameter updates by the ℓ₂ norm of the time-decayed gradients; AdaMax uses the ℓ∞ norm (i.e., the max) instead.
- AdaMax can be more stable than Adam in practice, but it depends on the dataset; in general, Adam performs better.
- Try it out if you experience problems with Adam on some task.
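- Using AdaMax in Keras (a usage sketch; the hyperparameter values here are illustrative)
optimizer = keras.optimizers.Adamax(lr=0.001, beta_1=0.9, beta_2=0.999)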
Nadam
- Adam optimization + Nesterov momentum.
- Converges slightly faster than Adam.
- Nadam generally outperforms Adam but is sometimes outperformed by RMSProp.
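- Using Nadam in Keras (a usage sketch; the hyperparameter values here are illustrative)
optimizer = keras.optimizers.Nadam(lr=0.001, beta_1=0.9, beta_2=0.999)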
Comparison of optimizers
| Optimizer | Convergence speed | Convergence quality |
| --- | --- | --- |
| Vanilla gradient descent | ⭐️ | ⭐️⭐️⭐️ |
| Momentum optimization | ⭐️⭐️ | ⭐️⭐️⭐️ |
| Nesterov Accelerated Gradient | ⭐️⭐️ | ⭐️⭐️⭐️ |
| AdaGrad | ⭐️⭐️⭐️ | ⭐️ (stops too early) |
| RMSProp | ⭐️⭐️⭐️ | ⭐️⭐️ / ⭐️⭐️⭐️ |
| Adam | ⭐️⭐️⭐️ | ⭐️⭐️ / ⭐️⭐️⭐️ |
| Nadam | ⭐️⭐️⭐️ | ⭐️⭐️ / ⭐️⭐️⭐️ |
| AdaMax | ⭐️⭐️⭐️ | ⭐️⭐️ / ⭐️⭐️⭐️ |