Day 20 - Adaptive Learning Rates
- Adaptive Learning Rate: the effective learning rate is adjusted automatically as training progresses.
AdaGrad
- The learning rate is decayed faster for steeper dimensions than for dimensions with gentler slopes (see the sketch after this list).
- Advantages
- Points the resulting updates more directly toward the global optimum.
- Requires less tuning of learning rate hyperparameter.
- Disadvantages
- Performs well only on simple quadratic problems but stops too early when training neural networks.
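A minimal NumPy sketch of the AdaGrad update, to make the per-dimension decay concrete (the toy gradient function and all values here are illustrative, not from the notes):

```python
import numpy as np

eta = 0.01       # global learning rate
eps = 1e-10      # smoothing term to avoid division by zero
theta = np.array([1.0, 1.0])   # toy parameter vector
s = np.zeros_like(theta)       # running sum of squared gradients

def grad(theta):
    # Hypothetical quadratic loss, much steeper along dimension 0.
    return np.array([10.0 * theta[0], 1.0 * theta[1]])

for step in range(100):
    g = grad(theta)
    s += g ** 2                           # accumulates forever, so steps keep shrinking
    theta -= eta * g / np.sqrt(s + eps)   # steeper dimensions get a smaller effective learning rate
```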
RMSProp
- Fixes AdaGrad's problem of stopping too early by using an exponentially decaying average of the squared gradients, so only the most recent gradients matter (see the NumPy sketch below).
- ρ (rho) is the decay rate, a hyperparameter that is typically set to 0.9.
- Using RMSProp in Keras
optimizer = keras.optimizers.RMSprop(lr=0.001, rho=0.9)
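The RMSProp accumulator as a minimal NumPy sketch, assuming the same toy setup as the AdaGrad example (the rho and eps values are illustrative):

```python
import numpy as np

eta, rho, eps = 0.001, 0.9, 1e-10
theta = np.array([1.0, 1.0])
s = np.zeros_like(theta)

def grad(theta):
    # Hypothetical gradients, steeper along dimension 0.
    return np.array([10.0 * theta[0], 1.0 * theta[1]])

for step in range(100):
    g = grad(theta)
    s = rho * s + (1 - rho) * g ** 2      # exponentially decaying average of squared gradients
    theta -= eta * g / np.sqrt(s + eps)   # unlike AdaGrad, s no longer grows without bound
```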
Adam
- Adam = Adaptive Moment Estimation.
- Combines the ideas of momentum optimization and RMSProp (see the sketch after the Keras example below).
- Like momentum optimization, keeps track of exponentially decaying average of past gradients.
- Like RMSProp, keeps track of exponentially decaying average of past squared gradients.
- The momentum decay hyperparameter β₁ is typically initialized to 0.9.
- The scaling decay hyperparameter β₂ is often initialized to 0.999.
- Using Adam in Keras
optimizer = keras.optimizers.Adam(lr=0.001, beta_1=0.9, beta_2=0.999)
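One Adam step as a minimal NumPy sketch, showing how the two decaying averages are combined, with bias correction (the toy gradient and values are illustrative):

```python
import numpy as np

eta, beta_1, beta_2, eps = 0.001, 0.9, 0.999, 1e-7
theta = np.array([1.0, 1.0])
m = np.zeros_like(theta)   # decaying average of gradients (momentum-like)
v = np.zeros_like(theta)   # decaying average of squared gradients (RMSProp-like)

def grad(theta):
    # Hypothetical gradients, steeper along dimension 0.
    return np.array([10.0 * theta[0], 1.0 * theta[1]])

for t in range(1, 101):
    g = grad(theta)
    m = beta_1 * m + (1 - beta_1) * g
    v = beta_2 * v + (1 - beta_2) * g ** 2
    m_hat = m / (1 - beta_1 ** t)         # bias correction for the zero-initialized averages
    v_hat = v / (1 - beta_2 ** t)
    theta -= eta * m_hat / (np.sqrt(v_hat) + eps)
```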
AdaMax
- Adam scales down the parameter updates by the ℓ₂ norm of the time-decayed gradients; AdaMax uses the ℓ∞ norm (i.e., the max) instead.
- AdaMax can be more stable than Adam in practice, but it depends on the dataset; in general, Adam performs better.
- Try it out if you experience problems with Adam on some task.
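- Using AdaMax in Keras (a usage sketch; the hyperparameter values here are illustrative)
optimizer = keras.optimizers.Adamax(lr=0.001, beta_1=0.9, beta_2=0.999)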
Nadam
- Adam optimization + Nesterov momentum.
- Converges slightly faster than Adam.
- Nadam generally outperforms Adam but is sometimes outperformed by RMSProp.
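- Using Nadam in Keras (a usage sketch; the hyperparameter values here are illustrative)
optimizer = keras.optimizers.Nadam(lr=0.001, beta_1=0.9, beta_2=0.999)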
Comparison of optimizers
| Optimizer | Convergence speed | Convergence quality |
| --- | --- | --- |
| Vanilla gradient descent | ⭐️ | ⭐️⭐️⭐️ |
| Momentum optimization | ⭐️⭐️ | ⭐️⭐️⭐️ |
| Nesterov Accelerated Gradient | ⭐️⭐️ | ⭐️⭐️⭐️ |
| AdaGrad | ⭐️⭐️⭐️ | ⭐️ (stops too early) |
| RMSProp | ⭐️⭐️⭐️ | ⭐️⭐️ / ⭐️⭐️⭐️ |
| Adam | ⭐️⭐️⭐️ | ⭐️⭐️ / ⭐️⭐️⭐️ |
| Nadam | ⭐️⭐️⭐️ | ⭐️⭐️ / ⭐️⭐️⭐️ |
| AdaMax | ⭐️⭐️⭐️ | ⭐️⭐️ / ⭐️⭐️⭐️ |