Day 18 - Momentum Optimization
- Regular Gradient Descent takes small, regular steps down the slope, so it takes much longer to reach the bottom.
- Momentum optimization cares about what previous gradients were: it accumulates them in a momentum vector rather than using only the current gradient.
- At each iteration, the local gradient (multiplied by the learning rate) is subtracted from the momentum vector, and the weights are then updated by adding the momentum vector.
- A new hyperparameter β (called the momentum) simulates a "friction" mechanism to prevent the momentum vector from growing too large. It is a value between 0 (high friction) and 1 (no friction).
- A typical value for β is 0.9.
- The momentum algorithm
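Spelled out with the usual notation (θ for the weights, η for the learning rate, J for the cost function — notation assumed, since these notes don't define it), the two update steps described above are:

  1. m ← β·m − η·∇θ J(θ)
  2. θ ← θ + m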
- Momentum optimization in Keras
from tensorflow import keras

# lr was renamed to learning_rate in recent Keras versions
optimizer = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9)
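Continuing from the line above, a minimal sketch of where this optimizer gets plugged in (the toy two-layer regression model is made up for illustration):

# Hypothetical tiny model, just to show where the optimizer goes
model = keras.Sequential([
    keras.Input(shape=(8,)),
    keras.layers.Dense(30, activation="relu"),
    keras.layers.Dense(1),
])
model.compile(loss="mse", optimizer=optimizer)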
- With β = 0.9, the terminal velocity (the maximum size of the weight updates) equals the gradient times the learning rate times 1/(1 − β) = 10, so momentum optimization ends up going 10 times faster than regular gradient descent.
- Momentum optimization escapes plateaus much faster than regular gradient descent (see the sketch at the end of these notes).
- Momentum optimization can also help roll past local optima.
- Due to the momentum, the optimizer may overshoot a bit and keep oscillating before stabilizing at the minimum. The friction in the system helps reduce these oscillations and speed up convergence.
- Drawback: one more hyperparameter to tune. However, a value of 0.9 works well in practice.
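To make the plateau / terminal-velocity claims concrete, here is a small from-scratch sketch (my own toy example in plain Python, not from the book; the function names and the 0.01 slope are made up) comparing regular gradient descent with momentum optimization on a nearly flat slope:

def gd(grad, theta0, lr, n_steps):
    # Plain gradient descent: theta <- theta - lr * grad(theta)
    theta = theta0
    for _ in range(n_steps):
        theta = theta - lr * grad(theta)
    return theta

def momentum_gd(grad, theta0, lr, beta, n_steps):
    # Momentum optimization: m <- beta*m - lr*grad(theta); theta <- theta + m
    theta, m = theta0, 0.0
    for _ in range(n_steps):
        m = beta * m - lr * grad(theta)
        theta = theta + m
    return theta

# A long, almost-flat slope (plateau-like): J(theta) = 0.01 * theta,
# so the gradient is a constant 0.01 and plain GD crawls.
grad = lambda theta: 0.01

print(gd(grad, theta0=100.0, lr=0.1, n_steps=1000))                     # ~99.0
print(momentum_gd(grad, theta0=100.0, lr=0.1, beta=0.9, n_steps=1000))  # ~90.1

With a constant gradient, the momentum vector settles at learning rate × gradient / (1 − β), i.e. 10 times the plain GD step for β = 0.9, which is why the second run travels roughly ten times farther in the same number of steps.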