Day 18 - Momentum Optimization
- Regular Gradient Descent takes small, regular steps down the slope, so it takes much longer to reach the bottom.
- Momentum optimization cares about what previous gradients were: it accumulates them in a momentum vector rather than using only the current gradient.
- At each iteration, the local gradient (multiplied by the learning rate) is subtracted from the momentum vector, and the weights are then updated by adding the momentum vector.
- A new hyperparameter β (called the momentum) simulates a "friction" mechanism to prevent the momentum vector from growing too large. It is a value between 0 (high friction) and 1 (no friction).
- A typical value for β is 0.9.
- The momentum algorithm
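Spelled out with the usual notation (θ for the weights, η for the learning rate, J for the cost function — notation assumed, since these notes don't define it), the two update steps described above are:

  1. m ← β·m − η·∇θ J(θ)
  2. θ ← θ + m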
- Momentum optimization in Keras
from tensorflow import keras

# lr was renamed to learning_rate in recent Keras versions
optimizer = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9)
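Continuing from the line above, a minimal sketch of where this optimizer gets plugged in (the toy two-layer regression model is made up for illustration):

# Hypothetical tiny model, just to show where the optimizer goes
model = keras.Sequential([
    keras.Input(shape=(8,)),
    keras.layers.Dense(30, activation="relu"),
    keras.layers.Dense(1),
])
model.compile(loss="mse", optimizer=optimizer)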
- With β = 0.9, the terminal velocity (the maximum size of the weight updates) equals the gradient times the learning rate times 1/(1 − β) = 10, so momentum optimization ends up going 10 times faster than regular gradient descent.
- Momentum optimization escapes plateaus much faster than regular gradient descent (see the sketch at the end of these notes).
- Momentum optimization can also help roll past local optima.
- Due to the momentum, the optimizer may overshoot a bit and keep oscillating before stabilizing at the minimum. The friction in the system helps reduce these oscillations and speed up convergence.
- Drawback: one more hyperparameter to tune. However, a value of 0.9 works well in practice.
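To make the plateau / terminal-velocity claims concrete, here is a small from-scratch sketch (my own toy example in plain Python, not from the book; the function names and the 0.01 slope are made up) comparing regular gradient descent with momentum optimization on a nearly flat slope:

def gd(grad, theta0, lr, n_steps):
    # Plain gradient descent: theta <- theta - lr * grad(theta)
    theta = theta0
    for _ in range(n_steps):
        theta = theta - lr * grad(theta)
    return theta

def momentum_gd(grad, theta0, lr, beta, n_steps):
    # Momentum optimization: m <- beta*m - lr*grad(theta); theta <- theta + m
    theta, m = theta0, 0.0
    for _ in range(n_steps):
        m = beta * m - lr * grad(theta)
        theta = theta + m
    return theta

# A long, almost-flat slope (plateau-like): J(theta) = 0.01 * theta,
# so the gradient is a constant 0.01 and plain GD crawls.
grad = lambda theta: 0.01

print(gd(grad, theta0=100.0, lr=0.1, n_steps=1000))                     # ~99.0
print(momentum_gd(grad, theta0=100.0, lr=0.1, beta=0.9, n_steps=1000))  # ~90.1

With a constant gradient, the momentum vector settles at learning rate × gradient / (1 − β), i.e. 10 times the plain GD step for β = 0.9, which is why the second run travels roughly ten times farther in the same number of steps.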