Day 19 - Nesterov Accelerated Gradient
- A faster variant of momentum optimization.
- The gradient of the cost function is not calculated at the local position θ but slightly ahead in the direction of the momentum, at θ + βm.
- Nesterov Accelerated Gradient algorithm (one update step; a NumPy sketch follows this list):
  1. m ← βm − η ∇θ J(θ + βm)
  2. θ ← θ + m
- This tweak works because the momentum vector generally points toward the optimum, so it is slightly more accurate to measure the gradient a bit farther in that direction rather than at the original position.
- When momentum pushes the weights across a valley, regular momentum optimization continues to push further across the valley while NAG pushes back toward the bottom of the valley.
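A minimal NumPy sketch of the NAG update on a toy quadratic cost. The function `cost_grad`, the matrix `A`, and the values of `eta` and `beta` are illustrative assumptions, not from the source; the point is only that the gradient is measured at θ + βm instead of at θ.

```python
import numpy as np

def cost_grad(theta):
    """Gradient of a toy quadratic cost J(theta) = 0.5 * theta @ A @ theta (illustrative)."""
    A = np.array([[4.0, 0.0], [0.0, 1.0]])
    return A @ theta

eta, beta = 0.1, 0.9                 # learning rate and momentum coefficient (assumed values)
theta = np.array([1.0, 1.0])         # current parameters
m = np.zeros_like(theta)             # momentum vector

for _ in range(50):
    # NAG: measure the gradient a bit ahead, at theta + beta*m, rather than at theta
    m = beta * m - eta * cost_grad(theta + beta * m)
    theta = theta + m

print(theta)  # close to the optimum at [0, 0]
```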
Using Nesterov Accelerated Gradient in Keras
from tensorflow import keras

optimizer = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9, nesterov=True)  # nesterov=True enables NAG; learning_rate replaces the deprecated lr argument
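For context, this optimizer would then be passed to `model.compile`. The tiny model, input shape, and loss below are illustrative assumptions only, reusing the `optimizer` defined above:

```python
# Illustrative model to show where the optimizer plugs in (architecture and loss are assumed)
model = keras.Sequential([
    keras.Input(shape=(10,)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(1),
])
model.compile(loss="mse", optimizer=optimizer)
```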