Day 2 - Gradient descent
Gradient Descent
- Gradient descent is an optimization algorithm that iteratively tries to find the best parameters for a model.
- We know how good (or bad) the parameters are by looking at the value of the cost function.
The Algorithm
- Start with random values for the parameters.
- Measure the gradient of the cost function with respect to the parameters.
- Update the parameters in the direction of descending gradient to decrease the cost function.
Steps 2 & 3 are repeated until the best parameters are found, i.e. the algorithm converges and the global minimum of the cost function is reached (see the sketch below).
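A minimal NumPy sketch of these three steps, using batch gradient descent for linear regression with an MSE cost. The synthetic data and the hyperparameter values are assumptions chosen purely for illustration.

```python
import numpy as np

# Illustrative data: 100 samples, 3 features, generated from a known linear model.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
true_theta = np.array([2.0, -1.0, 0.5])
y = X @ true_theta + rng.normal(scale=0.1, size=100)

learning_rate = 0.1   # assumed value, not tuned
n_iterations = 1000

# Step 1: start with random values for the parameters.
theta = rng.normal(size=3)

for _ in range(n_iterations):
    # Step 2: gradient of the MSE cost with respect to the parameters.
    gradient = (2 / len(X)) * X.T @ (X @ theta - y)
    # Step 3: update the parameters in the direction of descending gradient.
    theta -= learning_rate * gradient

print(theta)  # should end up close to true_theta
```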
Learning Rate
- How large a change you make to the parameters in step 3 depends on the learning rate.
- If the learning rate is too small, the algorithm will take too long to converge.
- If the learning rate is too big, the algorithm will overshoot the minimum and might even end up diverging (see the sketch below).
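A tiny sketch of both failure modes on the toy cost function J(θ) = θ², whose gradient is 2θ. The specific learning rates here are just illustrative.

```python
# Gradient descent on J(theta) = theta**2, whose gradient is 2 * theta.
def run_gradient_descent(learning_rate, n_steps=50, theta=5.0):
    for _ in range(n_steps):
        theta -= learning_rate * 2 * theta
    return theta

print(run_gradient_descent(0.001))  # too small: after 50 steps, still far from the minimum at 0
print(run_gradient_descent(0.1))    # reasonable: very close to 0
print(run_gradient_descent(1.5))    # too large: each step overshoots and the value diverges
```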
Types of Gradient Descent
- Batch gradient descent
- Stochastic gradient descent
- Mini-batch gradient descent
Comparison of gradient descent types
| Property | Batch | Stochastic | Mini-batch |
|---|---|---|---|
| Amount of data used in each gradient descent step | Entire dataset | One sample | A fixed number of randomly selected samples |
| Speed | Slow | Fast | Fast, especially on a GPU |
| Memory requirement | High | Low | Depends on batch size |
| Shuffling of data required | No | Yes, so that the samples are Independent and Identically Distributed (IID) and the algorithm moves towards the global minimum on average, instead of optimizing for one label and then the next when the data is sorted by label | Yes |
| Disadvantages | Slow on large datasets, since every step uses the entire training set | Since the entire dataset is not used, each gradient update is not exactly in the direction of steepest descent of the cost function; shuffling the data and going through instances one at a time slows down convergence | More prone to getting stuck in local minima or plateaus than stochastic gradient descent |
| Advantages | Scales well with the number of features | Works well for very large datasets, since only one instance has to be kept in memory at a time; because updates are not exactly in the direction of steepest descent, it can jump out of a local minimum | Less erratic than stochastic gradient descent |

- Learning schedule: start with a large learning rate so the algorithm can escape local minima, then gradually reduce it so the algorithm can settle at the global minimum (see the sketch below).
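A minimal NumPy sketch of mini-batch gradient descent with shuffling and a simple learning schedule (setting `batch_size = 1` turns it into stochastic gradient descent). The synthetic data, batch size, and schedule constants are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=1000)

def learning_schedule(t, t0=5.0, t1=50.0):
    # Large learning rate early on, gradually reduced over time.
    return t0 / (t + t1)

theta = rng.normal(size=3)
batch_size = 32
n_epochs = 20
step = 0

for epoch in range(n_epochs):
    # Shuffle so that the mini-batches are (approximately) IID.
    indices = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = indices[start:start + batch_size]
        X_b, y_b = X[batch], y[batch]
        gradient = (2 / len(X_b)) * X_b.T @ (X_b @ theta - y_b)
        theta -= learning_schedule(step) * gradient
        step += 1

print(theta)  # should end up close to [2.0, -1.0, 0.5]
```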
- Gradient descent converges faster if all features have a similar scale.
- A common practice while using gradient descent is to allow a large number of iterations but interrupt the algorithm once the gradient vector becomes very small (its norm falls below a tolerance). This happens when gradient descent has almost reached the minimum (see the sketch below).
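A sketch combining both points above: the features are standardized to a similar scale first, and the loop is given a large iteration budget but interrupted once the gradient's norm falls below a small tolerance. All the values used here are assumptions for illustration.

```python
import numpy as np

# Illustrative data with features on very different scales.
rng = np.random.default_rng(1)
X = rng.normal(loc=10.0, scale=[1.0, 100.0, 0.01], size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Standardize each feature to zero mean and unit variance so gradient descent converges faster.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

theta = rng.normal(size=3)
learning_rate = 0.1
max_iterations = 100_000   # deliberately large budget
tolerance = 1e-6           # stop once the gradient is this small

for iteration in range(max_iterations):
    gradient = (2 / len(X_scaled)) * X_scaled.T @ (X_scaled @ theta - y)
    if np.linalg.norm(gradient) < tolerance:
        break  # gradient is tiny, so we are (almost) at the minimum
    theta -= learning_rate * gradient

print(iteration, theta)
```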