Day 2 - Gradient descent

Gradient Descent

The Algorithm

  1. Start with random values for the parameters.
  1. Measure the gradient of the cost function with respect to the parameters.
  1. Update the parameters in the direction of descending gradient to decrease the cost function.

Steps 2 and 3 are repeated until the best parameters are found, i.e. until the algorithm converges and, ideally, the global minimum of the cost function is reached (see the sketch below).
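
As a concrete illustration of these steps, here is a minimal sketch of batch gradient descent for linear regression with the MSE cost function. The generated data, the learning rate `eta`, and the number of iterations are illustrative assumptions, not values from these notes.

```python
import numpy as np

# Minimal sketch of batch gradient descent for linear regression (MSE cost).
rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))                  # 100 samples, 1 feature
y = 4 + 3 * X + rng.normal(size=(100, 1))     # y = 4 + 3x + Gaussian noise
X_b = np.c_[np.ones((100, 1)), X]             # add bias column x0 = 1
m = len(X_b)

eta = 0.1                                     # learning rate
n_iterations = 1000

theta = rng.normal(size=(2, 1))               # step 1: random initial parameters
for _ in range(n_iterations):
    gradients = 2 / m * X_b.T @ (X_b @ theta - y)   # step 2: gradient of MSE w.r.t. theta
    theta = theta - eta * gradients                 # step 3: step against the gradient

print(theta)   # should end up close to the true parameters [[4], [3]]
```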

Learning Rate

The learning rate is the hyperparameter that sets the size of each update step. If it is too small, the algorithm needs many iterations to converge; if it is too large, the updates can overshoot the minimum and the algorithm may diverge.
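
A toy sketch of why this choice matters, minimizing J(theta) = theta^2, whose gradient is 2 * theta; the specific learning rates and step count are assumed values for illustration only.

```python
# Effect of the learning rate on a toy cost J(theta) = theta**2 (gradient: 2 * theta).
def run_gd(eta, theta=5.0, n_steps=20):
    for _ in range(n_steps):
        theta = theta - eta * 2 * theta   # gradient descent step
    return theta

print(run_gd(eta=0.01))   # too small: after 20 steps, still far from the minimum at 0
print(run_gd(eta=0.1))    # reasonable: ends up close to 0
print(run_gd(eta=1.1))    # too large: each step overshoots and the iterates diverge
```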

Types of Gradient Descent

  1. Batch gradient descent
  1. Stochastic gradient descent
  1. Mini-batch gradient descent

Comparison of gradient descent types

| Property | Batch | Stochastic | Mini-batch |
| --- | --- | --- | --- |
| Amount of data used in each gradient descent step | Entire dataset | One sample | A fixed-size random subset of samples (the mini-batch) |
| Speed | Slow | Fast | Fast, especially on a GPU |
| Memory requirement | High | Low | Depends on batch size |
| Shuffling of data required | No | Yes, so that the data is independent and identically distributed (IID) and the algorithm moves towards the global minimum on average, instead of optimizing for one label and then the next when the data is sorted by label | Yes |
| Disadvantages | Very slow on large datasets, since the entire dataset is used for every step | Since the entire dataset is not used, gradient updates are not completely accurate (not exactly in the direction of steepest descent of the cost function); going through one instance at a time slows down convergence | More prone to getting stuck in local minima or on a plateau than stochastic gradient descent |
| Advantages | Scales well with the number of features | Works well for really large datasets (only one instance has to be kept in memory at a time); since updates are not exactly in the direction of steepest descent, it can jump out of local minima | Less erratic than stochastic gradient descent |
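
The three variants differ only in how many samples feed each parameter update. Below is a hedged sketch on the same kind of generated linear-regression data as above; the dataset, learning rate, epoch count, and batch sizes are all assumptions for illustration.

```python
import numpy as np

# The three variants differ only in how many samples estimate each gradient.
rng = np.random.default_rng(0)
X = 2 * rng.random((200, 1))
y = 4 + 3 * X + rng.normal(size=(200, 1))
X_b = np.c_[np.ones((200, 1)), X]             # add bias column x0 = 1
m = len(X_b)

def mse_gradient(X_batch, y_batch, theta):
    return 2 / len(X_batch) * X_batch.T @ (X_batch @ theta - y_batch)

def gradient_descent(batch_size, eta=0.1, n_epochs=100):
    theta = rng.normal(size=(2, 1))           # random initialization
    for _ in range(n_epochs):
        indices = rng.permutation(m)          # shuffle (needed for SGD / mini-batch)
        for start in range(0, m, batch_size):
            batch = indices[start:start + batch_size]
            theta = theta - eta * mse_gradient(X_b[batch], y[batch], theta)
    return theta

print(gradient_descent(batch_size=m))    # batch GD: entire dataset per update
print(gradient_descent(batch_size=1))    # stochastic GD: one sample per update
print(gradient_descent(batch_size=32))   # mini-batch GD: small random subset per update
```

All three runs should roughly recover the true parameters ([[4], [3]]), with the stochastic run being the noisiest at a constant learning rate, which is what the learning schedule below addresses.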

Learning Schedule

Start with a large learning rate to enable the algorithm to escape local minima, then gradually reduce it to allow the algorithm to settle at the global minimum.
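
A minimal sketch of this idea for stochastic gradient descent, using a simple decay of the form eta(t) = t0 / (t + t1); the values of t0 and t1, the generated data, and the epoch count are assumptions for illustration.

```python
import numpy as np

# Stochastic gradient descent with a decaying learning rate (learning schedule).
rng = np.random.default_rng(1)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X + rng.normal(size=(100, 1))
X_b = np.c_[np.ones((100, 1)), X]             # add bias column x0 = 1
m = len(X_b)

t0, t1 = 5, 50                                # schedule hyperparameters (assumed)

def learning_schedule(t):
    return t0 / (t + t1)                      # starts at t0 / t1 and decays towards 0

theta = rng.normal(size=(2, 1))
n_epochs = 50

for epoch in range(n_epochs):
    for i in range(m):
        idx = rng.integers(m)                        # pick one random instance
        xi, yi = X_b[idx:idx + 1], y[idx:idx + 1]
        gradients = 2 * xi.T @ (xi @ theta - yi)     # gradient from a single sample
        eta = learning_schedule(epoch * m + i)       # learning rate shrinks each step
        theta = theta - eta * gradients

print(theta)   # should settle close to the true parameters [[4], [3]]
```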