Day 2 - Gradient descent
Gradient Descent
- Gradient descent is an optimization algorithm that iteratively tries to find the best parameters for a model.
- We know how good (or bad) the parameters are by looking at the value of the cost function.
The Algorithm
- Start with random values for the parameters.
- Measure the gradient of the cost function with respect to the parameters.
- Update the parameters in the direction of descending gradient to decrease the cost function.
Steps 2 & 3 are repeated until the best parameters are found, i.e. the algorithm converges and the global minimum of the cost function is reached (see the sketch below).
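A minimal NumPy sketch of these three steps, using batch gradient descent for linear regression with an MSE cost. The synthetic data and the hyperparameter values are assumptions chosen purely for illustration.

```python
import numpy as np

# Illustrative data: 100 samples, 3 features, generated from a known linear model.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
true_theta = np.array([2.0, -1.0, 0.5])
y = X @ true_theta + rng.normal(scale=0.1, size=100)

learning_rate = 0.1   # assumed value, not tuned
n_iterations = 1000

# Step 1: start with random values for the parameters.
theta = rng.normal(size=3)

for _ in range(n_iterations):
    # Step 2: gradient of the MSE cost with respect to the parameters.
    gradient = (2 / len(X)) * X.T @ (X @ theta - y)
    # Step 3: update the parameters in the direction of descending gradient.
    theta -= learning_rate * gradient

print(theta)  # should end up close to true_theta
```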
Learning Rate
- How large a change you make to the parameters in step 3 depends on the learning rate.
- If the learning rate is too small, the algorithm will take too long to converge.
- If the learning rate is too big, the algorithm will overshoot the minimum and might even end up diverging (see the sketch below).
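A tiny sketch of both failure modes on the toy cost function J(θ) = θ², whose gradient is 2θ. The specific learning rates here are just illustrative.

```python
# Gradient descent on J(theta) = theta**2, whose gradient is 2 * theta.
def run_gradient_descent(learning_rate, n_steps=50, theta=5.0):
    for _ in range(n_steps):
        theta -= learning_rate * 2 * theta
    return theta

print(run_gradient_descent(0.001))  # too small: after 50 steps, still far from the minimum at 0
print(run_gradient_descent(0.1))    # reasonable: very close to 0
print(run_gradient_descent(1.5))    # too large: each step overshoots and the value diverges
```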
Types of Gradient Descent
- Batch gradient descent
- Stochastic gradient descent
- Mini-batch gradient descent
Comparison of gradient descent types
| Property | Batch | Stochastic | Mini-batch |
|---|---|---|---|
| Amount of data used in each gradient descent step | Entire dataset | One sample | A fixed number of randomly selected samples |
| Speed | Slow | Fast | Fast, especially on a GPU |
| Memory requirement | High | Low | Depends on batch size |
| Shuffling of data required | No | Yes, so that the samples are Independent and Identically Distributed (IID) and the algorithm moves towards the global minimum on average, instead of optimizing for one label and then the next when the data is sorted by label | Yes |
| Disadvantages | Slow on large datasets, since every step uses the entire training set | Since the entire dataset is not used, each gradient update is not exactly in the direction of steepest descent of the cost function; shuffling the data and going through instances one at a time slows down convergence | More prone to getting stuck in local minima or plateaus than stochastic gradient descent |
| Advantages | Scales well with the number of features | Works well for very large datasets, since only one instance has to be kept in memory at a time; because updates are not exactly in the direction of steepest descent, it can jump out of a local minimum | Less erratic than stochastic gradient descent |

- Learning schedule: start with a large learning rate so the algorithm can escape local minima, then gradually reduce it so the algorithm can settle at the global minimum (see the sketch below).
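A minimal NumPy sketch of mini-batch gradient descent with shuffling and a simple learning schedule (setting `batch_size = 1` turns it into stochastic gradient descent). The synthetic data, batch size, and schedule constants are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=1000)

def learning_schedule(t, t0=5.0, t1=50.0):
    # Large learning rate early on, gradually reduced over time.
    return t0 / (t + t1)

theta = rng.normal(size=3)
batch_size = 32
n_epochs = 20
step = 0

for epoch in range(n_epochs):
    # Shuffle so that the mini-batches are (approximately) IID.
    indices = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = indices[start:start + batch_size]
        X_b, y_b = X[batch], y[batch]
        gradient = (2 / len(X_b)) * X_b.T @ (X_b @ theta - y_b)
        theta -= learning_schedule(step) * gradient
        step += 1

print(theta)  # should end up close to [2.0, -1.0, 0.5]
```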
- Gradient descent converges faster if all features have a similar scale.
- A common practice while using gradient descent is to allow a large number of iterations but interrupt the algorithm once the gradient vector becomes very small (its norm falls below a tolerance). This happens when gradient descent has almost reached the minimum (see the sketch below).
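A sketch combining both points above: the features are standardized to a similar scale first, and the loop is given a large iteration budget but interrupted once the gradient's norm falls below a small tolerance. All the values used here are assumptions for illustration.

```python
import numpy as np

# Illustrative data with features on very different scales.
rng = np.random.default_rng(1)
X = rng.normal(loc=10.0, scale=[1.0, 100.0, 0.01], size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Standardize each feature to zero mean and unit variance so gradient descent converges faster.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

theta = rng.normal(size=3)
learning_rate = 0.1
max_iterations = 100_000   # deliberately large budget
tolerance = 1e-6           # stop once the gradient is this small

for iteration in range(max_iterations):
    gradient = (2 / len(X_scaled)) * X_scaled.T @ (X_scaled @ theta - y)
    if np.linalg.norm(gradient) < tolerance:
        break  # gradient is tiny, so we are (almost) at the minimum
    theta -= learning_rate * gradient

print(iteration, theta)
```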