- One of the algorithms used to train neural networks.
- Handles one mini-batch at a time, and goes through the full training
  set multiple times (each pass through the entire set is called an
  epoch).
- Uses an efficient way to calculate gradients: automatic
  differentiation.
- In essence, it is the gradient descent algorithm (see the sketch
  after this list):
  - Each mini-batch is passed through the input of the network and
    all intermediate layers up to the output layer (aka the forward
    pass).
  - All intermediate results are preserved to be used in the
    backward pass.
  - The algorithm measures the network's error using a loss
    function (it compares the network's output with the desired
    output and returns a numerical value for the error).
  - The error gradient is computed using the chain rule, starting
    from the output layer and propagated all the way back to the
    input layer (aka the backward pass).
  - A gradient descent step is performed to tweak all the weights
    using the error gradients just computed.
- The above steps are repeated until the network converges to a
  solution.
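To make these steps concrete, here is a minimal NumPy sketch of the training loop described above (not TensorFlow's implementation): the network size, learning rate, loss function, and random data are arbitrary choices for illustration only.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy data and a tiny one-hidden-layer network (sizes are made up for illustration).
X, y = rng.normal(size=(1000, 4)), rng.normal(size=(1000, 1))
W1, b1 = rng.normal(scale=0.1, size=(4, 8)), np.zeros(8)
W2, b2 = rng.normal(scale=0.1, size=(8, 1)), np.zeros(1)
lr, batch_size, epochs = 0.01, 32, 5

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for epoch in range(epochs):                    # several passes over the full training set
    for i in range(0, len(X), batch_size):     # one mini-batch at a time
        xb, yb = X[i:i + batch_size], y[i:i + batch_size]

        # Forward pass: intermediate results are kept for the backward pass.
        z1 = xb @ W1 + b1
        a1 = sigmoid(z1)
        out = a1 @ W2 + b2

        # Loss: mean squared error between network output and target.
        err = out - yb
        loss = np.mean(err ** 2)

        # Backward pass: chain rule from the output layer back to the input layer.
        d_out = 2 * err / len(xb)
        dW2 = a1.T @ d_out
        db2 = d_out.sum(axis=0)
        d_a1 = d_out @ W2.T
        d_z1 = d_a1 * a1 * (1 - a1)            # sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z))
        dW1 = xb.T @ d_z1
        db1 = d_z1.sum(axis=0)

        # Gradient descent step: tweak every weight using its error gradient.
        W1 -= lr * dW1; b1 -= lr * db1
        W2 -= lr * dW2; b2 -= lr * db2

    print(f"epoch {epoch}: last-batch loss {loss:.4f}")
```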
Automatic Differentiation (autodiff)
- The automatic way of computing gradients is known as automatic
  differentiation, or autodiff.
- Backpropagation uses reverse-mode autodiff, which calculates
  gradients in just two passes through the network:
  - Pass 1: a forward pass to compute and store the value at each
    node.
  - Pass 2: a reverse pass to compute all partial derivatives using
    the chain rule.
- Reverse-mode autodiff needs only one reverse pass per output of the
  network (i.e. per loss function).
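As a concrete illustration of the two passes, the sketch below differentiates a small made-up function with TensorFlow's GradientTape: the forward pass is recorded on the tape, and a single reverse pass returns the partial derivative with respect to each variable.

```python
import tensorflow as tf

# Hypothetical toy function of two variables, just to show reverse-mode autodiff.
x = tf.Variable(3.0)
y = tf.Variable(4.0)

with tf.GradientTape() as tape:
    f = x ** 2 * y + y + 2            # Pass 1: forward pass, values computed and recorded

grads = tape.gradient(f, [x, y])      # Pass 2: reverse pass, chain rule from output to inputs
print([g.numpy() for g in grads])     # [24.0, 10.0] -> df/dx = 2xy, df/dy = x**2 + 1
```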
- TensorFlow implements symbolic reverse-mode autodiff: a new
  computation graph for the derivatives is produced instead of
  calculating values on the fly.
- Advantages of symbolic autodiff:
  - The graph of the derivative needs to be computed only once and
    can be used repeatedly.
  - A graph of a higher-order derivative can also be generated.
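A sketch of what those two advantages look like in TensorFlow 2 terms (the notes describe the older graph-mode mechanism; here @tf.function stands in for the reusable derivative graph, and nested tapes produce a higher-order derivative):

```python
import tensorflow as tf

# Wrapping the computation in @tf.function traces it (gradients included) into a
# graph once, which is then reused on later calls; nesting GradientTape yields a
# second derivative. Function name and example values are made up.
@tf.function
def first_and_second_derivative(x):
    with tf.GradientTape() as outer:
        outer.watch(x)
        with tf.GradientTape() as inner:
            inner.watch(x)
            f = x ** 3                  # f(x) = x^3
        df = inner.gradient(f, x)       # f'(x) = 3x^2
    d2f = outer.gradient(df, x)         # f''(x) = 6x
    return df, d2f

print(first_and_second_derivative(tf.constant(3.0)))   # df = 27.0, d2f = 18.0
```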
Initialization of weights
- All hidden layer weights should be initialized randomly.
- If all weights are equal (e.g. all zero), then backpropagation
  affects all of them in exactly the same way.
- The neurons remain identical throughout training (the model acts as
  if there were just one neuron per layer).
- Random initialization of the weights breaks this symmetry and allows
  backpropagation to train a diverse set of neurons.
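A quick NumPy sketch of the symmetry problem (layer sizes and values are made up for illustration):

```python
import numpy as np

# With a zero-initialized hidden layer, every hidden neuron computes the same
# output and receives the same gradient, so the neurons stay identical forever.
rng = np.random.default_rng(0)
x = rng.normal(size=(1, 3))                 # one input sample
W1 = np.zeros((3, 4))                       # symmetric init: all hidden weights equal
W2 = np.ones((4, 1))                        # output weights (any equal values)

h = np.tanh(x @ W1)                         # hidden activations: all identical (all 0 here)
out = h @ W2
d_out = out - 1.0                           # gradient of a squared error w.r.t. the output
d_h = (d_out @ W2.T) * (1 - h ** 2)         # backprop through tanh
dW1 = x.T @ d_h                             # gradient for the hidden weights

print(dW1)                                  # every column is identical -> neurons stay clones
# Random initialization, e.g. W1 = rng.normal(scale=0.1, size=(3, 4)), breaks this symmetry.
```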
Change in activation function
- Key change in architecture to make backpropagation work: replace the
  step function with the logistic (sigmoid) function:
  σ(z) = 1 / (1 + e^(−z)).
- The step function contains only flat segments, so there is no
  gradient to work with.
- Other popular choices of activation function: the hyperbolic tangent
  and the Rectified Linear Unit (ReLU).
  - Hyperbolic tangent: tanh(z) = 2σ(2z) − 1
    - Range: −1 to 1 (instead of 0 to 1 like the sigmoid) ⇒ each
      layer's output is centered around 0 at the beginning of
      training, which often helps speed up convergence.
  - ReLU: ReLU(z) = max(0, z)
    - Not differentiable at z = 0.
    - Its derivative is 0 for z < 0.
    - Helps reduce issues of unstable gradients during gradient
      descent because it has no maximum output value.
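For comparison, the three activations and their derivatives can be written directly in NumPy (a standalone sketch, not a framework API):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh(z):
    return 2.0 * sigmoid(2.0 * z) - 1.0     # same values as np.tanh(z)

def d_tanh(z):
    return 1.0 - tanh(z) ** 2

def relu(z):
    return np.maximum(0.0, z)

def d_relu(z):
    return np.where(z > 0, 1.0, 0.0)        # 0 for z < 0 (and taken as 0 at z = 0)

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z))   # [0.119 0.5   0.881]  -> outputs in (0, 1)
print(tanh(z))      # [-0.964 0.    0.964] -> outputs in (-1, 1), centered on 0
print(relu(z))      # [0. 0. 2.]           -> no maximum output value
print(d_relu(z))    # [0. 0. 1.]           -> flat (zero gradient) for negative inputs
```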