- One of the algorithms used to train neural networks.
- Handles one mini-batch at a time, and goes through the full training
  set multiple times (each pass through the entire set is called an
  epoch).
- Uses an efficient way to calculate gradients: automatic
  differentiation.
- In essence, it is the gradient descent algorithm (see the sketch
  after this list):
  - Each mini-batch is passed through the input of the network and
    all intermediate layers up to the output layer (aka the forward
    pass).
  - All intermediate results are preserved to be used in the
    backward pass.
  - The algorithm measures the network's error using a loss
    function (it compares the network's output with the desired
    output and returns a numerical value for the error).
  - The error gradient is computed using the chain rule, starting
    from the output layer and propagated all the way back to the
    input layer (aka the backward pass).
  - A gradient descent step is performed to tweak all the weights
    using the error gradients just computed.
- The above steps are repeated until the network converges to a
  solution.
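To make these steps concrete, here is a minimal NumPy sketch of the training loop described above (not TensorFlow's implementation): the network size, learning rate, loss function, and random data are arbitrary choices for illustration only.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy data and a tiny one-hidden-layer network (sizes are made up for illustration).
X, y = rng.normal(size=(1000, 4)), rng.normal(size=(1000, 1))
W1, b1 = rng.normal(scale=0.1, size=(4, 8)), np.zeros(8)
W2, b2 = rng.normal(scale=0.1, size=(8, 1)), np.zeros(1)
lr, batch_size, epochs = 0.01, 32, 5

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for epoch in range(epochs):                    # several passes over the full training set
    for i in range(0, len(X), batch_size):     # one mini-batch at a time
        xb, yb = X[i:i + batch_size], y[i:i + batch_size]

        # Forward pass: intermediate results are kept for the backward pass.
        z1 = xb @ W1 + b1
        a1 = sigmoid(z1)
        out = a1 @ W2 + b2

        # Loss: mean squared error between network output and target.
        err = out - yb
        loss = np.mean(err ** 2)

        # Backward pass: chain rule from the output layer back to the input layer.
        d_out = 2 * err / len(xb)
        dW2 = a1.T @ d_out
        db2 = d_out.sum(axis=0)
        d_a1 = d_out @ W2.T
        d_z1 = d_a1 * a1 * (1 - a1)            # sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z))
        dW1 = xb.T @ d_z1
        db1 = d_z1.sum(axis=0)

        # Gradient descent step: tweak every weight using its error gradient.
        W1 -= lr * dW1; b1 -= lr * db1
        W2 -= lr * dW2; b2 -= lr * db2

    print(f"epoch {epoch}: last-batch loss {loss:.4f}")
```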
Automatic Differentiation (autodiff)
- The automatic way of computing gradients is known as automatic
  differentiation, or autodiff.
- Backpropagation uses reverse-mode autodiff, which calculates
  gradients in just two passes through the network:
  - Pass 1: a forward pass to compute and store the value at each
    node.
  - Pass 2: a reverse pass to compute all partial derivatives using
    the chain rule.
- Reverse-mode autodiff needs only one reverse pass per output of the
  network (i.e. per loss function).
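As a concrete illustration of the two passes, the sketch below differentiates a small made-up function with TensorFlow's GradientTape: the forward pass is recorded on the tape, and a single reverse pass returns the partial derivative with respect to each variable.

```python
import tensorflow as tf

# Hypothetical toy function of two variables, just to show reverse-mode autodiff.
x = tf.Variable(3.0)
y = tf.Variable(4.0)

with tf.GradientTape() as tape:
    f = x ** 2 * y + y + 2            # Pass 1: forward pass, values computed and recorded

grads = tape.gradient(f, [x, y])      # Pass 2: reverse pass, chain rule from output to inputs
print([g.numpy() for g in grads])     # [24.0, 10.0] -> df/dx = 2xy, df/dy = x**2 + 1
```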
- TensorFlow implements symbolic reverse-mode autodiff: a new
  computation graph for the derivatives is produced instead of
  calculating values on the fly.
- Advantages of symbolic autodiff:
  - The graph of the derivative needs to be computed only once and
    can be used repeatedly.
  - A graph of a higher-order derivative can also be generated.
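A sketch of what those two advantages look like in TensorFlow 2 terms (the notes describe the older graph-mode mechanism; here @tf.function stands in for the reusable derivative graph, and nested tapes produce a higher-order derivative):

```python
import tensorflow as tf

# Wrapping the computation in @tf.function traces it (gradients included) into a
# graph once, which is then reused on later calls; nesting GradientTape yields a
# second derivative. Function name and example values are made up.
@tf.function
def first_and_second_derivative(x):
    with tf.GradientTape() as outer:
        outer.watch(x)
        with tf.GradientTape() as inner:
            inner.watch(x)
            f = x ** 3                  # f(x) = x^3
        df = inner.gradient(f, x)       # f'(x) = 3x^2
    d2f = outer.gradient(df, x)         # f''(x) = 6x
    return df, d2f

print(first_and_second_derivative(tf.constant(3.0)))   # df = 27.0, d2f = 18.0
```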
Initialization of weights
- All hidden layer weights should be initialized randomly.
- If all weights are equal (e.g. all zero), then backpropagation
  affects all of them in exactly the same way.
- The neurons remain identical throughout training (the model acts as
  if there were just one neuron per layer).
- Random initialization of the weights breaks this symmetry and allows
  backpropagation to train a diverse set of neurons.
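A quick NumPy sketch of the symmetry problem (layer sizes and values are made up for illustration):

```python
import numpy as np

# With a zero-initialized hidden layer, every hidden neuron computes the same
# output and receives the same gradient, so the neurons stay identical forever.
rng = np.random.default_rng(0)
x = rng.normal(size=(1, 3))                 # one input sample
W1 = np.zeros((3, 4))                       # symmetric init: all hidden weights equal
W2 = np.ones((4, 1))                        # output weights (any equal values)

h = np.tanh(x @ W1)                         # hidden activations: all identical (all 0 here)
out = h @ W2
d_out = out - 1.0                           # gradient of a squared error w.r.t. the output
d_h = (d_out @ W2.T) * (1 - h ** 2)         # backprop through tanh
dW1 = x.T @ d_h                             # gradient for the hidden weights

print(dW1)                                  # every column is identical -> neurons stay clones
# Random initialization, e.g. W1 = rng.normal(scale=0.1, size=(3, 4)), breaks this symmetry.
```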
Change in activation function
- Key change in architecture to make backpropagation work: replace the
  step function with the logistic (sigmoid) function:
  σ(z) = 1 / (1 + e^(−z)).
- The step function contains only flat segments, so there is no
  gradient to work with.
- Other popular choices of activation function: the hyperbolic tangent
  and the Rectified Linear Unit (ReLU).
  - Hyperbolic tangent: tanh(z) = 2σ(2z) − 1
    - Range: −1 to 1 (instead of 0 to 1 like the sigmoid) ⇒ each
      layer's output is centered around 0 at the beginning of
      training, which often helps speed up convergence.
  - ReLU: ReLU(z) = max(0, z)
    - Not differentiable at z = 0.
    - Its derivative is 0 for z < 0.
    - Helps reduce issues of unstable gradients during gradient
      descent because it has no maximum output value.
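For comparison, the three activations and their derivatives can be written directly in NumPy (a standalone sketch, not a framework API):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh(z):
    return 2.0 * sigmoid(2.0 * z) - 1.0     # same values as np.tanh(z)

def d_tanh(z):
    return 1.0 - tanh(z) ** 2

def relu(z):
    return np.maximum(0.0, z)

def d_relu(z):
    return np.where(z > 0, 1.0, 0.0)        # 0 for z < 0 (and taken as 0 at z = 0)

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z))   # [0.119 0.5   0.881]  -> outputs in (0, 1)
print(tanh(z))      # [-0.964 0.    0.964] -> outputs in (-1, 1), centered on 0
print(relu(z))      # [0. 0. 2.]           -> no maximum output value
print(d_relu(z))    # [0. 0. 1.]           -> flat (zero gradient) for negative inputs
```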