Notes on backpropagation
These are my notes on using backpropagation for training neural networks.
If you are using a machine learning library that implements automatic differentiation (“autograd”), you don’t strictly need to understand backpropagation. However, if for some reason you cannot use a library with autograd, want to implement your own library, or just want to understand the fundamentals of artificial neural networks, you do need an understanding of backpropagation.
Context
Mathematically, an artificial neural network can be seen as a function $f(x; \theta)$ that depends on a (usually large) set of parameters $\theta$. During training, the parameters get adjusted such that

$$f(x; \theta) \approx y$$

for tuples $(x, y)$ in a training set $T$. We can think of $y$ as the “right” output for input $x$. The hope is that the trained neural network “generalizes”, that is, provides a good output for inputs that are not in the training set.
This is usually done by defining a loss function $L$, which maps two vectors in the output space of the neural network to a non-negative number. The loss function can be thought of as a measure of how “distant” two outputs are. That is, the smaller $L(f(x; \theta), y)$ is, the closer the output of the neural network is to the desired output $y$ for input $x$.
Then, the parameters are adjusted to minimize

$$\sum_{(x, y) \in T} L(f(x; \theta), y)$$

using some variant of gradient descent.
To use gradient descent, the gradient of the loss function with respect to the parameters has to be computed. The most common algorithm to do this is called “backpropagation”, and, depending on who you ask, it’s either the pinnacle of modern science and technology, or a straightforward application of the chain rule, known since 1676.
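To make the gradient descent part concrete, here is a minimal sketch for a toy one-parameter “network” $f(x; \theta) = \theta x$ with loss $L = (\theta x - y)^2 / 2$. The function name and the learning rate are illustrative, not from the notes above:

```python
# A toy example of gradient descent on a one-parameter "network"
# f(x; theta) = theta * x with loss L = (theta * x - y)^2 / 2.
# The names and the learning rate are illustrative.

def grad_step(theta, x, y, lr):
    # dL/dtheta = (theta * x - y) * x by the chain rule
    grad = (theta * x - y) * x
    return theta - lr * grad

theta = 0.0
for _ in range(100):
    theta = grad_step(theta, x=2.0, y=6.0, lr=0.1)
# theta approaches 3.0, the value for which f(2) = 6
```

Each step moves $\theta$ a little against the gradient; backpropagation is the algorithm that computes such gradients when there are many parameters instead of one.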
Structure of neural networks
Most artificial neural networks are structured in layers. We can think of each of these layers as a function $f_k$ with its own parameters $\theta_k$. The layers then have vectors as input and output, so that the full network is the composition $f = f_n \circ \cdots \circ f_1$.
The evaluation of $f$ is called forward propagation or inference, and it’s done by evaluating the layers from first to last. Every layer $f_k$ takes an input $x_k$ and the parameters $\theta_k$, and returns an output $x_{k+1} = f_k(x_k; \theta_k)$.
During backpropagation, we go the other way: from the last layer (whose output $x_{n+1}$ was the final output during inference) to the first layer.
To do this, the chain rule for multivariate functions can be used, so that we get

$$\frac{\partial L}{\partial x_k} = \frac{\partial L}{\partial x_{k+1}} \frac{\partial x_{k+1}}{\partial x_k} \qquad \text{and} \qquad \frac{\partial L}{\partial \theta_k} = \frac{\partial L}{\partial x_{k+1}} \frac{\partial x_{k+1}}{\partial \theta_k}.$$

However, in many situations it works better to look at a single element of $\partial L / \partial \theta_k$, which is the partial derivative of $L$ with respect to a single parameter $\theta_{k,i}$:

$$\frac{\partial L}{\partial \theta_{k,i}} = \sum_j \frac{\partial L}{\partial x_{k+1,j}} \frac{\partial x_{k+1,j}}{\partial \theta_{k,i}}.$$
Then, the expression for $\partial x_{k+1,j} / \partial \theta_{k,i}$ can be worked out using the definition of the layer and substituted into this equation. The result can often be simplified to something more efficient to evaluate than the expression you get from explicitly evaluating the product $\frac{\partial L}{\partial x_{k+1}} \frac{\partial x_{k+1}}{\partial \theta_k}$.
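The per-layer structure described above suggests a simple programming interface, which can be sketched as follows. The class and function names are my own, not part of any standard library:

```python
# A sketch of the per-layer interface: forward(x) computes x_{k+1},
# and backward(dL/dx_{k+1}) returns dL/dx_k (a layer with parameters
# would also store dL/dtheta_k here). Names are illustrative.

class Layer:
    def forward(self, x):
        raise NotImplementedError

    def backward(self, grad_output):
        raise NotImplementedError

def predict(layers, x):
    # Forward propagation: evaluate the layers from first to last.
    for layer in layers:
        x = layer.forward(x)
    return x

def backprop(layers, grad_loss):
    # Backpropagation: walk the layers from last to first, passing
    # each layer's input gradient on to the layer before it.
    for layer in reversed(layers):
        grad_loss = layer.backward(grad_loss)
    return grad_loss
```

The backward pass hands $\partial L / \partial x_{k+1}$ from each layer to the one before it, which is exactly the last-to-first evaluation order described above.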
Examples
In the following, $x$ and $y$ are row vectors. In mathematics, it is more common to use column vectors, but row vectors have a couple of attractive properties. For one, they are more easily written because of their horizontal layout. Many programming languages implement a row vector as an array, and a column vector as an array of arrays. Further, it is convenient that the partial derivative $\partial L / \partial x$ and $x$ itself are both row vectors and have the same shape.
Linear layer
A linear (or “fully connected”) layer simply maps the input row vector $x$ to an output $y$ defined by

$$y = xW + b.$$

The parameters are $W$, which is the matrix of weights, and $b$, which is the bias.

For a single component of the output we have

$$y_j = \sum_i x_i W_{ij} + b_j.$$
Gradient of the input
We have $\frac{\partial y_j}{\partial x_i} = W_{ij}$, so that

$$\frac{\partial L}{\partial x_i} = \sum_j \frac{\partial L}{\partial y_j} W_{ij}$$

and

$$\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} W^T.$$
Gradient of the weights
By the multivariate chain rule, we have

$$\frac{\partial L}{\partial W_{ij}} = \sum_k \frac{\partial L}{\partial y_k} \frac{\partial y_k}{\partial W_{ij}}.$$

Taking the derivative of $y_k = \sum_l x_l W_{lk} + b_k$ with respect to $W_{ij}$, we get

$$\frac{\partial y_k}{\partial W_{ij}} = \begin{cases} x_i & \text{if } j = k, \\ 0 & \text{otherwise}. \end{cases}$$

And substituting this in yields

$$\frac{\partial L}{\partial W_{ij}} = x_i \frac{\partial L}{\partial y_j},$$

so that

$$\frac{\partial L}{\partial W} = x^T \frac{\partial L}{\partial y}.$$
Gradient of the bias
We can do something similar for the bias. Again we use the multivariate chain rule to get

$$\frac{\partial L}{\partial b_i} = \sum_k \frac{\partial L}{\partial y_k} \frac{\partial y_k}{\partial b_i}.$$

By taking the derivative of $y_k$ with respect to $b_i$ we see that

$$\frac{\partial y_k}{\partial b_i} = \begin{cases} 1 & \text{if } i = k, \\ 0 & \text{otherwise}. \end{cases}$$

Substituting this in gives

$$\frac{\partial L}{\partial b_i} = \frac{\partial L}{\partial y_i},$$

so that

$$\frac{\partial L}{\partial b} = \frac{\partial L}{\partial y}.$$
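Putting the three gradients together, a linear layer could be sketched in NumPy as follows. The class layout and names are illustrative, not a definitive implementation:

```python
import numpy as np

# A linear layer following the formulas above: y = x W + b, with x a
# 1-by-n row vector. The class layout is illustrative.

class Linear:
    def __init__(self, n_in, n_out, rng):
        self.W = rng.standard_normal((n_in, n_out))
        self.b = np.zeros((1, n_out))

    def forward(self, x):
        self.x = x  # cache the input for the backward pass
        return x @ self.W + self.b

    def backward(self, grad_y):
        # dL/dW = x^T (dL/dy), dL/db = dL/dy, dL/dx = (dL/dy) W^T
        self.grad_W = self.x.T @ grad_y
        self.grad_b = grad_y
        return grad_y @ self.W.T
```

Note that the weight gradient $x^T \frac{\partial L}{\partial y}$ is an outer product of two row vectors, which is much cheaper to evaluate than first forming the derivative of every $y_k$ with respect to every $W_{ij}$.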
Activation layer
An activation layer applies a function $\sigma$ to the input $x$ elementwise. So,

$$y = \sigma(x)$$

and

$$y_i = \sigma(x_i).$$
Activation layers have no parameters.
Gradient of the input
We have

$$\frac{\partial L}{\partial x_i} = \sum_j \frac{\partial L}{\partial y_j} \frac{\partial y_j}{\partial x_i}$$

and

$$\frac{\partial y_j}{\partial x_i} = \begin{cases} \sigma'(x_i) & \text{if } i = j, \\ 0 & \text{otherwise}, \end{cases}$$

so that

$$\frac{\partial L}{\partial x_i} = \frac{\partial L}{\partial y_i} \, \sigma'(x_i).$$
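As a concrete sketch, here is an activation layer for $\sigma = \tanh$, whose derivative $\sigma'(x) = 1 - \tanh(x)^2$ is known in closed form. The class name is my own:

```python
import numpy as np

# A sketch of an activation layer for sigma = tanh, whose derivative
# is sigma'(x) = 1 - tanh(x)^2. The class name is illustrative.

class Tanh:
    def forward(self, x):
        self.x = x  # cache the input for the backward pass
        return np.tanh(x)

    def backward(self, grad_y):
        # dL/dx_i = dL/dy_i * sigma'(x_i), an elementwise product
        return grad_y * (1.0 - np.tanh(self.x) ** 2)
```

Because the layer is elementwise, the backward pass is just an elementwise product with $\partial L / \partial y$; no matrix multiplication is needed.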
Mean squared error loss function
For an input $y$, the mean squared error loss function is defined as

$$L(y) = \frac{1}{2n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2.$$

The factor of 2 in the denominator makes the gradient slightly nicer. The loss function depends on the input $y$ as well as on a vector $\hat{y}$ of desired outputs, but we consider $\hat{y}$ a constant instead of a parameter, so we will not compute its gradient.

Considering the derivative with respect to a single element $y_i$, we see

$$\frac{\partial L}{\partial y_i} = \frac{y_i - \hat{y}_i}{n},$$

so that

$$\frac{\partial L}{\partial y} = \frac{1}{n} (y - \hat{y}).$$
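This loss and its gradient can be sketched in NumPy as follows, keeping the row-vector convention from above; the function names are illustrative:

```python
import numpy as np

# The mean squared error above and its gradient, treating y_hat as a
# constant. Shapes follow the row-vector convention; names are mine.

def mse(y, y_hat):
    n = y.shape[1]
    return np.sum((y - y_hat) ** 2) / (2 * n)

def mse_grad(y, y_hat):
    # dL/dy_i = (y_i - y_hat_i) / n
    return (y - y_hat) / y.shape[1]
```

During training, `mse_grad` provides the initial $\partial L / \partial y$ that is fed into the backward pass of the last layer.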