For me, backpropagation has always been the holy grail of Artificial Neural Network (ANN) training, but until recently my only way of training an ANN was with genetic algorithms, which i something I will hopefully cover in another post.
For some people, it might be useful to jump to “Conclusion & Recap” at the bottom first, where you will find a quick walkthrough of all the different elements I’ll be describing.
If we go back a few years, to when I first read about backpropagation, and wanted to understand this unique way of training an ANN, it seemed really complicated, and I was quickly scared away by the many big, scary mathematical definitions and calculations, which I had never studied before (Ex. Linear algebra).
Despite this, I recently took up the challenge again, and made another attempt to understand backpropagation and I finally understood the concept! It quickly struck me how SIMPLE standard gradient descent really was! And how easily gradient descent could have been explained to me… Because gradient descent is, once you understand it, nothing more than multiplying a few factors together, but we’ll get to that soon.
Previous mathematical knowledge expected from the reader will be minimal, and mostly limited to knowing what a gradient is.
(In the bottom of this post I’ll link some exceptional material which helped me understand this topic, so you’ll have extra material if needed.)
The Goal Of Backpropagation / Gradient Descent?
When we want to train an ANN to perform better, we “search” for the most optimal weights, so our networks prediction will be closer to the expected output.
ANN’s often have a large quantity of weights, meaning a brute force tactic is already out of the window. The curse of dimensionality quickly forces us to use multiple years, if not lifetimes, to get the best set of parameters. To combat this curse, we introduce gradient descent where the principle is simple: We calculate the derivative / slope of the prediction error, with respect to a selected weight, and then take a small step downhill to increase our performance.
Ultimately, this will move us a small step closer to a better solution, so by performing this process multiple times, we will continually move our network closer to a better prediction until a minimum is reached.
The Chain Rule.
Okay, so we now know we need to calculate the derivative of the prediction error with respect to the weights, to improve our network. But how do we calculate these derivatives?
But what is the chain rule? The chain rule simply states that the derivative of f(g(x)) is f'(x) * g'(x).
It can be illustrated like this:
How Does This Apply To Artificial Neural Networks?
Let’s start with a simple case and move our way up. Imagine a single layer perception as shown below:
We want to calculate the slope / derivative of the Error with respect to W, so we know if we have to increase or decrease the weight. For now, just forget about how we calculate the error, and which activation function our perceptron uses, so that we can focus on the chain rule.
Our goal is clear: We want to calculate the derivative of the Error with respect to W. The way to get there is of course the chain rule, and by using it we get the following formula:
But why is this true? Let’s look at how we calculate Net, Out, and Error.
Seeing the definitions above should make it clear how each part is dependent on the next. Now let’s look at how each individual part is calculated.
The Cost Function / Prediction Error.
The Error is defined as:
But how is Cost() defined? One of the most popular cost functions is the modified MSE cost function, which is defined as:
But we are more interested in Cost-Prime, or more specifically the derivative of the cost with respect to Out. This defines how the output of the node affects the overall error of the network, Which for the modified MSE cost is defined as:
You can confirm this by calculating the derivative yourself.
Out is defined as
But how is this Activation() defined? The most popular, as you might know, is the sigmoid activation function, which is defined as:
And again, we are more interested in the Sigmoid-Prime function with respect to Net, which is defined as:
If you’re interested in why, then look at this.
Net is defined as
Where the derivative with respect to W is equal to x, or more precisely the output of the node at the start of the weight, which in this case is the input node.
The Derivative With Respect To w:
We now know how to calculate each part of the formula, which gives us the following:
Once the derivative has been calculated, we can try to update our weight accordingly.
When the derivative is positive, we must lower our weight and reverse if it’s negative to walk downhill.
The reason behind this is that a positive derivative tells us that the function is currently increasing along with the weight. Since we want it to decrease, we have to do the opposite of the derivative.
The equation for updating the weight is defined as:
Where w’ is the updated weight, w is the original weight and r being the learning rate.
The learning rate is a term we haven’t yet covered, it defines how big the steps downhill should be taken. Having the learning rate too small could increase the overall training time, but having the learning rate too large could make the network jump out of a minima, therefore increasing the overall error instead of decreasing it.
Let’s try to train a network to reverse it’s input. So 1 becomes 0 and vice versa.
(This wont actually be possible with a network of this size)
We’ll use the same perceptron as above, and try to calculate the derivative with respect to w.
Let the input be 1 and target output 0, with an initial weight of 0.5.
Our initial hypothesis is that the derivative will be positive. Increasing the weight should increase the output too, since the sigmoid activation function increases it’s output when the input is increased.
Therefore, we will have to lower our weight accordingly.
Passing the input forward into our network gives the output: 0.62.
Which means the derivative of the error with respect to the weight is calculated as:
The derivative with respect to the weight w was positive, therefore our initial hypothesis still stands. I said that we would have to lower our weight w, if we wanted our prediction to be closer to 0, this can quickly be confirmed, by calculating the output of the network with a weight w = 0.35. Which will give the output: 0.59 instead of the initial 0.62.
How Is This Applied To A Larger Neural Network?
Once you understand the simple case above, expanding it to a larger network is not hard. Let’s look at the larger network below.
The Keyword: Node Effect.
Now I don’t know if this term is generally accepted or anything, but when I had to understand how to apply this to a larger network, I came up with this term and it really helped me!
So, you might be wondering, what is a nodes effect?
A nodes effect is defined by how it’s output affects the overall network error / cost, which is defined as:
The quick student will have noticed that this equation was shown above in the chain rule example. Which changes the equation to:
Which can be rewritten to
So, whenever you want to calculate the derivative of ANY weight in the network, you just multiply the effect of the node at the right side of the weight with the sigmoidPrime function of it’s output with the output of the node on the left side of the weight! That is how simple it actually is, it’s no harder than multiplying a few numbers together.
Calculating The Node Effect.
Now, how to calculate the nodes effect depends on which type of layer you are looking at. Since the input layer doesn’t have any weights going into it, we don’t have to calculate it’s effect since it’ll never be used.
The Input Layer:
The total error of the network is calculated as the sum of errors over each output neuron, and since we have two, the equation becomes:
Since we only focus on the effect of one node, and the other node will be constant, the effect of any input layer node will be equal to the cost-prime function, which to recall is:
Any Hidden Layer:
The effect of any node in any hidden layer, is calculated as the sum of effects in the next layer multiplied by their respective weight:
So, for our network above, the effect of the first hidden node in the first hidden layer would be equal to:
This applies to any node in any hidden layer, for example the very last hidden layer right before the output layer is calculated in a similar manner:
Conclusion & Recap:
There you have it, how to perform backpropagation 😉 Feel free to comment below if anything is misleading or something is hard to understand, and I’ll happily reply and even update the post accordingly.
Just for a quick recap:
The chain rule:
The derivative of a function inside a function is calculated as the product of their individual derivatives, by applying the chain rule.
The effect of a node:
The effect of a node is my own personal term, and describes the effect the output of a node has on the overall error. Mathematically it’s defined as:
For a node in the input layer it’s calculated as:
For a node in any hidden layer it’s calculated as:
The derivative of any weight in the network:
To calculate the derivative of any weight in the network we use the following equation:
Where NodeEffect and SigmoidPrime is applied to the node on the right side of the weight, and out_prev is the output of the node on the left side of the weight.
Updating the weights:
Once the derivative for a weight has been calculated, we need to update it accordingly. We use the following equation:
r is the learning rate, which describes how fast we should adapt to changes. A small rate will increase training time, but too large of a rate could decrease learning overall. So, a middle ground is preferred.
As mentioned once before, you have to calculate all the new weights and then update all the weights at once, because the actual effect of a node has to be calculated using the original weights, not the updated ones.
Feel free to comment below.