I have just finished the Stanford back propagation exercise, and to put it mildly, it was a ****.
So is back propagation complicated? And indeed what is it?
These are my notes so that I don’t have to go through all the pain when I do this again. I am not an expert and the agreement with Stanford is that we don’t give away the answer particularly at the level of code. So use with care and understand that this can’t tell you everything. You need to follow some lecture notes too.
Starting from the top: What is back propagation?
Back propagation is a numerical algorithm that allows us to calculate an economical formula for predicting something.
I am going to stick to the example that Stanford uses because the world of robotics seems infinitely more useful than my customary field of psychology. Professor Ng uses an example of handwriting recognition much as the Royal Mail must use for reading postal codes.
We scan a whole lot of digits and save each digit as a row of 1’s and 0’s representing ink being present on any one of 400 (20×20) pixels. Can you imagine it?
Other problems will always start the same way – with many cases or training examples, one to each row; and each example described by an extraordinary large number of features. Here we have 400 features or columns of X.
The second set of necessary input data is one last column labeling the row. If we are reading digits, this column will be made up of digits 0-9 (though 0 is written down as 10 for computing reasons). The digit is still 0 in reality and if we reconstructed the digit by arranging the 400 pixels, it will still be seen to the human eye as 0.
The task is to learn a shorthand way for a computer to see a similar scan of 400 pixels and say, aha, that’s a 1, or that’s a 2 and so on.
Of course the computer will not be 100% accurate but it will get well over 95% correct as we will see.
So that is the input data: a big matrix with examples along the rows of features and with the last column being the correct value – the digit from (10, 1-9) in this case.
How does back propagation work?
Back propagation programs work iteratively without any assumptions about statistics that we are used to in psych.
The computer boffins start by taking a wild guess of the importance of each pixel for a digit, and see what the computer would predict with those weights. That is called the forward pass.
Then based on what the computer got right or wrong, they work backwards to adjust the weights or importance of each pixel for each digit.
And remembering that computers are pretty fast, the computer can buzz back and forth asking “how’s this?”.
After a set number of trials, it stops improving itself and tells us how well it can read the digits, i.e., compares its answers to the right answers in the last column of our input data.
What is a hidden layer?
Back proagation also has another neat trick. Instead of using pixels to predict digits, it works with an intermediate or hidden layer. So the pixels predict some units in the hidden layer and the hidden layer predicts the digits. Choosing the number of units in the hidden layer is done by trying lots of versions (10 hidden units, 50 hidden units, etc) but I guess computer scientists can pick the range of the right answer as they get experienced with real world problems.
In this example, the solution worked with 25 hidden layers. That is, 400 pixels were used to make predictions about 25 units which predict which of 10 digits made the data.
The task of the computing scientist is to calculate the weights from the pixels to the hidden layers and from the hidden layers to the digits and then report the answer with a % of “training accuracy” – over 95%, for example.
Steps in back propagation
We have already covered the first four steps
Step 1: Training data
Get lots of training data with one example on each row and lots of features for each example in the columns.
Make sure the row is labeled correctly in the last column.
Step 2: Decide on the number of units in the hidden layer
Find out what other people have tried for similar problems and start there (that’s the limit of my knowledge so far).
Step 3: Initialize some weights
I said before, we start with wild guess. Actually we start with some tiny numbers but the numbers are random.
We need one set of weights linking each pixel to each hidden layer (25 x 400)* and another set linking each hidden layer to each digit (10 x 25)*.
The asterisk means that a bias factor might be added in raising one or the other number by 1. To keep things simple, I am not going to discuss the bias factor. I’ll just flag where it comes up. Be careful with them though because I am tired and they might be wrong.
Step 4: Calculate the first wildly inaccurate prediction of the digits
Use the input data and the weights to calculate initial values for the hidden layer.
Our input data of training examples and features (5000 examples by 400 pixels) is crossed with the appropriate initial random weights (25 x 400) to get a new matrix of hidden layer values. Each training example will have 25 new values (5000 x 25)*.
Then repeat again from the hidden layer to the layer of digits or output layer making another matrix of 5000 x 10.
In the very last step, the calculated value is converted into a probability with the well know sigmoid function. It would be familiar if you saw it. I’ll try to patch it in.
The values calculated at the hidden layer are converted into these probability-type values and they are used for the next step and the final answer is converted in the same way.
Now we have a probability type figure for each of 10 digits for each training example (5000 x 10)*.
Step 5: Find out how well we are doing
In this step, we first convert the correct answer (which was a 1, or 5, or 7 or whatever the digit was) into 1’s and 0’s – so we have another matrix (5000 x10).
We compare this with the one we calculated in Step 4 using simple subtraction and make yet another matrix (5000 x 10).
Step 6: The backward pass begins
So far so good. All pretty commonsensical. The fun starts when we have to find a way to adjust those guessed weights that we used at the start.
Staying at a commonsensical level, we will take error that we have in that big 5000 x 10 matrix calculated in Step 5 and partition it up so we can ‘track’ the error back to training examples and hidden layers and then from hidden layers to pixels. And this is what the computing scientists do. T
hey take one training example at a time (one of the 5000 rows), pick out the error for digit 1, and break it up. And do it again for digit 2 up to digit 0 (which we input as 10).
Step 7: Working with one training example at a time
It might seem odd to work with one training example at a time, and I suspect that is just a convenience for noobes, but stick with the program. If you don’t, life gets so complicated, you will feel like giving up.
So take example one, which is row 1; and do the stuff. And repeat for row 1, and so on until you are done.
In computing this is done with a loop: for 1: m where m is the number of training examples or rows (5000 in our case). The machine is happy doing the same thing 5000 times.
So we do everything we did before this step but we start by extracting our row of features: our X or training data how has 1 row and 400 features (1 x 400)*.
And we still have one label, or correct answer but remember we will turn that into a row of 1’s and 0’s. So if the right answer is 5, the row will be 0000100000 (1 x10).
And we can recalculate our error, or uplift the right row from matrix of observed values that we calculated in Step 6. The errors at the ‘output_layer’ will be a row of ten numbers (1 x 10). They can be positive or negative and the number bit will be less than 1.
Step 8: Now we have to figure out the error in the hidden layer
So we know our starting point of pixels (those never get changed), the correct label (never gets changed) and the error that we calculated for this particular forward pass or iteration. After we adjust the weights and make another forward pass, our errors change of course and hopefully get smaller.
We now want to work on the hidden layer, which of course is hidden. Actually it doesn’t exist. It is a mathematical convenience to set up this temporary “tab”. Nonetheless, we want to partition the errors we saw at the output layer back to the units in the hidden layer (25 in our case)*.
Just like we had at the output layer, where we had one row of errors (1 x 10), we now want a row or column of errors for the hidden layer (1 x25 or 25 x 1)*.
We work out this error by taking the weights we used in the forward pass and multiplying by the observed error and weighting again by another probabilistic value. This wasn’t explained all that well. I’ve seen other explanations and it makes intuitive sense. I suspect our version is something to do with computing.
So here goes. To take the error for hidden layer unit 1, we take the ten weights that we had linking that hidden unit to each digit. Or we can take the matrix of weights (10 x 25)* and match them against the row of observed errors (1 x 10). To do this with matrix algebra, then we turn the first matrix on its side (25 x 10) and the second on its side (10 x 1) and we the computer will not only multiply, it will add up as well giving us one column of errors (1 x25).* Actually we must weight each of these by the probabilistic type function that we called sigmoidGradient.
We put into sigmoidGradient a row for the training example that was calculated earlier on as the original data times the weights between the pixels and the hidden layer ((5000 x 400*) times (25 x 400*))– the latter is tipped on its side to perform matrix algebra and produce a matrix of 25* values for each training example (5000 x 25*).
Picking up the column of data that we calculated one paragraph up, we now have two columns (25* x1) which we multiple (in matrix algebra .* so we can do multiplication of columns like we do in Excel).
Now we have a column of errors for the hidden layer for this one particular training example (25* x1). (Our errors at the output layer for this person was in a row (1 x 10).
Step 9: Figure out how much to adjust the weights
Now we know how much error is in the output layer and the hidden layer, we can work on adjusting the weights.
Remember we have two sets of weights. Between the output and hidden layer we had (10 x 25*) and between the input layer and the hidden layer, we had (25 x 400*). We deal with each set of weights separately.
Taking the smaller one first (for no particular reason but that we start somewhere), we weight the values of the hidden layer with the amount of error in the output layer. Disoriented? I was. Let’s look again what we did before. Before we used the errors in the output layer to weight the weights between output and hidden layer and we weighted that with a probabilistic version of input data times the weights coming between input and hidden layers. That seemingly complicated calculation produced a set of errors – one for each hidden layer – just for this training example because we still working with just one row of data (see Step 8).
Now we are doing something similar but not the same at all. We take the same differences from the output layer (1 x10) and use them to weight the values of the hidden layer that we calculated on the forward pass (1×25*). This produces (and this is important) a matrix that will have the same proportions as the weights between the hidden and output layer. So if we have 10 output possibilities (as we do) and 25* units in the hidden layer, then at this stage we are calculating a 10 x 25* matrix.
So for each training example (original row), we have 250 little error scores, one for each combination of output and hidden units (in this case 10×25*).
Eventually we want to find the average of these little errors over all our training examples (all 5000), so we whisk this data out of the for loop into another matrix. As good programmers, we set this up before and filled it up with zeros (before the for loop started). As we loop over training examples, we just add in the numbers and we get a total of errors over all training examples (5000) for each of the combos of hidden unit and output unit (10 x25*).
And doing it again
We have a set of errors now for the connections between hidden and output layers. We need to do this again for the connections between the input layer and the hidden layer.
We already have the errors for the hidden layer (25* x1) (see Step 8). We use these to weight the input values (or maybe we should think of that the other way round – we use the input values to weight the differences).
We take the errors for the hidden layer (25 x1) and multiple by the row of original data ( 1 x 400*) and we will get a matrix of (25 x 400*) – just like our table of weights! You might notice I did not put an asterisk on the 25 x1 matrix. This is deliberate. At this point, we take out the bias factor that we put in before.
We do the same trick of storing the matrix of error codes (25 x 400*) in a blank matrix that we set up earlier and then adding the scores for the next training example, and then the next as we loop through all 5000.
Step 10: Moving on
Now we have got what we want: two matrices, exactly the same size as the matrices for the weights ( 25 x 400* and 10 x 25*). Inside these matrices are the errors added up over all training examples (5000).
To get the average, we just have to divide by the number of training examples (5000 in this case). In matrix algebra we just say – see that matrix? Divide every cell by m (the number of training examples). Done.
These matrices – one 25 x 400* and the other 10 x 25* are then used to calculate new tables of weights. And we rinse and repeat.
- Forward pass : make a new set of predictions
- Back propagation as I described above.
- Get two matrices of errors: yay!
- Recalculate weights.
- Stop when we have done enough.
The next questions are how are the weights recalculated and how do we know if we have done enough?
Recalculating weights
The code for the back propagation algorithm is contained within a function that has two purposes:
- To calculate the cost of a set of weights (average error in predictions if you like)
- And the matrices that we calculated to change the weights (also called gradients).
The program works in this order
- Some random weights
- Set up the step-size for learning (little or big guesses up or down) and number of iterations (forward/back ward passes)
- Call a specialized function for ‘advanced optimization’ – we could write a kluxy one but this is the one we are using
- The advanced optimizer calls our function.
- And then performs its own magic to update the weights.
- We get called again, do our thing, rinse and repeat.
How do we know we have done enough?
Mainly the program will stop at the number of iterations we have set. Then it works out the error rate at that point – how many digits are we getting right and how many not.
Oddly, we don’t want 100% because that would probably just mean we are picking up something quirky about our data. Mine eventually ran at around 98% meaning there is still human work and management of error to do if we are machine reading postal codes. At least that is what I am assuming.
There you have it. The outline of the back propagation. I haven’t taken into account the bias factor but I have stressed the size of the matrices all the way through, because if there is one thing I have learned, that’s how the computing guys make sure they aren’t getting muddled up. So we should too.
So now I will go through and add an * where the bias factor would come into play.
Hope this helps. I hope it helps me when I try to do this again. Good luck!
The regularization parameter
Ah, nearly forgot – the regularization parameter. Those values – those little bits of error in the two matrices that are the same size as the weights – (25×400*) and (10×25*)?
Each cell in the matrix except for the first column in each which represents the bias factor, must be adjusted slightly by a regularization parameter before we are done and hand the matrices over to the bigger program
The formula is pretty simple. It is just the theta value for that cell times by the learning rate (set in the main program) and divided by the number of training cases. Each of the two matrices is adjusted separately. A relatively trivial bit of arithmetic.