In this post, I am going to describe what I have just learned from Andrew Ng at Stanford about “learning curves”. To a computer scientist, a learning curve is roughly what you might expect, but it describes how well the data has been modelled rather than how quickly a person learns.
I write this as a classically trained psychologist and it is clear that if we are to understand machine learning, we have to watch out for where the thinking of computer scientists differs radically from our own. This is my commonsensical comparison of the two approaches. I am writing it down to make sure I have followed what I heard. It is rough and ready but may help you understand the differences between the two disciplines.
A learning curve in CS
Simply put, the computer scientists take random samples of data, where the first sample is very small (let’s say 1, because that is helpful for understanding the logic) and the last sample is large, let’s say a few thousand. These are all random samples drawn from the same large data set.
Generally, with a sample of 1 to 3, we can model perfectly. However, when we try the same model with another sample of the same size, the model will not predict well at all. The amounts of error for the experimental sample and the comparison sample will be hugely different. So far so good. That’s what we all learned at uni. Modelling on a small sample is the equivalent of an ‘anecdote’. Whatever we observed may or may not transfer to other situations.
As we increase our sample size, paradoxically the amount of error in our model increases but the amount of error in our comparison situation decreases. And ultimately, the error we are making in the two situations converges. We also know this from uni.
Much of our training goes into getting us to do this: to increase the sample size so that the error in the hypothetical model goes up and the error in the comparison model goes down. Plot this on a piece of paper, with error on the y axis and sample size on the x axis.
When the two error rates converge (that is, when we can explain the future as well as we can explain the present), we stop and say, “Hey, I have found a scientific law!”
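To make that picture concrete, here is a minimal sketch in Python (NumPy only). The noisy straight-line “world”, the sample sizes and the use of mean squared error are my own invented illustration, not anything taken from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # Synthetic "world": a noisy linear relationship (an invented example).
    x = rng.uniform(0, 10, n)
    y = 2.0 * x + 1.0 + rng.normal(0, 2.0, n)
    return x, y

# A large comparison ("validation") sample that stands in for the future.
x_val, y_val = make_data(2000)

for n in [2, 3, 10, 30, 100, 300, 1000]:
    x_tr, y_tr = make_data(n)                 # the experimental sample
    coeffs = np.polyfit(x_tr, y_tr, deg=1)    # fit a simple straight-line model
    err_train = np.mean((np.polyval(coeffs, x_tr) - y_tr) ** 2)
    err_val = np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)
    print(f"n={n:5d}  training error={err_train:8.2f}  comparison error={err_val:8.2f}")
```

The printout is the learning curve in table form: the training error starts near zero and rises towards the noise level, while the comparison error starts high and falls towards the same level, which is the convergence described above.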
I would say that our willingness to tolerate a more general description of a particular situation, so that we can generalize at the same level of accuracy (and inaccuracy) to another situation, is one of the hallmarks of uni training. This is so counter-intuitive that many people resist it, so it takes uni training to get us to do it.
What the computer scientists implicitly point out is that the converse is also true. We can now explain the future, but only as badly as we explain the present! They call this underfitting and suggest that we try another model to see if we can do a better job of explaining the present. So we stop increasing the sample size and start playing with the model. We can vary the form of the model, typically moving from a linear to a non-linear model (that is, adding more features), and we can let the parameters take on larger weights (going from a stiff, flat kind of model to a more flexible one, if you like).
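As a rough sketch of that step (my own toy example, not from the course): when the underlying relationship is curved, a straight line underfits, and adding a single non-linear feature brings both errors down together.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(n):
    # Invented example where the true relationship is curved (quadratic).
    x = rng.uniform(-3, 3, n)
    y = x ** 2 + rng.normal(0, 0.5, n)
    return x, y

x_tr, y_tr = make_data(200)    # experimental sample
x_va, y_va = make_data(200)    # comparison sample

for degree in (1, 2):          # straight line vs. one added non-linear term (x^2)
    coeffs = np.polyfit(x_tr, y_tr, deg=degree)
    err_tr = np.mean((np.polyval(coeffs, x_tr) - y_tr) ** 2)
    err_va = np.mean((np.polyval(coeffs, x_va) - y_va) ** 2)
    print(f"degree {degree}: training error={err_tr:.2f}  comparison error={err_va:.2f}")

# Degree 1 underfits: both errors are high and roughly equal.
# Degree 2 (the richer model) drives both errors down together.
```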
They do this until the model overfits: that is, until our explanation of the present is very good but the same explanation produces errors in comparison situations. When they reach this point, they backtrack to a less complicated model (fewer non-linear terms) and decrease the weights of the parameters (taking note of a feature but not putting too much emphasis on it).
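And the overfit-then-backtrack move, again as a toy sketch; the small sample size and the polynomial degrees are illustrative choices of mine:

```python
import numpy as np

rng = np.random.default_rng(2)

def make_data(n):
    # The same invented curved relationship, on a smaller scale.
    x = rng.uniform(-1, 1, n)
    y = x ** 2 + rng.normal(0, 0.1, n)
    return x, y

x_tr, y_tr = make_data(15)     # a deliberately small experimental sample
x_va, y_va = make_data(500)    # comparison sample

for degree in (10, 2):         # an over-complicated model, then the backtracked one
    coeffs = np.polyfit(x_tr, y_tr, deg=degree)
    err_tr = np.mean((np.polyval(coeffs, x_tr) - y_tr) ** 2)
    err_va = np.mean((np.polyval(coeffs, x_va) - y_va) ** 2)
    print(f"degree {degree:2d}: training error={err_tr:.4f}  comparison error={err_va:.4f}")

# Degree 10 nearly memorises the 15 points (tiny training error) but does
# worse on the comparison sample; backtracking to degree 2 restores the
# balance. Shrinking the weights with a regularisation penalty is the other
# lever the post mentions for the same problem.
```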
Once they have found this happy middle ground with a more complicated model, but without the expense of collecting more data, they will try it out on a completely new set of data.
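One plausible way to set that final check up in code is a three-way split in which the last slice of data is scored exactly once; the split proportions and the degree-selection loop below are my assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, 900)
y = x ** 2 + rng.normal(0, 0.1, 900)

# Three-way split: fit on the first slice, choose the model on the second,
# and touch the third slice only once, at the very end.
x_tr, x_va, x_te = x[:300], x[300:600], x[600:]
y_tr, y_va, y_te = y[:300], y[300:600], y[600:]

def mse(coeffs, xs, ys):
    return np.mean((np.polyval(coeffs, xs) - ys) ** 2)

# Choose the polynomial degree with the lowest error on the comparison slice.
fits = {d: np.polyfit(x_tr, y_tr, deg=d) for d in range(1, 8)}
best_d = min(fits, key=lambda d: mse(fits[d], x_va, y_va))

print(f"chosen degree: {best_d}")
print(f"error on completely new data: {mse(fits[best_d], x_te, y_te):.4f}")
```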
Break with common practice in psychology
For any psychologists reading this:
- This kind of thinking provides us with a possibility of getting away from models that have been stagnant for decades. Many of these models predict the present so-so and the future so-so. Here is the opportunity to break away.
- Note that machine learning specialists use procedures that look like statistics but abandon the central idea of statistics. They aren’t promising that their original sample was randomly chosen and they aren’t directly interested in the assertion that “if and only if our original sample was random, then what we found in the sample generalizes to other samples that have also been chosen randomly”. Though they do something similar (taking lots of randomly chosen slices of data from the data they have), they aren’t in the business of asserting the world will never change again. They have high speed computers to crunch more data when it becomes clear that the world has changed (or that our model of the world is slightly off).
- Many of the rules of thumb that we were once taught fall away. Specifically, get a large sample, keep the number of features below the size of the sample, keep the model simple – these prescriptions are not relevant once we change our starting point. All we want to find is the model that can generalize from one situation to another with the least error, and high speed computers allow us both to use more complicated models and to recompute them when the world they describe changes.
I have yet to see good working examples outside marketing on the one hand and robotics on the other, but it seemed worthwhile trying to describe the mental shift that a classically trained psychologist will go through. Hope this helps.