Learning curves and modelling in machine learning

In this post, I am going to describe what I have just learned from Andrew Ng at Stanford about “learning curves”.  To a computer scientist, a learning curve is roughly what you might expect, but it describes how well data has been modelled rather than how quickly a person learns.

I write this as a classically trained psychologist, and it is clear that if we are to understand machine learning, we have to watch out for the places where the thinking of computer scientists differs radically from our own.  This is my commonsensical comparison of the two approaches.  I am writing it down to make sure I have followed what I heard.  It is rough and ready, but it may help you understand the differences between the two disciplines.

A learning curve in CS

Simply put, the computer scientists take random samples from the same large data set, where the first sample is very small (let’s say 1, because that is helpful for understanding the logic) and the last sample is large (let’s say a few thousand).

Generally, with a sample of 1 up to 3, we can model perfectly.  However, when we try the same model on another sample of the same size, it will not predict well at all.  The amounts of error for the experimental sample and the comparison sample will be hugely different.  So far so good; that’s what we all learned at uni.  Modelling on a small sample is the equivalent of an ‘anecdote’: whatever we observed may or may not transfer to other situations.

As we increase our sample size, paradoxically the amount of error in our model increases, but the amount of error in our comparison situation decreases.  Ultimately, the error we make in the two situations converges.  We also know this from uni.

Much of our training goes into getting us to do exactly this: increase the sample size so that the error in the hypothetical model goes up and the error in the comparison model comes down.  Plot this on a piece of paper, with error on the y-axis and sample size on the x-axis.

When the two error rates converge, that is, when we can explain the future as well as we can explain the present, we stop and say, “Hey, I have found a scientific law!”
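The whole procedure can be sketched in a few lines of Python.  Everything here is invented for illustration (the synthetic data set, the linear model, the sample sizes); the point is only to watch the error on the modelled sample rise towards the error on the comparison sample as the sample grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented "large data set": y is a noisy linear function of x.
n = 2000
x = rng.uniform(-3, 3, size=n)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=n)

# Hold out a comparison ("validation") sample once.
x_pool, y_pool = x[:1500], y[:1500]
x_val, y_val = x[1500:], y[1500:]

def mse(pred, target):
    return float(np.mean((pred - target) ** 2))

def fit_line(xs, ys):
    """Least-squares fit of y = w*x + b; returns (w, b)."""
    A = np.column_stack([xs, np.ones_like(xs)])
    (w, b), *_ = np.linalg.lstsq(A, ys, rcond=None)
    return w, b

sizes = [2, 5, 10, 50, 200, 1000, 1500]
train_errors, val_errors = [], []
for m in sizes:
    w, b = fit_line(x_pool[:m], y_pool[:m])
    train_errors.append(mse(w * x_pool[:m] + b, y_pool[:m]))  # error on the sample we modelled
    val_errors.append(mse(w * x_val + b, y_val))              # error on the comparison sample

for m, tr, va in zip(sizes, train_errors, val_errors):
    print(f"m={m:5d}  modelled sample={tr:7.3f}  comparison sample={va:7.3f}")
```

Plotted with error on the y-axis and sample size on the x-axis, `train_errors` climbs, `val_errors` falls, and the gap between the two curves shrinks: that is the learning curve.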

I would say that our willingness to tolerate a more general description of a particular situation, so that we can generalize at the same level of accuracy (and inaccuracy) to another situation, is one of the hallmarks of uni training. This is so counter-intuitive that many people resist it; it takes uni training to get us to do it.

What the computer scientists implicitly point out is that the converse is also true: we are now able to explain the future only as badly as we explain the present!  They call this underfitting and suggest that we try another model to see if we can do a better job of explaining the present.  So we stop increasing the sample size and start playing with the model. We can vary the form of the model, typically moving from a linear to a non-linear model (that is, adding more features), and relax the regularization so that the weights of the parameters are free to grow (go from a stiff model to a looser, floppier one, if you like).
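A minimal sketch of that move, again with an invented data set: when the truth has a curve in it, a straight line underfits and both errors stay high together, and adding a squared feature brings both down at once.

```python
import numpy as np

rng = np.random.default_rng(1)

# Data with a genuine curve in it: a straight line will underfit.
x = rng.uniform(-3, 3, size=400)
y = x ** 2 + rng.normal(scale=0.5, size=400)
x_tr, y_tr = x[:300], y[:300]   # the sample we model
x_va, y_va = x[300:], y[300:]   # the comparison sample

def mse(pred, target):
    return float(np.mean((pred - target) ** 2))

# Linear model (degree 1): high error on BOTH samples -> underfitting.
lin = np.polyfit(x_tr, y_tr, deg=1)
lin_train = mse(np.polyval(lin, x_tr), y_tr)
lin_val = mse(np.polyval(lin, x_va), y_va)

# Add a non-linear feature (degree 2): both errors fall together.
quad = np.polyfit(x_tr, y_tr, deg=2)
quad_train = mse(np.polyval(quad, x_tr), y_tr)
quad_val = mse(np.polyval(quad, x_va), y_va)

print(f"linear:    modelled {lin_train:.2f}, comparison {lin_val:.2f}")
print(f"quadratic: modelled {quad_train:.2f}, comparison {quad_val:.2f}")
```

The diagnostic reading: because the straight line is bad on the modelled sample and the comparison sample alike, more data would not have helped; a richer model does.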

They do this until the model overfits; that is, until our explanation of the present is very good but the same explanation produces large errors in comparison situations.  When they reach this point, they backtrack to a less complicated model (fewer non-linear terms) and increase the regularization to shrink the weights of the parameters (take note of a feature, but do not put too much emphasis on it).

Once they have found this happy middle ground with a more complicated model, but without the expense of collecting more data, they will try it out on a completely new set of data.

Break with common practice in psychology

For any psychologists reading this:

  • This kind of thinking provides us with a possibility of getting away from models that have been stagnant for decades.  Many of these models predict the present so-so and the future so-so.  Here is the opportunity to break away.
  • Note that machine learning specialists use procedures that look like statistics but abandon the central idea of statistics.  They aren’t promising that their original sample was randomly chosen and they aren’t directly interested in the assertion that “if and only if our original sample was random, then what we found in the sample generalizes to other samples that have also been chosen randomly”.  Though they do something similar (taking lots of randomly chosen slices of data from the data they have), they aren’t in the business of asserting the world will never change again.  They have high speed computers to crunch more data when it becomes clear that the world has changed (or that our model of the world is slightly off).
  • Many of the rules-of-thumb that we were once taught fall away. Specifically: get a large sample, keep the number of features below the size of the sample, keep the model simple.  These prescriptions are not relevant once we change our starting point.  All we want to find is the model that generalizes from one situation to another with the least error, and high-speed computers allow us both to use more complicated models and to recompute them when the world they describe changes.

I have yet to see good working examples outside marketing on the one hand and robotics on the other, but it seemed worthwhile trying to describe the mental shift that a classically trained psychologist will go through.  Hope this helps.


Education level that was good for the top 3% is now necessary for all but the bottom 3%

Life can only be understood backwards; but it must be lived forwards. — Soren Kierkegaard

Good to remember!

Though most people are better at living forwards than at understanding backwards.

In industrial psychology, we distinguish between “tracking” and “diagnosis”.

Take a pilot landing a large plane, for example. They assimilate a lot of information, which changes before they have time to put it into words, and bring the plane down, hopefully, to a gentle landing at high speed.

When, God forbid, something goes wrong, highly trained investigators come in to work out what happened. The investigators aren’t likely to be pilots, and they probably don’t land fully laden passenger jets themselves.

We have specially trained people to think backwards.

In factories, we make the same distinction. We have hands-on people who keep complex, continuous flow plants going, safely.  It’s as demanding as landing a plane.

Yet, the day the process breaks down, we call the process engineers. They work out what went wrong and bring science to bear to figure out what the factory managers can do to get the plant going again.

The two groups of people aren’t interchangeable. Simply, the managers think forwards; the engineers think backwards.

Usually the engineers are more highly educated. They often earn more.

But they aren’t “line”. And the “line” thinks they are egg-heads because they can’t do the “real thing”.

So it is funny that we have to be reminded not to think backwards. Most of us don’t. Most of us can’t. We need experts for that.

In the future, we might have to think forwards as well as backwards

What has been puzzling me recently, or truthfully what is in my in-basket marked “puzzles”, is how the “design-thinking” approach to management will change this divide.

Take Toyota, for example. Every worker on the assembly line is capable of doing quite sophisticated experiments.  They use statistics equivalent to Honours level in any subject except statistics itself.   The two types of work seem to be merging.

The idea of ‘failing informatively’ will also change what professions like engineering and psychology learn and contribute in the workplace. We will not only be required to diagnose what went wrong; we will be required to play a more hands-on role in moving things forward.

This is the age of statistics

The attitude of Google to data makes simple A/B experiments a day-to-day job rather than the job of an expensive graduate. The burgeoning use of good visuals makes statistics a discipline of communication.
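As an illustration of how routine such a comparison can be, here is a two-proportion z-test on made-up conversion counts.  The numbers are invented, and the test itself is just the classical one, nothing Google-specific.

```python
from math import sqrt, erf

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for H0: the two conversion rates are equal."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)          # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Normal CDF via erf; two-sided p-value.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Invented counts: version A converts 120/1000, version B converts 150/1000.
z, p = two_proportion_z(conv_a=120, n_a=1000, conv_b=150, n_b=1000)
print(f"z = {z:.2f}, p = {p:.3f}")
```

The point is not the formula but the workflow: with traffic logged automatically, running this comparison is a line of code rather than a consulting engagement.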

I sense there is more to this change than I am saying here. What is clear though, is that the education levels that used to be regarded as the preserve of the top 3% of the population are now necessary for all but the bottom 3%.  Necessary. Not optional.

How can every child learn statistics?

So what are we going to do about illiteracy in Western countries?   It amazes me that people who cannot read books can play computer games quite well.

So I doubt this is a real problem. We need to get kids into factories where they can see statistics being used.

And then they can teach us!