
Month: November 2011

10 questions to get ready for moving past the financial crisis

In my rough-and-ready barometer of where we are in dealing with the financial crisis, I note that #occupylsx has got people talking.  That’s great. Instead of mumbling into our beer, we are talking.  That’s a far cry from doing, of course.

Using the denial, anger, bargaining, depression, adjustment cycle, I reckon we are up to bargaining. We still believe we are going to make this go away.

When we do get around to adjusting and getting on with life, we are going to need to be well informed about what we can do and whom we can do it with.  If you intend to be around when we get round to sorting things out, these questions might help disentangle the issues.

What do you feel, and what do others feel about these issues?

#1 Are people in the UK angry?

#2 Are all people in the UK angry to the same degree?  And if not, who is more or less angry and what has led us to the opinion that our levels of anger differ?

#3 Is everyone who is angry, angry about the same things?  And with the same people?

#4 Who is angry with you, and how do you feel about some people being angry with you?

#5 How many people in the UK are of working age? How many people in the UK work in banks and the financial services?  How many people do you know who work in these industries?

#6 How much money does our government need each year to run the schools, the hospitals, the roads, the police, the fire service, the army, the navy, the air force?

#7 How much money do the banks and financial services kick in to the cost of running our government?

#8 What are the various things we could “do” to the banks?  Which three seem to be the most popular?

#9  When we “do” these things to the banks, what jobs will be created and which will be lost?  Who will be the winners and losers?

#10 When we “do” these things to the banks, what will be the amount they kick in to the cost of running the government?  Will that be more or less, and if it is less, how can we make up the shortfall?



10 steps to build a spam catcher

Here are the ten broad steps to build a spam catcher.

  1. Get a sample of emails that are known to be spam or not spam. Split the sample 60:20:20 to provide a “training” set, a “cross-validation” set and a “test” set.
  2. Turn each email into a list of words by
    • Stripping out headers (if not part of the spam test) and other redundancies
    • Running NLP software to record the stem of a word only (for example, record city and cities as cit)
  3. Count the number of times each unique word appears in the sample and order the list so that we can use the top 100 or 10 000 or 50 000 (whatever) to check for spam.  Remember to use stemmed words!
  4. Convert the list of words in each email into a list of look-up numbers by substituting the row number of the word from the dictionary we made in Step 3.
  5. For each email, make another list where row 1 is 1 if the first word in the dictionary is present in the email, where row 2 is 1 if the second word in the dictionary is present in the email, and so on. If the word is not present, leave the value for that row as zero. You should now have as many lists as you have emails, each with as many rows as you have words in your spam dictionary.
  6. Run an SVM algorithm to predict whether each email is spam (1) or not spam (0).  The input is the list of 1s and 0s indicating which words are present in the email.
  7. Compare the predictions with the known values and compute the percentage correct.
  8. Compute the predictions on the cross-validation set and tweak the algorithm depending on whether the cross-validation accuracy is much the same as the training accuracy while both are disappointing (suggesting the model could be stronger) or much lower than the training accuracy (suggesting the model is too strong and is fitting quirks of the training set).
  9. Find the words most associated with spam.
  10. Repeat as required.
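
Steps 2 to 5 above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the four toy emails, the `crude_stem` helper (a rough stand-in for a real stemmer from an NLP library) and every function name here are hypothetical, and the SVM of Step 6 is not shown.

```python
import re
from collections import Counter

# Toy labelled sample (hypothetical): 1 = spam, 0 = not spam.
emails = [
    ("Win cash prizes now, claim your prize", 1),
    ("Cheap pills, win big cash", 1),
    ("Meeting notes from the cities project", 0),
    ("Lunch in the city on Friday?", 0),
]

def crude_stem(word):
    """Very rough stand-in for a real stemmer: strips a few common endings."""
    for ending in ("ies", "es", "s", "y"):
        if word.endswith(ending) and len(word) > len(ending) + 2:
            return word[: -len(ending)]
    return word

def tokenize(text):
    """Step 2: lower-case, keep alphabetic words only, and stem them."""
    return [crude_stem(w) for w in re.findall(r"[a-z]+", text.lower())]

def build_dictionary(token_lists, size):
    """Step 3: the `size` most frequent stems, mapped to row numbers."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    return {word: row for row, (word, _) in enumerate(counts.most_common(size))}

def to_indices(tokens, dictionary):
    """Step 4: replace each known word by its row number."""
    return [dictionary[t] for t in tokens if t in dictionary]

def to_features(indices, size):
    """Step 5: a list of 0s and 1s, one row per dictionary word."""
    features = [0] * size
    for i in indices:
        features[i] = 1
    return features

token_lists = [tokenize(text) for text, _ in emails]
dictionary = build_dictionary(token_lists, size=10)
X = [to_features(to_indices(t, dictionary), len(dictionary)) for t in token_lists]
y = [label for _, label in emails]
```

From here, Step 6 would hand `X` and `y` to an SVM from a library (scikit-learn's `SVC` is one option). The stemmer is deliberately crude, so it treats "prize" and "prizes" as different words; a real stemmer would do better.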



See the financial crisis as a chess game with 4 pieces

Since the Northern Rock crash (when was it, 2007?), I’ve been using the Kübler-Ross Grief Cycle to track where we are in dealing with the financial crisis.

Kübler-Ross Grief Cycle and the Financial Crisis

  • Denial took a looooong time. We got Lehman’s in 2008.  I think people have got it now.  We know we are in trouble.
  • Anger started kicking in when? At first we were vaguely angry with Lehman’s. Then we muttered when our own incomes were affected.  And possibly we took our anger out on targets as various as the local corner shop and politicians.
  • Bargaining is next . . . if I do this, then . . . then the problem is going to go away.  A well-educated, experienced American knowledge entrepreneur over here in the UK looking for backing put the mind-set well.  “Promises were made”.  “The middle-class were promised . . .”  That is what bargaining means. We think we can still make the problem go away and with very little effort at that.
  • Depression should follow in this rough order of psychological states.  We might have thought that Western countries have been depressed for a long time with most people “sleep walking” through life, nursing a hangover and waiting for early retirement.   So I am not looking forward to seeing what that looks like in a more severe form.  At the rate we are going, next year maybe?
  • Adjustment  . .  and eventually we get sick of being depressed so we get out of bed one morning and decide to get on with it.

Being positive in the face of the worst financial crisis

Being an impatient soul, I’ve kept an eye out for simple models that can help people Act, Do, & Get On With It.

On Al Jazeera today, there is a blog by Mohamed A. El-Erian, CEO and co-CIO of PIMCO and author of When Markets Collide.

El-Erian finishes his article reassuring us that we

need not be paralysed by uncertainty and anxiety. Instead, we can use this simple framework to monitor developments, learn from them, and adapt. Yes, there will still be volatility, unusual strains, and historically odd outcomes. But, remember, a global paradigm shift implies a significant change in opportunities, and not just risks.

A framework for understanding how the financial crisis will unfold

So what is the framework?  El-Erian suggests that each country, or large community, will decide for itself what it will do about four things.  Each community will make a move. Then we will watch  how it goes (and what everyone else does).  And then we will make another move.

We don’t know the outcome of our collective actions in advance but we can think of this as a game of chess with four pieces each (instead of 6) and many many players (not just 2).

The four pieces at each community’s disposal


Deleveraging

We have overspent and “borrowed from the future”.  Whatever we cannot pay to the future, the future must write off as a bad debt.  That’s the stark situation that we are in.

In a large community, we are going to divide up the bad debt between us.  The question is who should pay more and who should pay less.

Countries squabble at home about the formula for dividing up the debt, and the formula is important, but this squabble goes under a separate heading below.

First, the country as a whole must deleverage.  If we don’t take responsibility for our overspending (sins of the fathers visited on the sons etc), then our creditors will take charge of our assets.  This is called getting a bailout from the IMF.  If you have ever seen this in play, you don’t want to go there. Believe me.

Economic growth

OK, we got into debt because we were partying and spending more than we earned.  How are we going to earn more?

Sometimes the problem is structural. It is hard for me to get off my proverbial backside because I am locked into a system.  So what is locking down the energy in a country?  That is the question we ask.

Sadly sometimes, lack of growth is not economic. It is psychological. A tweet went the rounds this morning:  nothing will happen while we hope to become members of the billionaire club.  When the easy life is the focus, we ain’t going to be growing.  If our goal is early retirement – enough said?

So let’s hear from the economists. And while they have their arguments, the rest of us can focus relentlessly on what can be done and work with people who want to work.  And I mean focus relentlessly.  It is so easy to get distracted.

Social justice

Now the biggie.  Social justice has declined in the west. And there we were selling democracy to whoever would buy. So the Occupy movement use as their catch phrase: we are the 99%.

The real issue is still economic.  In my naïve economic take, the issue is how do we accumulate capital and what do we use it for?

When a government makes good free schooling available for a child from year 5 to year 16, we are investing our savings in that child.  And we expect a return. When they are older, not only will they join in sophisticated businesses that already exist, they will invent new businesses and keep the show on the road when we are old and are slowing down.  Education is just capital accumulation one person at a time.

Much of the problem in the west is that money has gone into partying at all levels.   Money accumulates where it isn’t doing the work of capital – by which I mean taking from the present to invest in the future.  We’ve been doing it backwards.  Leaving money doing very little while we borrow from the future to pay for today’s party.

The talk is presently of who gets what. That is still partying.  We must put our shared capital where it can make a difference.  Education and health are no-brainers.  We also have to look at all our assumptions about where we invest and why.  Simply, if you underpay a parent, you are stripping them of capital. So don’t talk in the next breath about investing in early childhood education. You are frankly talking nonsense. Why create the problem in the first place?  Personally, I’d look at all the laws that help keep workers insufficiently paid.  The simple test is: could you run a household on the same income?  Could you?  If not, you are running down the capital base of the country.

But the problem seems to be large and the real key seems to be to start to move forward somewhere.  The #occupy movement is people beginning to unravel the mess.  As El-Erian says, watch them. And don’t feel powerless. Make your choices.  Wherever Occupy ends, it is part of this piece on the board: Social Justice.


Leadership

The fourth piece that we have to play with is leadership.  We complain endlessly about our politicians. El-Erian doesn’t say much in his blog. Maybe he talks about them more in his book.

Personally, I suspect that our leaders are not the piece. It is our attitude to leaders that is the piece.  I believe our leaders reflect us. Maybe when people become more accustomed to being politically active, then we will get better leaders.  OK. You tell me.

So if you are impatient waiting for people to move through the grief cycle, try seeing the world as a chequerboard with 4 pieces and many many players.  Track the apparent confusion and perhaps we can see what is happening, what is going to happen and what we can help make happen.


An organization: a place where we progressively learn to take responsibility for the whole

Not the cry, but the flight of a wild duck, leads the flock to fly and follow.

Chinese Proverb

I don’t know the provenance of this quote. I got it from @mr_gadget on Twitter. But I like it.

We follow when

  • We see someone move in a way that is not easily reversed.
  • Others copy.

Our reasoning, if we could call it that, goes something like this.

  • Whatever they have noticed must be really important – well really dangerous.
  • So I had better run too.

The ‘reasoning’ sucks. This is what is happening.

  • We are startled and our startle response unleashes a wave of adrenalin or noradrenalin and we have an overwhelming impulse to run.
  • And so we run.

When we think about what we have just done, we justify our actions by saying that there might have been danger. Well, we justify our actions by what Daniel Kahneman calls anticipating our future remembering selves. We don’t want to look back and say we didn’t move when we should have done. And if we are wrong, we can easily justify ourselves to ourselves because other people were alarmed too.  So running when other people run checks the boxes for the future remembering self.

Reacting in panic is a bad idea; keeping cover is a good idea

But really, some people are volatile rather than observant. They might react in alarm to just about anything and run straight into the jaws of a lion.

Basic military training is geared-up to teaching us not to start running every time we get a fright. We can learn something from the foot soldier. Our job is not to scamper about wildly in all directions but to remain under cover where we won’t get shot at.

My more exuberant character chafes at the idea of taking cover. It smacks of fear and deprives me of what I like – wide open spaces with distant horizons. So let me develop that idea.

I am able to walk freely and joyfully in my wide open spaces, not because they are there – though that certainly helps.

I can walk in my fields because at a collective level we have institutions that keep us ‘under cover’. We have gun control (this is the UK not the US). We are relatively prosperous and you don’t get mugged (much) in the countryside. We have time (contrary to all the grumbling).

We have safe spaces and though we take them for granted, we keep them safe through collective action.

But can we be too safe?

Of course, people who have never lived in unsafe conditions might never develop any awareness of danger. They might even become rather silly and use their biological flight response for entertainment.  How can we design spaces so that we each have to do our fair share of being the proverbial sentry?  Can each of us ask “What lions and marauders do we look out for on behalf of the greater community?”

I think that is why children are given responsibilities early in life: to take out the trash, to feed the dog. Thinking ahead and thinking broadly – well thinking – is what they are practising. When they have to take out the trash because we are too lazy to do it – that is different – we are using them as servants and not developing them at all.

Create environments where people increasingly take responsibility for the group

Yup, I think I got it. We will react like birds given half a chance. Many of us are bored so we are fascinated by the idea of mobilizing people with as little effort as a cry or an irreversible action. This cannot be our goal. This is what relatively mindless birds do.

Our goal, or at least my goal, is to create environments where people share the responsibility for creating a safe space and we start taking on responsibility in an age-related way – taking full responsibility for an important task so that we learn to think and not simply react like an impulsive creature.  So we start to take out the trash and start to think about the business of keeping a place hygienic. And we move on and up, learning to weave many responsibilities together.

Good quote but a different conclusion! When the birds take off, I’ll sit tight. Rapid, panicky reactions are not what it is all about.


What is it like to study at uni and work with a famous Professor?

There are all sorts of jobs in this world that I call invisible jobs. You can walk along the High Street and not see them.  And indeed, sometimes you can see a job but not see what people do in the job.  When I come across good interviews or descriptions of hidden work, I grab them.

A few days ago, I came across this interview of Jennifer Widom, the head of computer science at Stanford  . .  to use British parlance. In American, Professor Widom is chair of the computer science department at  Stanford.

This interview is valuable in many respects.

  • The Professor is candid, without being forthright, about her career and her work-life balance.
  • She speaks with evident respect and affection for everyone around her, including students.
  • She describes the tacit knowledge (the how-to) of being a successful professor.
  • She is clear about the difference between a career in a university and a career in industry.
  • She nonetheless understands the connection between the two and how value moves from universities into industry and the working life of a nation.

I have written recently about the essence of university life. If you are thinking of going to university, you should read this article.  You will be going to learn from people like this, because they are like this.  Professor Widom’s description will help you understand what studying at a university is like and why you want active researchers as your teachers.

It’s an easy-to-read interview and valuable for people trying to write up what is a mostly invisible job.  Above all, we read the story of a Professor who is warm, generous and down to earth.



Learning curves and modelling in machine learning

In this post, I am going to describe what I have just learned from Andrew Ng at Stanford about “learning curves”.  To a computer scientist, a learning curve is not what you might expect: it describes how well data has been modeled.

I write this as a classically trained psychologist and it is clear that if we are to understand machine learning, we have to watch out for where the thinking of computer scientists differs radically from our own.  This is my commonsensical comparison of the two approaches.  I am writing it down to make sure I have followed what I heard.  It is rough and ready but may help you understand the differences between the two disciplines.

A learning curve in CS

Simply, the computer scientists take random samples of data where the first sample is very small, let’s say 1 because that is helpful to understanding the logic, and the last sample will be large, let’s say a few thousand.  These are random samples from the same large data set.

Generally, with a sample of 1 up to 3, we can model perfectly.  However, when we try the same model with another sample of the same size, the model will not predict well at all. The amounts of error for the experimental sample and the comparison sample will be hugely different.  So far so good. That’s what we all learned at uni.  Modelling on a small sample is the equivalent of an ‘anecdote’.  Whatever we observed may or may not transfer to other situations.

As we increase our sample size, paradoxically the amount of error in our model increases but the amount of error in our comparison situation decreases.  And ultimately, the error we are making in the two situations converges.  We also know this from uni.

Much of our training goes into getting us to do this and to increasing the sample size so that the error in the hypothetical model goes up, and the error in the comparison model goes down.  Plot this on a piece of paper with error on the y axis and sample size on the x axis.

When the two error rates converge, that is we can explain the future as well as we can explain the present, then we stop and say, “Hey, I have found a scientific law!”
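
To make the two curves concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the data are invented noisy points around the line y = 2x + 1, the model is a plain straight-line fit, and the function names are hypothetical. It fits the model on growing samples and records the error on the sample itself and on a separate comparison sample.

```python
import random

random.seed(0)

def make_data(n):
    """Noisy points around the (assumed) true line y = 2x + 1."""
    points = []
    for _ in range(n):
        x = random.uniform(0, 10)
        points.append((x, 2 * x + 1 + random.gauss(0, 1)))
    return points

def fit_line(points):
    """Ordinary least squares for y = a*x + b; a single point gets a flat line."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    a = sxy / sxx if sxx > 0 else 0.0
    return a, mean_y - a * mean_x

def mean_squared_error(model, points):
    a, b = model
    return sum((y - (a * x + b)) ** 2 for x, y in points) / len(points)

training_pool = make_data(500)
comparison_set = make_data(500)   # the "comparison" (cross-validation) sample

sizes = [1, 2, 5, 10, 50, 100, 500]
curve = []
for m in sizes:
    model = fit_line(training_pool[:m])
    curve.append((m,
                  mean_squared_error(model, training_pool[:m]),
                  mean_squared_error(model, comparison_set)))

for m, train_err, comparison_err in curve:
    print(f"m={m:4d}  training error={train_err:8.3f}  comparison error={comparison_err:8.3f}")
```

With one or two points the line passes through them exactly, so the training error is zero. By the largest sample both errors should sit close to the variance of the noise, which is 1 in this made-up data.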

I would say that our willingness to tolerate a more general description of a particular situation so that we can generalize at the same level of accuracy (and inaccuracy) to another situation is one of the hallmarks of uni training. This is so counter-intuitive that many people resist so it takes uni training to get us to do it.

What the computer scientists implicitly point out is that the converse is also true. We are now able to explain the future as badly as we explain the present!  They call this underfitting and suggest that we try another model to see if we can do a better job of explaining the present.  So we will stop increasing the sample size and start playing with the model. We can vary the form of the model, typically moving from a linear to a non-linear model (that is adding more features) and increasing the weights of the parameters (go from a loose floppy kind of model to a stiffer model, if you like).

They do this until the model overfits. That is, until our explanation of the present is very good but the same explanation produces errors in comparison situations.  When they reach this point, they backtrack to a less complicated model (fewer non-linear terms) and decrease the weights of the parameters (take note of a feature but not put too much emphasis on it.)

Once they have found this happy middle ground with a more complicated model, but without the expense of collecting more data, they will try it out on a completely new set of data.
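
A rough sketch of that search in Python, under loudly stated assumptions: the data are invented (a curved relationship y = x² plus noise), the "form of the model" is the degree of a polynomial, and all names (`fit_polynomial`, `penalty` and so on) are hypothetical. Each candidate is fitted on a training sample, the one with the lowest error on a comparison sample is kept, and only then is it tried on completely fresh data.

```python
import random

random.seed(1)

def make_data(n):
    """Noisy samples of an (assumed) curved relationship y = x^2 on [-1, 1]."""
    data = []
    for _ in range(n):
        x = random.uniform(-1, 1)
        data.append((x, x * x + random.gauss(0, 0.1)))
    return data

def solve(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        w[r] = (M[r][n] - sum(M[r][c] * w[c] for c in range(r + 1, n))) / M[r][r]
    return w

def fit_polynomial(data, degree, penalty=1e-6):
    """Least squares on the features 1, x, ..., x^degree, with a small
    penalty on large weights (which also keeps the equations solvable)."""
    k = degree + 1
    A = [[sum(x ** (i + j) for x, _ in data) + (penalty if i == j else 0.0)
          for j in range(k)] for i in range(k)]
    b = [sum(y * x ** i for x, y in data) for i in range(k)]
    return solve(A, b)

def error(weights, data):
    """Mean squared error of the polynomial on a data set."""
    return sum((y - sum(w * x ** i for i, w in enumerate(weights))) ** 2
               for x, y in data) / len(data)

train, validation, test = make_data(30), make_data(30), make_data(30)

results = []
for degree in range(0, 9):
    weights = fit_polynomial(train, degree)
    results.append((error(weights, validation), degree, weights))

best_error, best_degree, best_weights = min(results)
print("chosen degree:", best_degree)
print("error on fresh data:", error(best_weights, test))
```

Raising `penalty` is one way to rein the weights in, and it could be searched over in the same loop as the degree.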

Break with common practice in psychology

For any psychologists reading this

  • This kind of thinking provides us with a possibility of getting away from models that have been stagnant for decades.  Many of these models predict the present so-so and the future so-so.  Here is the opportunity to break away.
  • Note that machine learning specialists use procedures that look like statistics but abandon the central idea of statistics.  They aren’t promising that their original sample was randomly chosen and they aren’t directly interested in the assertion that “if and only if our original sample was random, then what we found in the sample generalizes to other samples that have also been chosen randomly”.  Though they do something similar (taking lots of randomly chosen slices of data from the data they have), they aren’t in the business of asserting the world will never change again.  They have high speed computers to crunch more data when it becomes clear that the world has changed (or that our model of the world is slightly off).
  • Many of the rules-of-thumb that we were once taught fall away. Specifically, get a large sample, keep the number of features below the size of the sample, keep the model simple – these prescriptions are not relevant once we change our starting point.  All we want to find is the model that can generalize from one situation to another with the least error, and high speed computers allow us both to use more complicated models and to recompute them when the world they describe changes.

I am still to see good working examples outside marketing on the one hand and robotics on the other, but it seemed worthwhile trying to describe the mental shift that a classically trained psychologist will go through.  Hope this helps.



Why be bothered with a university education?

The distinction between university education and other post-school education can be hard to grasp. Many emotional arguments are advanced. “We only need a handful of people who speak Latin” is one argument, for example, that came up in my Twitter stream.  Often, our argument expresses no more than the emotions we are experiencing as the world shifts about and what we do or have done seems more or less highly valued.

In an earlier post, I tried to list three features of university life which make university education worthwhile though hard to understand from the outside, and hard to get used to when you first arrive as a first year fresh from high school.

I am doing a uni course right now coming from the other direction.  I have a lot of hands-on experience in a field and I wanted to work backwards – so to speak – and formalize my knowledge.  I find the course fairly frustrating because I cannot always relate what I am hearing to a practical situation and some of  the practical exercises are simply better done with a combination of a crib sheet and some trial and error.

So I have to ask myself: why am I still there?  Why haven’t I transferred to a polytechnic-type college which would be better organized (from the students’ point of view), and where the lecturing would frankly be more coherent and the exercises better thought out?

So I’ve had to write down my thoughts (to get them out of my head) and they may be useful to you.

#1 Professors tell the story of abstractions and the failure of abstraction

The job of a professor is to look out on the world and to describe what is common across a whole set of similar situations.

When they do a good job, we can use their generalization like a formula.  I can convert Celsius to Fahrenheit, for example.

Or, in the case of my course, I can understand how to set up some tables and store them in a database in the most efficient way possible.

The difficulty comes when the generalization or abstraction

a)       Is already known in real life (there have been people making almanacs and look-up tables for generations)

b)       And it turns out the generalization solves some problems but not all (or creates a few side-effects).

The professors then go back to the ‘drawing board’ and try to solve the problem with their own abstraction that they just created with their solution!

This drives students crazy, particularly the more practically minded.  They don’t really want to know this long story of

  • Make this formula
  • Oh. Oooops!
  • Well make this formula.  It is better.
  • Oh. Ooops!

This is particularly annoying to students when they have the spoiler and know the current state of best practice (and are perhaps of slightly impatient temperament).

But this is what professors know about, and after all, there is not much point in asking them about things they don’t know about!

So the question becomes – shall I keep asking them, or shall I ask someone else?

# 2 Professors prepare you to manage the interface of new knowledge and reality

Well, let’s fast forward a bit to 10 or 20 years’ time when knowledge has advanced.  Of course, you can just go on another course.  And you probably will  go on another course to find out the new way of doing things.

But let’s imagine you are pretty important now and it is your job to decide whether to spend money on this new knowledge, spend money and time on the courses, and to decide whether or not to change work practice to use the new ideas.

Of course you can find out the new method in a course.  Of course, you can hire consultants to give you the best guess of whether your competitors will use the new knowledge and how much better than you they will be when they put it into play.

There is another question you must ask and answer even if you answer it partly by gut-feel.  You must anticipate what the professors have not answered.  What will be their Oh. Ooops!  Your judgement of the Oh. Ooops! tells you the hidden costs.  The company that judges those correctly is the company that wins.

Everyone will pick up the new knowledge.  That’s out there. Everyone and his dog will take the course and read the book.  What we will compete on is the sense of the side-effects. That thalidomide will be a disaster.  That going to war will increase attacks on us.  To take two well-known examples.

The professors won’t be articulate about the side-effects of their new solution. Not because they are irresponsible but because their heads are fully taken up figuring it out.

It is the leaders in charge of the interface between new knowledge and the real world who must take a reasonable view of the risks.  Just as in the banks, it is the Directors who are responsible for using technology that had unexpected side-effects.

When you are the Director, you want a good sense of the unknown unknowns and you develop that sense by listening to Professors. They tell you the story of how we found the general idea and then went Oh. Ooops!  The story can be irritating because it is mainly the story of cleaning up their own mess and sometimes the whole story is nothing more than Oh. Ooops! that ends with “Let’s give this up and start on another story”.

But as future leaders, students practise listening to experts at the edge of knowledge, relating the solutions to real world problems, and getting a good sense of the Oh. Ooops! that is about to come next!

#3 Uni education can feel complicated and annoying

That’s uni education.  Don’t expect it to be a movie that charms you and tickles your ego.  It is irritating.

But rather be irritated there than create a medical disaster, a ship that sinks or a financial system that collapses when you were in charge and jumped into things naively.

See you in class!



Back propagation for the seriously hands-on

I have just finished the Stanford back propagation exercise, and to put it mildly, it was a ****.

So is back propagation complicated?  And indeed what is it?

These are my notes so that I don’t have to go through all the pain when I do this again.  I am not an expert and the agreement with Stanford is that we don’t give away the answer particularly at the level of code.  So use with care and understand that this can’t tell you everything.  You need to follow some lecture notes too.

Starting from the top: What is back propagation?

Back propagation is a numerical algorithm that allows us to calculate an economical formula for predicting something.

I am going to stick to the example that Stanford uses because the world of robotics seems infinitely more useful than my customary field of psychology. Professor Ng uses an example of handwriting recognition much as the Royal Mail must use for reading postal codes.

We scan a whole lot of digits and save each digit as a row of 1’s and 0’s representing ink being present on any one of 400 (20×20) pixels.  Can you imagine it?

Other problems will always start the same way – with many cases or training examples, one to each row; and each example described by an extraordinarily large number of features. Here we have 400 features or columns of X.

The second set of necessary input data is one last column labeling the row.  If we are reading digits, this column will be made up of digits 0-9 (though 0 is written down as 10 for computing reasons).  The digit is still 0 in reality and if we reconstructed the digit by arranging the 400 pixels, it will still be seen to the human eye as 0.

The task is to learn a shorthand way for a computer to see a similar scan of 400 pixels and say, aha, that’s a 1, or that’s a 2 and so on.

Of course the computer will not be 100% accurate but it will get well over 95% correct as we will see.

So that is the input data: a big matrix with one example on each row, one feature in each column, and the last column being the correct value – the digit from (10, 1-9) in this case.
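
As a small illustration of that layout, here is a sketch in Python. It is not the course's code: the 4×4 scan is made up to keep things readable (the real scans are 20×20), and the helper names, including the `one_hot` coding of the label, are assumptions of mine.

```python
# A tiny, made-up 4x4 "scan" instead of 20x20, to keep the example readable.
scan = [
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
]

def flatten(grid):
    """Unroll a grid of pixels into one row of features."""
    return [pixel for row in grid for pixel in row]

def encode_label(digit):
    """The digit 0 is stored as 10; the digits 1-9 are unchanged."""
    return 10 if digit == 0 else digit

def one_hot(label, n_classes=10):
    """Turn a label 1..10 into ten 0/1 values, one per possible digit."""
    return [1 if label == k else 0 for k in range(1, n_classes + 1)]

X = [flatten(scan)]      # one row per training example
y = [encode_label(0)]    # the last column: this scan is a zero, stored as 10

print(len(X[0]), "features, label", y[0], "->", one_hot(y[0]))
```

With real data there would be 5000 such rows of 400 features each.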

How does back propagation work?

Back propagation programs work iteratively without any assumptions about statistics that we are used to in psych.

The computer boffins start by taking a wild guess of the importance of each pixel for a digit, and see what the computer would predict with those weights.  That is called the forward pass.

Then based on what the computer got right or wrong, they work backwards to adjust the weights or importance of each pixel for each digit.

And remembering that computers are pretty fast, the computer can buzz back and forth asking “how’s this?”.

After a set number of trials, it stops improving itself and tells us how well it can read the digits, i.e., compares its answers to the right answers in the last column of our input data.

What is a hidden layer?

Back propagation also has another neat trick.  Instead of using pixels to predict digits directly, it works with an intermediate or hidden layer.  So the pixels predict some units in the hidden layer and the hidden layer predicts the digits.  Choosing the number of units in the hidden layer is done by trying lots of versions (10 hidden units, 50 hidden units, etc) but I guess computer scientists can pick the range of the right answer as they get experienced with real world problems.

In this example, the solution worked with 25 hidden units.  That is, 400 pixels were used to make predictions about 25 units, which in turn predict which of 10 digits made the data.

The task of the computing scientist is to calculate the weights from the pixels to the hidden units and from the hidden units to the digits and then report the answer with a % of “training accuracy” – over 95%, for example.

Steps in back propagation

We have already covered the first four steps.

Step 1: Training data

Get lots of training data with one example on each row and lots of features for each example in the columns.

Make sure the row is labeled correctly in the last column.

Step 2:  Decide on the number of units in the hidden layer

Find out what other people have tried for similar problems and start there (that’s the limit of my knowledge so far).

Step 3: Initialize some weights

I said before that we start with a wild guess.  Actually we start with some tiny numbers, but the numbers are random.

We need one set of weights linking each pixel to each hidden unit (25 x 400)* and another set linking each hidden unit to each digit (10 x 25)*.

The asterisk means that a bias factor might be added in raising one or the other number by 1.  To keep things simple, I am not going to discuss the bias factor. I’ll just flag where it comes up.  Be careful with them though because I am tired and they might be wrong.
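Here is a rough sketch of Step 3 in plain Python. The function name random_weights and the range of 0.12 either side of zero are my assumptions for illustration; the text only says the starting numbers are tiny and random. The sketch includes the extra bias column, so the tables come out as 25 x 401 and 10 x 26.

```python
import random


def random_weights(n_out, n_in, epsilon=0.12, seed=None):
    """Return an n_out x (n_in + 1) table of small random weights.

    The extra column is for the bias factor. epsilon is an assumed
    range, not a value taken from the text.
    """
    rng = random.Random(seed)
    return [[rng.uniform(-epsilon, epsilon) for _ in range(n_in + 1)]
            for _ in range(n_out)]


theta1 = random_weights(25, 400, seed=0)  # pixels -> hidden units
theta2 = random_weights(10, 25, seed=1)   # hidden units -> digits

print(len(theta1), "x", len(theta1[0]))
print(len(theta2), "x", len(theta2[0]))
```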

Step 4: Calculate the first wildly inaccurate prediction of the digits

Use the input data and the weights to calculate initial values for the hidden layer.

Our input data of training examples and features (5000 examples by 400 pixels) is crossed with the appropriate initial random weights (25 x 400) to get a new matrix of hidden layer values.  Each training example will have 25 new values (5000 x 25)*.

Then repeat again from the hidden layer to the layer of digits or output layer making another matrix of 5000 x 10.

In the very last step, the calculated value is converted into a probability with the well-known sigmoid function.  It would be familiar if you saw it.  I’ll try to patch it in.
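For reference, the sigmoid function is 1 / (1 + e^(-z)): it squashes any number into the range 0 to 1. A small Python sketch follows; the two-branch form is just my way of avoiding overflow for very negative inputs.

```python
import math


def sigmoid(z):
    """Squash any real number into the range 0 to 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, rewrite the formula so exp() never overflows.
    e = math.exp(z)
    return e / (1.0 + e)


for z in (-5, 0, 5):
    print(z, round(sigmoid(z), 4))
```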

The values calculated at the hidden layer are converted into these probability-type values and they are used for the next step and the final answer is converted in the same way.

Now we have a probability type figure for each of 10 digits for each training example (5000 x 10)*.
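The forward pass for a single training example can be sketched as below. This is plain Python on a toy network (3 pixels, 2 hidden units, 2 outputs) with made-up weights; the names layer and forward are mine. The first weight in each row is the bias factor, which multiplies a constant input of 1.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer(inputs, weights):
    """One layer: put the bias input 1 in front, take weighted sums, squash."""
    with_bias = [1.0] + inputs
    return [sigmoid(sum(w * a for w, a in zip(row, with_bias)))
            for row in weights]


def forward(x, theta1, theta2):
    """Pixels -> hidden units -> output units for one training example."""
    hidden = layer(x, theta1)
    output = layer(hidden, theta2)
    return hidden, output


# Made-up weights: 2 hidden units x (bias + 3 pixels), 2 outputs x (bias + 2 hidden).
theta1 = [[0.1, -0.2, 0.05, 0.0], [0.0, 0.1, -0.1, 0.2]]
theta2 = [[0.05, 0.1, -0.1], [-0.05, 0.2, 0.1]]

hidden, output = forward([0.5, 0.2, 0.9], theta1, theta2)
print(hidden)
print(output)
```

With the real data the same two steps run with 400 pixels, 25 hidden units and 10 outputs for each of the 5000 rows.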

Step 5: Find out how well we are doing

In this step, we first convert the correct answer (which was a 1, or 5, or 7 or whatever the digit was) into 1’s and 0’s – so we have another matrix (5000 x10).

We compare this with the one we calculated in Step 4 using simple subtraction and make yet another matrix (5000 x 10).
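A sketch of Step 5 for one row, in plain Python. The function names one_hot and output_error are mine; the subtraction is predicted minus correct, which is an assumption about the direction since the text only says "simple subtraction".

```python
def one_hot(label, n_classes=10):
    """Turn a label 1..n_classes into a row of 0's with a single 1."""
    return [1 if k == label else 0 for k in range(1, n_classes + 1)]


def output_error(predicted, label):
    """Predicted probabilities minus the row of 1's and 0's."""
    target = one_hot(label, len(predicted))
    return [p - t for p, t in zip(predicted, target)]


print(one_hot(5))
print(output_error([0.2, 0.7, 0.1], 2))
```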

Step 6:  The backward pass begins

So far so good.  All pretty commonsensical. The fun starts when we have to find a way to adjust those guessed weights that we used at the start.

Staying at a commonsensical level, we will take the error that we have in that big 5000 x 10 matrix calculated in Step 5 and partition it up so we can ‘track’ the error back to training examples and hidden units and then from hidden units to pixels. And this is what the computing scientists do.

They take one training example at a time (one of the 5000 rows), pick out the error for digit 1, and break it up.  And do it again for digit 2 up to digit 0 (which we input as 10).

Step 7: Working with one training example at a time

It might seem odd to work with one training example at a time, and I suspect that is just a convenience for noobs, but stick with the program.  If you don’t, life gets so complicated, you will feel like giving up.

So take example one, which is row 1, and do the stuff. And repeat for row 2, and so on until you are done.

In computing this is done with a loop: for i = 1:m where m is the number of training examples or rows (5000 in our case).  The machine is happy doing the same thing 5000 times.

So we do everything we did before this step but we start by extracting our row of features:  our X or training data now has 1 row and 400 features (1 x 400)*.

And we still have one label, or correct answer, but remember we will turn that into a row of 1’s and 0’s.  So if the right answer is 5, the row will be 0000100000 (1 x 10).

And we can recalculate our error, or uplift the right row from the matrix of errors that we calculated in Step 5.  The errors at the ‘output_layer’ will be a row of ten numbers (1 x 10).  They can be positive or negative and their size will be less than 1.

Step 8: Now we have to figure out the error in the hidden layer

So we know our starting point of pixels (those never get changed), the correct label (never gets changed) and the error that we calculated for this particular forward pass or iteration.  After we adjust the weights and make another forward pass, our errors change of course and hopefully get smaller.

We now want to work on the hidden layer, which of course is hidden. Actually it doesn’t exist.  It is a mathematical convenience to set up this temporary “tab”.  Nonetheless, we want to partition the errors we saw at the output layer back to the units in the hidden layer (25 in our case)*.

Just like we had at the output layer, where we had one row of errors (1 x 10), we now want a row or column of errors for the hidden layer (1 x25  or 25 x 1)*.

We work out this error by taking the weights we used in the forward pass and multiplying by the observed error and weighting again by another probabilistic value.  This wasn’t explained all that well. I’ve seen other explanations and it makes intuitive sense.  I suspect our version is something to do with computing.

So here goes.  To take the error for hidden layer unit 1, we take the ten weights that we had linking that hidden unit to each digit.  Or we can take the matrix of weights (10 x 25)* and match them against the row of observed errors (1 x 10).  To do this with matrix algebra, we turn the first matrix on its side (25 x 10) and the second on its side (10 x 1), and the computer will not only multiply, it will add up as well, giving us one column of errors (25 x 1).*   Actually we must weight each of these by the probabilistic type function that we called sigmoidGradient.

We put into sigmoidGradient a row for the training example that was calculated earlier on as the original data times the weights between the pixels and the hidden layer ((5000 x 400*)  times  (25 x 400*))– the latter is tipped on its side to perform matrix algebra and produce a matrix of 25* values for each training example (5000 x 25*).

Picking up the column of data that we calculated one paragraph up, we now have two columns (25* x 1) which we multiply (in matrix algebra .* so we can do multiplication of columns like we do in Excel).

Now we have a column of errors for the hidden layer for this one particular training example (25* x 1).  (Our errors at the output layer for this example were in a row (1 x 10).)
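Step 8 can be sketched for one training example as below, in plain Python with loops instead of matrix algebra. The function name hidden_error and the toy numbers are mine. I am assuming, as the text describes, that sigmoidGradient is fed the weighted sums for the hidden units before they are squashed, and that the bias column of the weights is skipped.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_gradient(z):
    """Slope of the sigmoid at z: s * (1 - s)."""
    s = sigmoid(z)
    return s * (1.0 - s)


def hidden_error(theta2, output_err, z_hidden):
    """Share the output errors back to the hidden units.

    theta2: one row per output unit; column 0 is the bias, then one
            column per hidden unit.
    output_err: one error per output unit.
    z_hidden: weighted sums at the hidden units (before the sigmoid).
    """
    errors = []
    for j, z in enumerate(z_hidden):
        # Multiply each output error by the weight linking it to hidden
        # unit j and add up (this is the "turn it on its side" product).
        back = sum(theta2[k][j + 1] * output_err[k]
                   for k in range(len(output_err)))
        # Weight by the sigmoid gradient (the .* step).
        errors.append(back * sigmoid_gradient(z))
    return errors


theta2 = [[0.0, 1.0, -1.0], [0.0, 2.0, 0.5]]  # 2 outputs x (bias + 2 hidden)
print(hidden_error(theta2, [0.5, -0.25], [0.0, 0.0]))
```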

Step 9: Figure out how much to adjust the weights

Now we know how much error is in the output layer and the hidden layer, we can work on adjusting the weights.

Remember we have two sets of weights.  Between the output and hidden layer we had (10 x 25*) and between the input layer and the hidden layer, we had (25 x 400*). We deal with each set of weights separately.

Taking the smaller one first (for no particular reason but that we start somewhere), we weight the values of the hidden layer with the amount of error in the output layer.  Disoriented?  I was.  Let’s look again at what we did before.  Before, we used the errors in the output layer to weight the weights between output and hidden layer and we weighted that with a probabilistic version of input data times the weights coming between input and hidden layers.  That seemingly complicated calculation produced a set of errors – one for each hidden unit – just for this training example because we are still working with just one row of data (see Step 8).

Now we are doing something similar but not the same at all. We take the same differences from the output layer (1 x10) and use them to weight the values of the hidden layer that we calculated on the forward pass (1×25*).  This produces (and this is important) a matrix that will have the same proportions as the weights between the hidden and output layer.  So if we have 10 output possibilities (as we do) and 25* units in the hidden layer, then at this stage we are calculating a 10 x 25* matrix.

So for each training example (original row), we have 250 little error scores, one for each combination of output and hidden units (in this case 10×25*).

Eventually we want to find the average of these little errors over all our training examples (all 5000), so we whisk this data out of the for loop into another matrix.  As good programmers, we set this up before and filled it up with zeros (before the for loop started).  As we loop over training examples, we just add in the numbers and we get a total of errors over all training examples (5000) for each of the combos of hidden unit and output unit (10 x25*).
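The add-it-up-over-the-loop idea can be sketched as below in plain Python. The toy sizes (2 outputs, 3 hidden units) and the names outer and add_into are mine, and the error and hidden values are invented. The hidden values get a leading 1 for the bias factor, which is why the table has one extra column.

```python
def outer(col, row):
    """Every value in col times every value in row: len(col) x len(row)."""
    return [[c * r for r in row] for c in col]


def add_into(total, update):
    """Add update into total, cell by cell."""
    for i, update_row in enumerate(update):
        for j, value in enumerate(update_row):
            total[i][j] += value


n_out, n_hidden = 2, 3

# Set up the running total before the loop, filled with zeros.
total_errors = [[0.0] * (n_hidden + 1) for _ in range(n_out)]

# Invented (output errors, hidden values) for two training examples.
examples = [
    ([0.2, -0.1], [0.5, 0.6, 0.4]),
    ([-0.3, 0.4], [0.1, 0.9, 0.5]),
]

for output_err, hidden in examples:
    add_into(total_errors, outer(output_err, [1.0] + hidden))

print(len(total_errors), "x", len(total_errors[0]))
```

The same pattern, with the hidden-layer errors and the row of pixels, gives the running total for the other table of weights.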

And doing it again

We have a set of errors now for the connections between hidden and output layers. We need to do this again for the connections between the input layer and the hidden layer.

We already have the errors for the hidden layer (25* x1) (see Step 8).  We use these to weight the input values (or maybe we should think of that the other way round – we use the input values to weight the differences).

We take the errors for the hidden layer (25 x 1) and multiply by the row of original data (1 x 400*) and we will get a matrix of (25 x 400*) – just like our table of weights!  You might notice I did not put an asterisk on the 25 x 1 matrix.  This is deliberate.  At this point, we take out the bias factor that we put in before.

We do the same trick of storing the matrix of error scores (25 x 400*) in a blank matrix that we set up earlier and then adding the scores for the next training example, and then the next as we loop through all 5000.

Step 10: Moving on

Now we have got what we want: two matrices, exactly the same size as the matrices for the weights ( 25 x 400* and 10 x 25*).  Inside these matrices are the errors added up over all training examples (5000).

To get the average, we just have to divide by the number of training examples (5000 in this case). In matrix algebra we just say – see that matrix? Divide every cell by m (the number of training examples). Done.
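In plain Python, without a matrix library, "divide every cell by m" is a one-liner. The function name average and the numbers are made up for illustration.

```python
def average(total, m):
    """Divide every cell of the table by m, the number of training examples."""
    return [[value / m for value in row] for row in total]


totals = [[10.0, 5.0], [0.0, -5.0]]
print(average(totals, 5))
```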

These matrices – one 25 x 400* and the other 10 x 25* are then used to calculate new tables of weights.  And we rinse and repeat.

  1. Forward pass : make a new set of predictions
  2. Back propagation as I described above.
  3. Get two matrices of errors: yay!
  4. Recalculate weights.
  5. Stop when we have done enough.

The next questions are how are the weights recalculated and how do we know if we have done enough?

Recalculating weights

The code for the back propagation algorithm is contained within a function that has two purposes:

  • To calculate the cost of a set of weights (average error in predictions if you like)
  • And the matrices that we calculated to change the weights (also called gradients).

The program works in this order

  • Some random weights
  • Set up the step-size for learning (little or big guesses up or down) and number of iterations (forward/backward passes)
  • Call a specialized function for ‘advanced optimization’ – we could write a kludgy one but this is the one we are using
  • The advanced optimizer calls our function.
  • And then performs its own magic to update the weights.
  • We get called again, do our thing, rinse and repeat.
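The division of labour in the list above can be sketched as follows. This is not the advanced optimizer the program uses: I have swapped in plain gradient descent, the simplest possible updater, and a toy cost function stands in for the back propagation function. All names here are mine. What matters is the shape: our function returns a cost and gradients, and the optimizer calls it over and over.

```python
def cost_and_gradient(weights):
    """Toy stand-in for the back propagation function.

    Returns (cost, gradient). The cost is the squared distance from
    the made-up targets 3 and -1, so the best weights are [3, -1].
    """
    targets = [3.0, -1.0]
    cost = sum((w - t) ** 2 for w, t in zip(weights, targets))
    gradient = [2.0 * (w - t) for w, t in zip(weights, targets)]
    return cost, gradient


def gradient_descent(fn, weights, step_size, iterations):
    """Plain gradient descent: call fn, nudge the weights, repeat."""
    history = []
    for _ in range(iterations):
        cost, gradient = fn(weights)       # the optimizer calls our function
        history.append(cost)
        weights = [w - step_size * g       # ... and updates the weights
                   for w, g in zip(weights, gradient)]
    return weights, history


weights, history = gradient_descent(cost_and_gradient, [0.0, 0.0], 0.1, 50)
print(weights)
print(history[0], history[-1])
```

The cost should fall from one iteration to the next; if it does not, the step-size is probably too big.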

How do we know we have done enough?

Mainly the program will stop at the number of iterations we have set.  Then it works out the error rate at that point – how many digits are we getting right and how many not.

Oddly, we don’t want 100% because that would probably just mean we are picking up something quirky about our data.  Mine eventually ran at around 98% meaning there is still human work and management of error to do if we are machine reading postal codes.  At least that is what I am assuming.


There you have it.  The outline of the back propagation.  I haven’t taken into account the bias factor but I have stressed the size of the matrices all the way through, because if there is one thing I have learned, that’s how the computing guys make sure they aren’t getting muddled up.  So we should too.

So now I will go through and add an * where the bias factor would come into play.

Hope this helps.  I hope it helps me when I try to do this again.  Good luck!

The regularization parameter

Ah, nearly forgot – the regularization parameter.  Those values – those little bits of error in the two matrices that are the same size as the weights – (25×400*) and (10×25*)?

Each cell in the matrix, except for the first column in each which represents the bias factor, must be adjusted slightly by a regularization parameter before we are done and hand the matrices over to the bigger program.

The formula is pretty simple.  It is just the theta value (the weight) for that cell multiplied by the regularization parameter (lambda, set in the main program) and divided by the number of training cases, and the result is added to the cell.  Each of the two matrices is adjusted separately.  A relatively trivial bit of arithmetic.
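A sketch of that adjustment in plain Python. The function name regularize and the parameter name lam (for the parameter set in the main program) are mine. I am assuming the adjustment is added to the averaged error for each cell and that the first column, the bias factor, is left alone as described above.

```python
def regularize(gradient, theta, lam, m):
    """Add (lam / m) * theta to every cell except the bias column.

    gradient: averaged errors, same shape as theta.
    theta: the current weights.
    lam: the parameter set in the main program (name assumed).
    m: number of training cases.
    """
    result = []
    for g_row, t_row in zip(gradient, theta):
        new_row = [g_row[0]]  # bias column: left as it is
        new_row += [g + (lam / m) * t
                    for g, t in zip(g_row[1:], t_row[1:])]
        result.append(new_row)
    return result


gradient = [[0.1, 0.2, 0.3]]
theta = [[5.0, 1.0, -2.0]]
print(regularize(gradient, theta, lam=1.0, m=10))
```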



Need to practice first order logic?

I found this first order logic exercise on Wolfram.

#1 Download Wolfram’s CDF player

-1 The download on their site did not work for me, so I downloaded here from Softpedia.

-2 You will download an .exe file. When it arrives on your personal computer, simply click on the link and it will install as a Program.  It takes a little time to install.  Big beastie to allow you to view interactive documents.

#2 Now read whatever you want on Wolfram’s Demonstrations

-1 Find the demonstration that interests you.  In this case, try this demo for practicing first order logic, also known as predicate calculus.

-2 Click on “Download Demonstration as CDF” at top right and it should open.  If not, try firing up Wolfram first from your Start/Programs.

#3 Practice your first order logic

-1 Choose how many objects to play with,

-2 Start at equation number 1.

-3 Move objects around to change the truth value from true to false and vice versa.


It won’t do your homework for you but it might take the edge off the confusion.



Check your propositional logic with a truth table generator

The Seventh Day Adventist University website has a truth table generator for checking propositional logic.  Instructions for inputting propositional logic symbols are on its page.

My host’s wordpress is borked: so here is the link

#1 Check you understand each part of the assertion

Basically, you can check that you are using the basic truth table for simple assertions like (A and B).

#2 Generate a truth table for multiple assertions

And you can combine simple assertions to generate a truth table


I am not an expert in this, but I am assuming that if a bundle of assertions is always true, whatever the starting values that we put into the bundle, then the bundle resolves to true.

Correspondingly, if the assertions come out as false, no matter what the starting values are, the bundle resolves to be false.

And if the bundle contains a mix of true and false, we are left uncertain what will happen.
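The three cases can be checked with a few lines of Python that build the truth table by brute force. This is only a sketch and has nothing to do with how the online generator works; the function name classify is mine. As far as I know, logicians call the three cases a tautology, a contradiction and a contingency.

```python
from itertools import product


def classify(assertion, names):
    """Try every combination of True/False for the named variables."""
    results = [assertion(**dict(zip(names, values)))
               for values in product([True, False], repeat=len(names))]
    if all(results):
        return "always true"
    if not any(results):
        return "always false"
    return "mixed"


print(classify(lambda A: A or not A, ["A"]))
print(classify(lambda A: A and not A, ["A"]))
print(classify(lambda A, B: A and B, ["A", "B"]))
```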

Any thoughts?

