# Tag: machine learning

This post is a step-by-step, practical, guide to principal components analysis.  It’s very hands-on and “common sensical”.  If any experts out there spot an egregious error that would horribly mislead a beginner, please do let me know.

I’ll simply work through 4 steps and then sum up as 5 steps in a slightly different order.

### #1 Data

I always like to start any data problem by thinking about my data rather concretely.

In a PCA problem, imagine a spreadsheet which has hundreds or even thousands of columns – far too many for comfort.

On each row of the spreadsheet is a case, or ‘training example’ in machine learning parlance.

What we want to do is to find the columns that matter.  Alternatively, we ask “which columns could we bundle together into computed columns so that we have a more manageable number?”

In short, this procedure will tell us which columns we can leave out and which one’s we should bundle together.  The bundles will be the principal components.  And the final answer will tell us how much precision we have lost by bundling scores rather than using original raw data.

So, we begin with data which is laid out in matrix X with m rows and n columns (m x n).

### #2 Normalize the data

The data comes in all sizes – little numbers and big numbers, very spread out and bunched together.  First we smooth the data in the same way that tests are normed at college.  Simply, we convert each column to a mean of zero and a standard deviation of one.

To be very clear how this works, we take each cell and adjust the number in the cell depending on the other numbers in the column. If we were working with spreadsheets, we would open another spreadsheet of exactly the same number of rows and columns and add this formula to each cell. So for cell A1 we would have:

= (Sheet1!A1 – mean(ColumnA))/StdDev(ColA)

When we calculate the mean and stddev of the columns in the new spreadsheet, they will all be 0 and 1 respectively.

### #3 Principal Component Analysis

Now we find the ‘bundles’ of columns.

In my old days of statistics packages, the program would return a table which listed all the columns down the page and then produced factor loadings or weights for a whole heap of factors (or bundles) in more columns.  The number and the sign would tell you the weight of the original data column in the new ‘bundle’.  The weights varied from -1 through 0 to +1.

In Octave, the free version of Matlab, there is a facility to do PCA in two steps:

#### Step #3 Part One

• Compute what is called the covariance matrix.  Simple imagine taking  a copy of the spreadsheet (the second one), multiplying it cell to cell (A1 to A1) and taking the sum of those squares in the new column A, then A1 to B1 and taking the sum of the product in column B, then A1 to C1.. etc until we have new row with N columns each got by multiplying two columns and adding up the product. You’ll have to try it yourself.  I’ll have to get out pen and paper when I read this a year from now.
• Then we do the same starting with Col B and Col A (that’s a repeat, I know.. stick it in), B to B, B to C  etc.
• Until we have a new matrix with N columns and N rows.  (Yes – this is what computers are for).
• And one more sub- step – divide every cell by the original number of cases or training examples (i.e., rows in the very first spreadsheet).

That’s the covariance matrix.  In Octave, which uses linear algebra, it is much easier.  You just tell the machine to multiply the transpose of the normalized data by the normalized data and divide by m – one line of code.

CovarianceMatrix = (X’ * X )/m

(That’s what computers are for!.. the explanation was just so you have a concrete idea of where the data came from and what happened to it).

#### Step #3 Part One

The second step in PCA is to find the bundles using a function that is built into Octave called the ‘singular value decomposition’ or SVD.

All you do is ask for it and it ‘returns’ three matrices, U, S and V and we are going to use U and S.

U gives us a matrix exactly the same size as the covariance matrix.  Each column now refers to a ‘bundle’. The rows still refer as before to the features (that is the original columns in the data matrix and the normalized data matrix. Have a quick check.  I’ll wait!).

Note we have as many bundles as we had columns right at the start but we don’t want to use all the unbundles (columns in the U matrix) otherwise we will have exactly the same number of columns as when we started – no point, hey?

So we will only use as many, starting from the left, as we need.  We will decide how many to use on the basis of the S matrix.

Let’s do that later and  consider further what U actually tells us.  As we read down column one, cell A1 tells us the loading of original column A, or the 1st feature, on the new bundle 1.  Cell A2 tells us the loading of original column B or the 2nd feature, on new bundle 1. And so on.  Got it?  Try it out.

So what can we do with this?  Two things –

• We can see which of our original columns were the most important.  They are the ones with the biggest numbers in column on the left and subsequent columns as you move right.   A positive number means the higher the original number the higher would be the bundle score. A negative number in this new table means the higher the number in the original table, the lower would be the bundle score.  Good to know if two of our original columns pull in  the opposite directions. So that is the first use – to understand the original columns and how they hang together.
• The second use is to create a simplified data set.  Ok, we hate it when bureaucrats create scores for us – like a credit rating. But imagine the rows are just pictures and the columns were the pixels or colors at 10 000 points on page – collapsing 10 000 columns into 1000 columns or 100 columns could be handy for data compression.  (We’ll work out later how much data is lost – or how much blur is added!)  So how do we recreate the scores?  We will come back to this – lets stick with understanding what those numbers in the U matrix mean. All we have to do to get a score for the bundle  is take the number in the U matrix for original column A (now in row 1) and multiple it by the score for the first case in column A  (do it on a bit of paper).  Do that for the whole row for the case times the whole column in U (row of original data times column in the U matrix), add it up, and we get a ‘bundle’ score for the first case.  That will go in cell A1 in a new table. The cell is the score for case 1 on bundle 1.
• Then we can do the same for the second case, then the third.  And we will make a new column of bundled scores for the whole data set.
• Then we do the same for the second bundle (it’s OK – that’s what computers are for).
• Finally we have a matrix with as many rows as we have cases or training examples and as many columns as we have new bundles.  This can be useful when we need compressed data and we don’t mind a bit of blur in the data.

### #4 How many bundles do we need?

So now we come back to how many bundles do we need?  Well firstly, a lot fewer than the number of columns that we started with. That’s the whole idea of this exercise – to get that original spreadsheet a lot, lot smaller.

I mentioned before that we use the data for the second matrix, S, that is churned out by the SVD function in Octave, to work out how many bundles to keep.

This S matrix is also exactly the same size as the covariance matrix which was square with the same number of rows and columns as we had columns in the first, first, first data table.

This time, though, we only have data in the diagonal from top left to bottom right.  Every other cell is zero.  So that means there is a number for original row 1 and column A; row 2 and column B; etc.  Gee, couldn’t we have a column?  Yes we could. It’s laid out this way to do with the way machines do arithmetic. It is easier for the machine to pull out the matching diagonal from the U matrix for example.  But that’s not our problem right now.  We just want to know how to use these numbers to work out how many bundles to keep.

Well, these numbers represent how much variance is explained by the bundle.  The very first number (top left) tells us how much variance in the whole original data set is explained by the new bundle.  To work out what % of variance is accounted for by this bundle, we take all the numbers on the diagonal and add them up to give us a number representing all the variance in the whole data set.  Then we take the number for the first bundle (top left) and work it out as a percentage of the whole lot. If the percentage is less than 99% (.99), then we add another bundle (well we add the percentage of another bundle or we add the numbers of the two bundles and divide by sum for all the numbers).  We just keep going until we have enough bundles to account for 99% of the original variance.  (So in plain terms, we have allowed for 1% of blurring).

Oddly, only 1% of blurring might allow us to lose a lot of columns.  Or more precisely, when we compute new scores, one for each bundle in the final solution, we might have far fewer bundles than original number of columns, but still account for 99% of the original amount of detail.

### That’s it…that’s PCA.

#1 Get your data in a spread sheet (cases on rows, features in columns)

#2 Normalize the data so each column has a mean of 0 and an SD of 1

#3 Use a built-in function to return a matrix of eigenvectors (U) and variance (S)

#4 Decide how many ‘bundles’ of features to keep in (account for 99% of variance)

#5 Compute new scores – one score for each case for each bundle (now the columns)

And what do we do with this?

#1 We understand how columns hang together and what we can drop and only lose 1% of detail (or add 1% of blur)

#2 We can use the new scores to do other things like feed them into a prediction program or supervised learning program. The advantage is not to improve on prediction, btw,  but to cut down on computing costs.

That’s all!

### CHECK OUT SIMILAR POSTS

Here are the ten broad steps to build a spam catcher

1. Get a sample of emails that are known to be spam or not spam. Split the sample 60:20:20 to provide a “training” set, a “cross-validation” set and a “test” set.
2. Turn each email into a list of words by
• Stripping out headers (if not part of the spam test) and other redundancies
• Running NLP software to record the stem of a word only (for example, record city and cities as cit)
3. Count the number of times each unique word appears in the sample and order the list so that we can use the top 100 or 10 000 or 50 000 (whatever) to check for spam.  Remember to use stemmed words!
4. Convert the list of words in each email into a list of look-up numbers by substituting the row number of the word from the dictionary we made in Step 3.
5. For each email, make another list where row 1 is 1 if the first word in the dictionary is present in the email, where row 2 is the 1 if the second word in the dictionary is present in the email. If the word is not present, leave the value for that row as zero. You should now have as many lists are you have emails each with as many rows as you have words in your spam dictionary.
6. Run a SVM algorithm to predict whether each email is spam (1) or not spam (0).  The input is the list of 1s and 0s indicating which words are present in the email.
7. Compare the predictions with the know values and compute the percentage correct.
8. Compute the predictions on the cross-validation set and tweak the algorithm depending on whether the cross-validation accuracy is too similar to the training accuracy (suggesting the model could be stronger) or too dissimilar (suggesting the model is too strong).
9. Find the words most associated with spam.
10. Repeat as required.

### CHECK OUT SIMILAR POSTS

In this post, I am going to describe what I have just learned from Andrew Ng at Stanford about “learning curves”.  To computer scientist, a learning curve is what you might expect but describes how well data has been modeled.

I write this as a classically trained psychologist and it is clear that if we are to understand machine learning, we have to watch out for where the thinking of computer scientists differs radically from our own.  This is my commonsensical comparison of the two approaches.  I am writing it down to make sure I have followed what I heard.  It is rough and ready but may help you understand the differences between the two disciplines.

### A learning curve in CS

Simply, the CStists take random samples of data where the first sample is very small, let’s say 1 because that is helpful to understanding the logic, and the last sample will be large, let’s say a few thousand.  This is random samples from the same large data set.

Generally, with a sample of 1 up to 3, we can model perfectly.  However, when we try the same model with another sample of the same size, the model will not predict well at all. The amounts of error for the experimental sample and the comparison sample will be hugely different.  So far so good. That’s what we all learned at uni.  Modelling on a small sample is the equivalent of an ‘anecodote’.  Whatever we observed may or may not transfer to other situations.

As we increase our sample size, paradoxically the amount of error in our model increases but the amount of error in our comparison situation decreases.  And ultimately, the error we are making in the two situations converges.  We also know this from uni.

Much of our training goes into getting us to do this and to increasing the sample size so that the error in the hypothetical model goes up, and the error in the comparison model goes down.  Plot this on a piece of paper with error on the y axis and sample size on the x axis.

When the two error rates converge, that is we can explain the future as well as we can explain the present, then we stop and say, “Hey, I have found a scientific law!”

I would say that our willingness to tolerate a more general description of a particular situation so that we can generalize at the same level of accuracy (and inaccuracy) to another situation is one of the hallmarks of uni training. This is so counter-intuitive that many people resist so it takes uni training to get us to do it.

What the computer scientists implicitly point out is that the converse is also true. We are now able to explain the future as badly as we explain the present!  They call this underfitting and suggest that we try another model to see if we can do a better job of explaining the present.  So we will stop increasing the sample size and start playing with the model. We can vary the form of the model, typically moving from a linear to a non-linear model (that is adding more features) and increasing the weights of the parameters (go from a loose floppy kind of model to a stiffer model, if you like).

They do this until the model overfits. That is, until our explanation of the present is very good but the same explanation produces errors in comparison situations.  When they reach this point, they backtrack to a less complicated model (fewer non-linear terms) and decrease the weights of the parameters (take note of a feature but not put too much emphasis on it.)

Once they have found this happy middle ground with a more complicated model, but without the expense of collecting more data, they will try it out on a completely new set of data.

### Break with common practice in psychology

• This kind of thinking provides us with a possibility of getting away from models that have been stagnant for decades.  Many of these models predict the present so-so and the future so-so.  Here is the opportunity to break away.
• Note that machine learning specialists use procedures that look like statistics but abandon the central idea of statistics.  They aren’t promising that their original sample was randomly chosen and they aren’t directly interested in the assertion that “if and only if our original sample was random, then what we found in the sample generalizes to other samples that have also been chosen randomly”.  Though they do something similar (taking lots of randomly chosen slices of data from the data they have), they aren’t in the business of asserting the world will never change again.  They have high speed computers to crunch more data when it becomes clear that the world has changed (or that our model of the world is slightly off).
• Many of the rules-of-thumb that we were once taught fall away. Specifically, get a large sample, keep the number of features below the size of the sample, keep the model simple – these prescriptions are not relevant once we change our starting point.  All we want to find is the model that can generalize from one situation to another with the least error and high speed computers allow us both to use more complicated models and recomputed them when the world they described changes.

I am still to see good working examples outside marketing on the one hand and robotics on the other, but it seemed worth while trying to describe the mental shift that a classically trained psychologist will go through.  Hope this helps

### What is a neural network?

A neural network takes a set of binary (0/1) signals, groups them into a smaller set of hidden units, which are used to work out the probability of something happening.

### An example of a neural network?

The example used by Andrew Ng in the Stanford course

• Converts the image of a handwritten number into 20×20 = 400 pixels, i.e. a row of 400 1’s or 0’s
• The backward propagator works out how to group the 400 columns of pixels into 25 units (which are ‘hidden’ because the end user doesn’t need to know about them)
• And then the backward propagator does its magic again to work out weights to match combinations of 25 units onto the ten possibilities of a digit (0-9).

#### Forward propgation

The forward propagator takes a handwritten number, or rather the row of 400 1’s and 0’s representing the 20×20 pixels for a number, and runs the calculation forwards.  400 1’s and 0’s are multiplied by the weights matching a column to the 25 hidden weights. And 25 new columns are generated.  Each image represented by a row,now has 25 numbers.

The process is repeated with the weights from the 25 columns to the 10 digits and each image now has a probability for each of the 10 digits.   The biggest probability wins!  We have taken a list of pixels and stated what a human thought they were writing down!

#### Training accuracy

And of course, if we knew what the digit really was, as we do in a ‘training set’ of data, then we can compare the real number with the one that the machine worked out from the set of pixels.  The program run for Stanford students is 97.5% accurate.

#### Waiting for backward propagation

The real interest is in the backward propagator, of course.  Just how do they work out that there should be 25 units in the hidden layer and how do they work out the weights between the hidden layer and the output layer.

### Machine learning vs psychology

In psychology, we have traditionally found the hidden layer with factor analysis or principal components analysis.  We take your scores an intelligence test, for example.  That is simply a row of 1’s and 0’s!  We factor analyse the 1’s and 0’s (for you and hundreds of other people) and arrive at a hidden layer.  And from there we predict an outer layer.

We usually tighten up the inputs to reflect the hidden layer as closely as possible – that is we improve our tests so that 30/50 is meaningful.  And our outer layer is often continuous – that is we predict a range of outcomes which we later carve up into classes.  We might predict your A level exam results by % and then break them into A, B, C, etc.

So it is with great interest that I await the backward propagation.  I am also more interested in unsupervised machine learning which I suspect reflects real world conditions of shifting sands a lot more.

### CHECK OUT SIMILAR POSTS

This post provides a birds’ eye view of how to calculate linear regression using the numerical programming used by machine-learning people.  It is, of course, easier to do the linear regression in a statistics program but it is good to know and the overall structure probably provide a foundation for other machine-learning programs.

The algorithm works with Octave which is like a free version of MatLab.

I’ve also only given the birds’eye view because the code is part of the Machine Learning course at Stanford and writing it is part of the homework – if we release it, students’ won’t have to write it themselves and they won’t learn how anything.

## Example of linear regression

Imagine you want to buy a second hand car and you collect information on prices for a particularly model with the age and mileage of the vehicle.

The following can compute an equation to predict the price of the car from either the age of the vehicle or the mileage.  Using both age and mileage is a multivariate problem which has different algorithms.

## #1 Prepare data

~1 Normally, we would input the data into a table in Excel with the first column being age (or mileage) of the vehicle and the second column being price.

~2 Then we would save the file as a csv (comma separated values) text file which we will call for now mydata.txt

## #2 Load the data into Octave

~1 Start Octave from your list of Start/Programs

~2 Tell Octave where your data is stored.  For example, cd ‘c:usersmedata’

~3 Confirm you are in the correct directory by typing ‘pwd’

~4 Read in the data using these three commands

X= data(:,1); % reads all the rows of the first column of the data (age in our example) into matrix X

y = data(:, 2); % reads all the rows of the second column of the data (prices in our example) into vector y

m=length(y); % counts the number of training examples, or rows (age, price pairs in our example)

## #3 Visualize the data using plot

~1 Find out what the data looks like (inspect the data) by  using the plot function

~2 This was part of the homework so let’s say that the commands for a plot were put in separate file which is called with a simple command plot(X, y)

~3 Calls to a function can be made from the command line within Octave or another function.

## #4 Pick out some data to act as a test you have everything correct

~1 Look at the graph and pick two values of X (age of the vehicle in our example)

~2 Estimate by sight the predicted value of Y (the price of the vehicle in our example)

Hang on to these. You will need them at the end!

## #5 Set the ‘settings’ for the gradient descent

~1 Set the number of iterations and type these in at the command line or put them in another function.  For example, iterations=1500

~2 Set the learning rate.  For example, alpha = 0.01

~3 Set up the matrix of theta values (that is, the y intercept and the gradient of the graph. If price= a +b(age), the two theta values are a and b).  Type in theta= [0;0]. That sets the initial values of both parameters as 0.  A line like this predicts every price as 0 no matter the age of the vehicle.  But it is just our starting point!

## #6 Calculate the initial cost function

~1 Calculate the errors in prediction if we treat theta as [0;0], or that is if we treat price as a straight line and always 0.  The formula is in essence the sum of the square of the prediction errors divided by twice the number of cases.  I can’t derive this and I am not going to type it in because finding the right formula was probably part of the homework.

~2 Put the code into a function costCompute (X, y, theta) and save as a text file with extension .m (costCompute.m)

~3 This function is called repeatedly later because every time we improve our guess of the parameters (theta or a & b in the regression line), then our prediction errors will decrease.  Basically, we will stop trying to improve our parameters when we can’t reduce our prediction errors any further.

## #7 Repeatedly calculate new parameters

~1 The goal now is to iteratively improve our guesses of the parameters (theta – i.e., a & b in the regression line). The machine learning specialists call this ‘learning’ the parameters.  So, we are starting with [0,0] and we will slowly improve them.

~2  The formulas for changing the parametesr amount to calculating a minute change and taking it away from the last value.

~3 There is a different formula for the two parameters (a & b) largely because they start off as differently – as a  constant, a, and as bx. (It’s maths…)

~4 The constant, a, decreases by the alpha (set as low as 0.01 – see above) times the average error in prediction.  Again the formula is part of the homework so I won’t write it down here.

~5 The slope, b, decreases by the alpha times the average error in the prediction itself multiplied by the original x value.  I vaguely intuit this.  Again there is a formula.

~6  A new cost is calculated with the two new parameters and of course, the cost (think of it as the average prediction error) should have gone down.

## #8 Iterate lots!

~1 The iteration in this example was set at 1500!

~2 I don’t see how else the program ended. Presumably it could also be coded to end when the improvement in the cost (improvement in prediction errors) falls below an pre-stated, acceptable level.

## #9 Print out a linear fit

~1 Overlay a straight line graph on the original plot of the data

~2 Use Octave’s ‘hold on’ command to keep the old plot as the base

~3 To draw our prediction line, either calculate predicted values or simply calculate the predicted values within the plot command.  Plot (X, X*theta, ‘-‘)

## #10 Check the answer is reasonable

~1 Find the test data you set up in step 4.

~2 Calculate predicted values using the parameters we have calculated for the two test levels of X (i.e., what prices do we predict for our two ages of vehicle?).

~3 Do they make sense?  My primary school maths teacher told me ALWAYS to write the answer in plain English!

## #11 Visualize theta values in a 3D plot

~1 Use Octave’s surf command to visualize the cost (average prediction errors) of each combination of theta values (a & b).

~2 The 3d plot is bowl-shaped and the best combination of (a & b) is at the putative point where the bowl would balance (my rough understanding).

## #12 Visualize the contour plot

~1 Use Octave’s contour command to visualize the theta-theta graph (all those a’s plotted against all those b’s).

~2 Our final version of a and b should sit in the inner circle as if it was the highest point on the contour map of a mountain.

This is my record of the 12 steps in using gradient descent to do linear regression for a problem such as predicting of the price of a car from its age.  We need a large data set of recent data and we repeatedly put in values for the y intercept (value of a car when it is brand new) and the slope (the rate the value decreases). (Yeah, I know… a straight line is not necessarily the best model unless we start after the car is a year old).

The program nippily calculates how bad the model is and when it stops getting better, it stops and delivers the parameters for the line.  We can check the parameters graphically (because like all good data scientists we inspected our data at the beginning).

We can also use the contour and surf functions of Octave to plot the improvements in our estimations so that we can see what the program actually did.

I’ve written this to make sure I can do it again. I hope it is useful but the code itself is embargoed because otherwise future students of Stanford will not have to do any work and will not learn anything!

CHECK OUT SIMILAR POSTS