This post is a step-by-step, practical, guide to principal components analysis. It’s very hands-on and “common sensical”. If any experts out there spot an egregious error that would horribly mislead a beginner, please do let me know.
I’ll simply work through 4 steps and then sum up as 5 steps in a slightly different order.
I always like to start any data problem by thinking about my data rather concretely.
In a PCA problem, imagine a spreadsheet which has hundreds or even thousands of columns – far too many for comfort.
On each row of the spreadsheet is a case, or ‘training example’ in machine learning parlance.
What we want to do is to find the columns that matter. Alternatively, we ask “which columns could we bundle together into computed columns so that we have a more manageable number?”
In short, this procedure will tell us which columns we can leave out and which one’s we should bundle together. The bundles will be the principal components. And the final answer will tell us how much precision we have lost by bundling scores rather than using original raw data.
So, we begin with data which is laid out in matrix X with m rows and n columns (m x n).
#2 Normalize the data
The data comes in all sizes – little numbers and big numbers, very spread out and bunched together. First we smooth the data in the same way that tests are normed at college. Simply, we convert each column to a mean of zero and a standard deviation of one.
To be very clear how this works, we take each cell and adjust the number in the cell depending on the other numbers in the column. If we were working with spreadsheets, we would open another spreadsheet of exactly the same number of rows and columns and add this formula to each cell. So for cell A1 we would have:
= (Sheet1!A1 – mean(ColumnA))/StdDev(ColA)
When we calculate the mean and stddev of the columns in the new spreadsheet, they will all be 0 and 1 respectively.
#3 Principal Component Analysis
Now we find the ‘bundles’ of columns.
In my old days of statistics packages, the program would return a table which listed all the columns down the page and then produced factor loadings or weights for a whole heap of factors (or bundles) in more columns. The number and the sign would tell you the weight of the original data column in the new ‘bundle’. The weights varied from -1 through 0 to +1.
In Octave, the free version of Matlab, there is a facility to do PCA in two steps:
Step #3 Part One
- Compute what is called the covariance matrix. Simple imagine taking a copy of the spreadsheet (the second one), multiplying it cell to cell (A1 to A1) and taking the sum of those squares in the new column A, then A1 to B1 and taking the sum of the product in column B, then A1 to C1.. etc until we have new row with N columns each got by multiplying two columns and adding up the product. You’ll have to try it yourself. I’ll have to get out pen and paper when I read this a year from now.
- Then we do the same starting with Col B and Col A (that’s a repeat, I know.. stick it in), B to B, B to C etc.
- Until we have a new matrix with N columns and N rows. (Yes – this is what computers are for).
- And one more sub- step – divide every cell by the original number of cases or training examples (i.e., rows in the very first spreadsheet).
That’s the covariance matrix. In Octave, which uses linear algebra, it is much easier. You just tell the machine to multiply the transpose of the normalized data by the normalized data and divide by m – one line of code.
CovarianceMatrix = (X’ * X )/m
(That’s what computers are for!.. the explanation was just so you have a concrete idea of where the data came from and what happened to it).
Step #3 Part One
The second step in PCA is to find the bundles using a function that is built into Octave called the ‘singular value decomposition’ or SVD.
All you do is ask for it and it ‘returns’ three matrices, U, S and V and we are going to use U and S.
U gives us a matrix exactly the same size as the covariance matrix. Each column now refers to a ‘bundle’. The rows still refer as before to the features (that is the original columns in the data matrix and the normalized data matrix. Have a quick check. I’ll wait!).
Note we have as many bundles as we had columns right at the start but we don’t want to use all the unbundles (columns in the U matrix) otherwise we will have exactly the same number of columns as when we started – no point, hey?
So we will only use as many, starting from the left, as we need. We will decide how many to use on the basis of the S matrix.
Let’s do that later and consider further what U actually tells us. As we read down column one, cell A1 tells us the loading of original column A, or the 1st feature, on the new bundle 1. Cell A2 tells us the loading of original column B or the 2nd feature, on new bundle 1. And so on. Got it? Try it out.
So what can we do with this? Two things –
- We can see which of our original columns were the most important. They are the ones with the biggest numbers in column on the left and subsequent columns as you move right. A positive number means the higher the original number the higher would be the bundle score. A negative number in this new table means the higher the number in the original table, the lower would be the bundle score. Good to know if two of our original columns pull in the opposite directions. So that is the first use – to understand the original columns and how they hang together.
- The second use is to create a simplified data set. Ok, we hate it when bureaucrats create scores for us – like a credit rating. But imagine the rows are just pictures and the columns were the pixels or colors at 10 000 points on page – collapsing 10 000 columns into 1000 columns or 100 columns could be handy for data compression. (We’ll work out later how much data is lost – or how much blur is added!) So how do we recreate the scores? We will come back to this – lets stick with understanding what those numbers in the U matrix mean. All we have to do to get a score for the bundle is take the number in the U matrix for original column A (now in row 1) and multiple it by the score for the first case in column A (do it on a bit of paper). Do that for the whole row for the case times the whole column in U (row of original data times column in the U matrix), add it up, and we get a ‘bundle’ score for the first case. That will go in cell A1 in a new table. The cell is the score for case 1 on bundle 1.
- Then we can do the same for the second case, then the third. And we will make a new column of bundled scores for the whole data set.
- Then we do the same for the second bundle (it’s OK – that’s what computers are for).
- Finally we have a matrix with as many rows as we have cases or training examples and as many columns as we have new bundles. This can be useful when we need compressed data and we don’t mind a bit of blur in the data.
#4 How many bundles do we need?
So now we come back to how many bundles do we need? Well firstly, a lot fewer than the number of columns that we started with. That’s the whole idea of this exercise – to get that original spreadsheet a lot, lot smaller.
I mentioned before that we use the data for the second matrix, S, that is churned out by the SVD function in Octave, to work out how many bundles to keep.
This S matrix is also exactly the same size as the covariance matrix which was square with the same number of rows and columns as we had columns in the first, first, first data table.
This time, though, we only have data in the diagonal from top left to bottom right. Every other cell is zero. So that means there is a number for original row 1 and column A; row 2 and column B; etc. Gee, couldn’t we have a column? Yes we could. It’s laid out this way to do with the way machines do arithmetic. It is easier for the machine to pull out the matching diagonal from the U matrix for example. But that’s not our problem right now. We just want to know how to use these numbers to work out how many bundles to keep.
Well, these numbers represent how much variance is explained by the bundle. The very first number (top left) tells us how much variance in the whole original data set is explained by the new bundle. To work out what % of variance is accounted for by this bundle, we take all the numbers on the diagonal and add them up to give us a number representing all the variance in the whole data set. Then we take the number for the first bundle (top left) and work it out as a percentage of the whole lot. If the percentage is less than 99% (.99), then we add another bundle (well we add the percentage of another bundle or we add the numbers of the two bundles and divide by sum for all the numbers). We just keep going until we have enough bundles to account for 99% of the original variance. (So in plain terms, we have allowed for 1% of blurring).
Oddly, only 1% of blurring might allow us to lose a lot of columns. Or more precisely, when we compute new scores, one for each bundle in the final solution, we might have far fewer bundles than original number of columns, but still account for 99% of the original amount of detail.
That’s it…that’s PCA.
#1 Get your data in a spread sheet (cases on rows, features in columns)
#2 Normalize the data so each column has a mean of 0 and an SD of 1
#3 Use a built-in function to return a matrix of eigenvectors (U) and variance (S)
#4 Decide how many ‘bundles’ of features to keep in (account for 99% of variance)
#5 Compute new scores – one score for each case for each bundle (now the columns)
And what do we do with this?
#1 We understand how columns hang together and what we can drop and only lose 1% of detail (or add 1% of blur)
#2 We can use the new scores to do other things like feed them into a prediction program or supervised learning program. The advantage is not to improve on prediction, btw, but to cut down on computing costs.
CHECK OUT SIMILAR POSTS