A tutorial on Principal Components Analysis
Lindsay I Smith
February 26, 2002
Chapter 1
Introduction
This tutorial is designed to give the reader an understanding of Principal Components Analysis (PCA). PCA is a useful statistical technique that has found application in fields such as face recognition and image compression, and is a common technique for finding patterns in data of high dimension.
Before getting to a description of PCA, this tutorial first introduces mathematical concepts that will be used in PCA. It covers standard deviation, covariance, eigenvectors and eigenvalues. This background knowledge is meant to make the PCA section very straightforward, but can be skipped if the concepts are already familiar.
There are examples all the way through this tutorial that are meant to illustrate the concepts being discussed. If further information is required, the mathematics textbook “Elementary Linear Algebra 5e” by Howard Anton, Publisher John Wiley & Sons Inc, ISBN 0-471-85223-6 is a good source of information regarding the mathematical background.
Chapter 2
Background Mathematics
This section will attempt to give some elementary background mathematical skills that will be required to understand the process of Principal Components Analysis. The topics are covered independently of each other, and examples given. It is less important to remember the exact mechanics of a mathematical technique than it is to understand the reason why such a technique may be used, and what the result of the operation tells us about our data. Not all of these techniques are used in PCA, but the ones that are not explicitly required do provide the grounding on which the most important techniques are based.
I have included a section on Statistics which looks at distribution measurements, or, how the data is spread out. The other section is on Matrix Algebra and looks at eigenvectors and eigenvalues, important properties of matrices that are fundamental to PCA.
2.1 Statistics
The entire subject of statistics is based around the idea that you have this big set of data, and you want to analyse that set in terms of the relationships between the individual points in that data set. I am going to look at a few of the measures you can do on a set of data, and what they tell you about the data itself.
2.1.1 Standard Deviation
To understand standard deviation, we need a data set. Statisticians are usually concerned with taking a sample of a population. To use election polls as an example, the population is all the people in the country, whereas a sample is a subset of the population that the statisticians measure. The great thing about statistics is that by only measuring (in this case by doing a phone survey or similar) a sample of the population, you can work out what is most likely to be the measurement if you used the entire population. In this statistics section, I am going to assume that our data sets are samples of some bigger population. There is a reference later in this section pointing to more information about samples and populations.
Here’s an example set:
I could simply use the symbol $X$ to refer to this entire set of numbers. If I want to refer to an individual number in this data set, I will use subscripts on the symbol $X$ to indicate a specific number. E.g. $X_3$ refers to the 3rd number in $X$, namely the number 4. Note that $X_1$ is the first number in the sequence, not $X_0$ like you may see in some textbooks. Also, the symbol $n$ will be used to refer to the number of elements in the set $X$.
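When translating this notation into code, it helps to remember that Python lists count from 0, so the 3rd number of a set sits at index 2. The short sketch below illustrates the mapping. The list values and the helper name `element` are hypothetical, chosen only so that the 3rd number is 4 as in the text.

```python
# Hypothetical sample: only the third value (4) is taken from the text.
data = [1, 2, 4, 6]


def element(values, i):
    """Return the i-th number using 1-based subscripts, as in the tutorial."""
    if not 1 <= i <= len(values):
        raise IndexError("subscript must be between 1 and n")
    return values[i - 1]  # shift to Python's 0-based indexing


n = len(data)             # number of elements in the set
first = element(data, 1)  # the first number in the sequence
third = element(data, 3)  # the 3rd number in the set

print(n, first, third)
```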
There are a number of things that we can calculate about a data set. For example, we can calculate the mean of the sample. I assume that the reader understands what the mean of a sample is, and will only give the formula:
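Writing $X_i$ for the $i$-th number in the set and $n$ for the number of elements, the sample mean is conventionally written as follows. The symbol $\bar{X}$ for the mean is an assumption here, chosen to match common usage.

```latex
\[
  \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}
\]
```

In words: add up all the numbers in the set and divide by how many there are.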
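Set 1:
  Total                 208
  Divided by (n-1)      69.333
  Square Root           8.3266

Set 2:
  $X$     $(X - \bar{X})$     $(X - \bar{X})^2$
  8       -2                  4
  9       -1                  1
  11      1                   1
  12      2                   4
  Total                 10
  Divided by (n-1)      3.333
  Square Root           1.8257

Table 2.1: Calculation of standard deviation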
difference between each of the denominators. It also discusses the difference between samples and populations.
So, for our two data sets above, the calculations of standard deviation are in Table 2.1.
And so, as expected, the first set has a much larger standard deviation due to the fact that the data is much more spread out from the mean. Just as another example, the data set:
also has a mean of 10, but its standard deviation is 0, because all the numbers are the same. None of them deviate from the mean.
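The steps of Table 2.1 can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the function name `sample_std_steps` is hypothetical, the values 8, 9, 11, 12 are the rows that appear in the table, and the length of the all-equal set (four 10s) is an assumption.

```python
import math


def sample_std_steps(data):
    """Return (total, total / (n - 1), square root), mirroring Table 2.1."""
    n = len(data)
    mean = sum(data) / n
    # Squared distance of each number from the mean.
    squared_devs = [(x - mean) ** 2 for x in data]
    total = sum(squared_devs)
    divided = total / (n - 1)  # sample denominator, as in the table
    return total, divided, math.sqrt(divided)


set_2 = [8, 9, 11, 12]  # rows shown in Table 2.1
total, divided, std = sample_std_steps(set_2)
print(total, round(divided, 3), round(std, 4))  # 10.0 3.333 1.8257

constant = [10, 10, 10, 10]  # assumed length; every value equals the mean
print(sample_std_steps(constant)[2])  # 0.0
```

The second call shows why a set whose numbers are all the same has a standard deviation of 0: every squared distance from the mean is 0, so the total is 0 as well.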
2.1.2 Variance
Variance is another measure of the spread of data in a data set. In fact it is almost identical to the standard deviation. The formula is this:
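Assuming the same $(n-1)$ denominator used in Table 2.1, the sample variance can be written as below. The symbols $s^2$ for the variance and $\bar{X}$ for the mean are assumptions chosen to match common usage, with $X_i$ the $i$-th number in the set and $n$ the number of elements.

```latex
\[
  s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}
\]
```

Under this reading, the variance is the "Divided by (n-1)" row of Table 2.1, and the standard deviation is its square root.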