Principal Component Analysis for Dummies (georgemdallas.wordpress.com)
147 points by jackkinsella on Oct 30, 2013 | 31 comments


When I was studying this in college I always found the "Eigenfaces" example very enlightening (http://en.wikipedia.org/wiki/Eigenface).

In case you're not familiar with them, the basic idea is treating an image of a face as a very high dimensional vector, and then doing what amounts to PCA on a collection of them. I'm leaving off a few steps, but the resulting eigenvectors converted back into images helped me grasp what was going on in a much more intuitive fashion.
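
A rough sketch of the idea in NumPy, using random arrays as a stand-in for real face images (everything here is hypothetical, just to show the shape of the computation):

  import numpy as np

  rng = np.random.default_rng(0)
  # Stand-in for a real face dataset: 100 grayscale "images" of 64x64 pixels.
  faces = rng.normal(size=(100, 64, 64))

  n, h, w = faces.shape
  X = faces.reshape(n, h * w).astype(float)   # each face becomes one long row vector
  X -= X.mean(axis=0)                         # subtract the "mean face"

  # Rows of Vt are the principal directions in pixel space.
  U, s, Vt = np.linalg.svd(X, full_matrices=False)

  # Reshaping a principal direction back to (h, w) gives an "eigenface".
  eigenfaces = Vt.reshape(-1, h, w)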


About a year ago I repeated this experiment (well a very simplified version) with MegaMan sprites, including a quick write up of the process for anyone interested: http://willkurt.github.io/EigenMan/


This is how we introduce it to our undergrads. Using the faces is a great way to demonstrate model reduction and the snapshot method, as well as to give a glimpse into one of the algorithms used in face-recognition software.


Another angle: PCA is given by computing the SVD (a more general analogue of the eigenvalue/eigenvector decomposition) of a centered (mean-subtracted) representation of the data. Some idiosyncrasies of PCA then become obvious: we can't determine whether a computed result is actually the sought result or its reflection/negative, because the SVD is only unique up to sign.

This is also closer to its actual implementation: while it's true that you technically need the eigenbasis of the covariance matrix, you should not actually form the covariance matrix to get there...
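
For the curious, a minimal NumPy sketch of that route, on toy random data (the covariance version is included only to show the two are equivalent, not as a recommendation):

  import numpy as np

  rng = np.random.default_rng(0)
  X = rng.normal(size=(500, 10))           # toy data: 500 observations, 10 variables
  Xc = X - X.mean(axis=0)                  # center only; no covariance matrix formed

  # PCA straight from the SVD of the centered data.
  U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
  components = Vt                          # principal directions (rows)
  explained_var = s**2 / (len(X) - 1)      # same as the covariance eigenvalues

  # Equivalent (but less well-conditioned) route via the covariance matrix.
  evals, evecs = np.linalg.eigh(Xc.T @ Xc / (len(X) - 1))
  # evals[::-1] matches explained_var, and the columns of evecs
  # match the rows of Vt up to sign.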


> we can't determine if a computed result is actually a sought result or its reflection/negative

That's not how I'd explain it. The 'sought result' and its reflection are not two different things. They are the same thing, differing only in a trivial detail of orientation. The 'negative' of the result conveys exactly the same information as the result.


  you should not actually form the covariance matrix to get there...
In case anyone's wondering why: it's not only because it takes extra time. The main reason is that forming X*X^T squares the condition number, which can introduce numerical instability where a direct SVD(X) would work just fine.
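
A small demonstration with the classic Lauchli example (nothing to do with the article's data, just an illustration of the lost precision):

  import numpy as np

  # A tall thin matrix whose columns are nearly parallel.
  eps = 1e-8
  X = np.array([[1.0, 1.0],
                [eps, 0.0],
                [0.0, eps]])

  # Direct SVD recovers both singular values fine.
  print(np.linalg.svd(X, compute_uv=False))    # ~ [1.41421356e+00, 1.0e-08]

  # Forming X'X squares the condition number; eps**2 = 1e-16 is lost in float64,
  # so the small eigenvalue comes out as ~0 instead of eps**2.
  print(np.linalg.eigvalsh(X.T @ X))           # ~ [0.0, 2.0]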


I remember this being elegantly demonstrated here: http://math.stackexchange.com/questions/3869/what-is-the-int...


With mean-corrected observations as the rows of a data matrix X, the sample covariance matrix is proportional to X'X, not XX'.

Also

  (X'X)g = λg

  (XX')(Xg) = λ(Xg)

So with n >> p and n large, you may be able to fit X'X (p^2 entries) in memory but not XX' (n^2 entries), and still get the same eigenvalues and the related eigenvectors.
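
A quick NumPy check of that identity, on made-up data (n = 1000, p = 5):

  import numpy as np

  rng = np.random.default_rng(0)
  n, p = 1000, 5                          # n >> p
  X = rng.normal(size=(n, p))
  X -= X.mean(axis=0)

  # Work with the small p x p matrix X'X ...
  lam, G = np.linalg.eigh(X.T @ X)        # (X'X) g = lam * g

  # ... then (XX')(Xg) = lam * (Xg), so Xg is an eigenvector of the huge
  # n x n matrix XX' with the same eigenvalue, without ever forming XX'.
  g = G[:, -1]                            # eigenvector with the largest eigenvalue
  v = X @ g
  print(np.allclose(X @ (X.T @ v), lam[-1] * v))   # True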


It seems to me that explanations of technical topics in natural, everyday language are very valuable. People who would be turned off by a formalized explanation and dense symbolic manipulation can still get a lot out of this. Bravo.


Does anyone else have an extremely hard time trying to read it due to being blinded by the bright yellow?


Definitely. I don't think it would be that bad on a 1024x768 screen, but on a widescreen it's really bright.

EDIT: By the way, you can make it readable with:

  // paste into the browser console to swap the yellow background for white
  document.getElementById("wrapper").style.backgroundColor = 'white'


I couldn't read it at all. Need to modify the stylesheet.


I like the goal; thanks for the presentation. Personally, I'd have preferred an example on high-dimensional data like curve fitting, where this is actually most important, but perhaps that's because I'm a nerd and I like graphs.

FTIR data is a fantastic example for PCA -- each principal factor ends up (probably) being the spectrum of one of the major real physical components. But maybe this is too abstract?

A less abstract one might be a distribution of test scores. Your actual dataset is "number" versus "score", and you could show two Gaussians, one centered at a low score and one at a high score. Then you could show that across three exams you always see the same scores, but with different intensities. That would let you compute that the principal components are those two Gaussians. Then you can hypothesize that each group is a collection of students that study together, and so they get similar scores. Or something like that.

Anyway, no intent to be a wet blanket. It's a nice writeup, and it is nice of you to share.


I found Lindsay I. Smith's "A tutorial on Principal Components Analysis" [1] really useful because it covers the mathematics behind PCA but gives enough linear algebra background for it to be understandable by those with distant or weak math backgrounds (e.g., me).

[1] http://www.cs.otago.ac.nz/cosc453/student_tutorials/principa...


Thanks, that is incredibly interesting, especially the example at the end relating to facial recognition.


Unfortunately returns a 403...


PCA is cool, but I find it maddening when people convert sparse data (like counts of how many words are shared between documents) into dense distance data just to use it.

You can shortcut the whole process by finding the smallest non-zero eigenvalue/eigenvector pairs of the graph Laplacian (the Fiedler vectors). You need a sparse solver that can find the smallest values/vectors instead of the largest (like LOBPCG), but that is faster anyway.
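
For example, with SciPy's LOBPCG solver (a random sparse graph here as a stand-in for real document data):

  import numpy as np
  import scipy.sparse as sp
  from scipy.sparse.csgraph import laplacian
  from scipy.sparse.linalg import lobpcg

  rng = np.random.default_rng(0)

  # Hypothetical sparse symmetric adjacency matrix (e.g. shared-word counts
  # between documents); here just a random undirected graph for illustration.
  A = sp.random(500, 500, density=0.02, random_state=0)
  A = sp.triu(A, k=1)                    # drop the diagonal and lower triangle
  A = ((A + A.T) > 0).astype(float)

  L = laplacian(A, normed=False)

  # Ask LOBPCG for the smallest eigenpairs (largest=False); skip the trivial
  # ~0 eigenvalue and keep the next one, the Fiedler vector.
  k = 4
  X0 = rng.normal(size=(L.shape[0], k))
  vals, vecs = lobpcg(L, X0, largest=False, tol=1e-8, maxiter=500)
  fiedler = vecs[:, np.argsort(vals)[1]]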


Wow, this brings back memories. I remember using PCA for feature extraction from image data, to be fed into SVM-based image classification. Though as I recall, PCA added a huge tax on the processing time and provided only a small boost in accuracy in comparison. (IIRC we split the data 4:1 into training and classification sets.)
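
Roughly this kind of pipeline in scikit-learn (the built-in digits dataset as a stand-in for the original image data; not the actual setup described above):

  from sklearn.datasets import load_digits
  from sklearn.decomposition import PCA
  from sklearn.model_selection import train_test_split
  from sklearn.pipeline import make_pipeline
  from sklearn.svm import SVC

  X, y = load_digits(return_X_y=True)
  # Same 4:1 split as mentioned above.
  X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

  # Project the 64 pixel features onto the top principal components,
  # then classify in the reduced space.
  model = make_pipeline(PCA(n_components=20), SVC(kernel="rbf"))
  model.fit(X_tr, y_tr)
  print(model.score(X_te, y_te))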


Thank you for this primer, this is related to something I'm studying right now and is much easier to understand.


Doesn't the spread-out-ness then depend on the units? If your data have unit X on one axis and unit Y on another, how can you say that the "maximal spread-out-ness" is in any given direction, when you can merely adjust the scale on one axis and alter how numerically spread out it looks?


Yes. The author of "Information Theory, Inference and Learning Algorithms" had this to say about it:

http://www.amazon.com/review/R16RJ2PT63DZ3Q/ref=cm_cr_rev_de...


An interesting argument.

It seems like PCA is a method that would only mean something if it were applied to comparable dimensions anyway. What would transformed, dimensioned variables mean? Chart A = mass - 3*charge against B = mass + 2*charge. What could a correlation between those mean?


It's not dimensionally invalid if you remove the dimensions via normalization.


In most cases each variable is standardized to put them all on a comparable scale.


Normalizing to an arbitrary value still makes it arbitrary.


Each variable is standardized to mean = 0, standard deviation =1. If you reject this as arbitrary, you are rejecting correlation analysis as a whole - this is exactly the same standardization done to two variables in bivariate correlation, extended to a multivariate data set.

PCA is a form of (or at least related to) correlation. With standardization, the resulting transformation highlights the variables in the original data that are most highly correlated. Without standardization you're visualizing covariation. Unlike correlation, covariation is influenced by the magnitude of the variables.

By standardizing, you control for differences in the magnitude of the variables, and focus on their inherent variation instead.
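
A small NumPy illustration of the difference (made-up variables and units):

  import numpy as np

  rng = np.random.default_rng(0)
  # Two correlated variables on wildly different scales (hypothetical units).
  a = rng.normal(size=200)
  X = np.column_stack([1000 * a + rng.normal(size=200),
                       0.001 * a + 0.001 * rng.normal(size=200)])

  # Covariance PCA: the large-magnitude variable dominates the first component.
  Xc = X - X.mean(axis=0)
  print(np.linalg.svd(Xc, compute_uv=False) ** 2)

  # Correlation PCA: standardize to mean 0, sd 1 first; now only the
  # correlation structure matters, not the units.
  Z = Xc / X.std(axis=0, ddof=1)
  print(np.linalg.svd(Z, compute_uv=False) ** 2)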


1) this doesn't work for cases where your data are positive-definite.

However, let's set that aside. I apologize for being a bit obfuscatory. My point is: If this is the case, then the explanation in the OP is totally misleading, because your data shouldn't look like an ellipsoid, but rather a circle. PCA should only be used in situations where there is a reason to believe there is a mechanistically justifiable "hidden value" that underlies otherwise uncontrolled "independent variables", thus making a dimensional reduction reasonable.

This is not at all the situation that the OP goes over in the first part of the post.


The example was a little clunky, but I don't find it misleading. A biplot of two normalized variables is elliptical, if the variables are correlated. This particular hand-drawn example does indeed look a bit weird, but that doesn't detract from the main point. It clearly shows the relationship between the original data and the ordination; it's a rigid rotation.

This is easily grasped with a 2d example, despite the fact that PCA makes no sense with only two variables.


Yes, there are hundreds of these online. Though I must admit the images are well done at conveying the main point of dimensionality reduction: those who don't necessarily understand eigenvectors or covariance matrices will still be able to see what's happening between the lines.


That's brilliant. I studied the maths behind all this years ago, and only now do I find an intuitive explanation of it. Many thanks.


These intuitive guides are great for those of us who've built an intuitive understanding of higher math through practical application on computers instead of formal, academic means.



