Principal Component Analysis for Dummies (georgemdallas.wordpress.com)
147 points by jackkinsella on Oct 30, 2013 | 31 comments


When I was studying this in college I always found the "Eigenfaces" example very enlightening (http://en.wikipedia.org/wiki/Eigenface).

In case you're not familiar with them, the basic idea is treating an image of a face as a very high dimensional vector, and then doing what amounts to PCA on a collection of them. I'm leaving off a few steps, but the resulting eigenvectors converted back into images helped me grasp what was going on in a much more intuitive fashion.
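
A rough sketch of the idea in NumPy, using random arrays as a stand-in for real face images (everything here is hypothetical, just to show the shape of the computation):

  import numpy as np

  rng = np.random.default_rng(0)
  # Stand-in for a real face dataset: 100 grayscale "images" of 64x64 pixels.
  faces = rng.normal(size=(100, 64, 64))

  n, h, w = faces.shape
  X = faces.reshape(n, h * w).astype(float)   # each face becomes one long row vector
  X -= X.mean(axis=0)                         # subtract the "mean face"

  # Rows of Vt are the principal directions in pixel space.
  U, s, Vt = np.linalg.svd(X, full_matrices=False)

  # Reshaping a principal direction back to (h, w) gives an "eigenface".
  eigenfaces = Vt.reshape(-1, h, w)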


About a year ago I repeated this experiment (well a very simplified version) with MegaMan sprites, including a quick write up of the process for anyone interested: http://willkurt.github.io/EigenMan/


This is how we introduce it to our undergrads. Using the faces is a great way to demonstrate model reduction and the snapshot method, as well as to give a glimpse into one of the algorithms used in face-recognition software.


Another angle: PCA is given by computing the SVD (a more general analogue of the eigenvalue/eigenvector decomposition) of a centered (mean-subtracted) representation of the data. Some idiosyncrasies of PCA then become obvious: we can't determine whether a computed result is actually the sought result or its reflection/negative, because the SVD is only unique up to sign.

This is also closer to its actual implementation: while it's true that you technically need the eigenbasis of the covariance matrix, you should not actually form the covariance matrix to get there...
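
For the curious, a minimal NumPy sketch of that route, on toy random data (the covariance version is included only to show the two are equivalent, not as a recommendation):

  import numpy as np

  rng = np.random.default_rng(0)
  X = rng.normal(size=(500, 10))           # toy data: 500 observations, 10 variables
  Xc = X - X.mean(axis=0)                  # center only; no covariance matrix formed

  # PCA straight from the SVD of the centered data.
  U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
  components = Vt                          # principal directions (rows)
  explained_var = s**2 / (len(X) - 1)      # same as the covariance eigenvalues

  # Equivalent (but less well-conditioned) route via the covariance matrix.
  evals, evecs = np.linalg.eigh(Xc.T @ Xc / (len(X) - 1))
  # evals[::-1] matches explained_var, and the columns of evecs
  # match the rows of Vt up to sign.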


> we can't determine if a computed result is actually a sought result or its reflection/negative

That's not how I'd explain it. The 'sought result' and its reflection are not two different things. They are the same thing, differing only in a trivial detail of orientation. The 'negative' of the result conveys exactly the same information as the result.


  you should not actually form the covariance matrix to get there...
In case anyone's wondering why: it's not only because it takes extra time. The main reason is that forming X*X^T squares the condition number, which can introduce numerical instability where a direct SVD(X) would work just fine.
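
A small demonstration with the classic Lauchli example (nothing to do with the article's data, just an illustration of the lost precision):

  import numpy as np

  # A tall thin matrix whose columns are nearly parallel.
  eps = 1e-8
  X = np.array([[1.0, 1.0],
                [eps, 0.0],
                [0.0, eps]])

  # Direct SVD recovers both singular values fine.
  print(np.linalg.svd(X, compute_uv=False))    # ~ [1.41421356e+00, 1.0e-08]

  # Forming X'X squares the condition number; eps**2 = 1e-16 is lost in float64,
  # so the small eigenvalue comes out as ~0 instead of eps**2.
  print(np.linalg.eigvalsh(X.T @ X))           # ~ [0.0, 2.0]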


I remember this being elegantly demonstrated here: http://math.stackexchange.com/questions/3869/what-is-the-int...


With mean-corrected observations as the rows of a data matrix X, the sample covariance matrix is proportional to X'X, not XX'.

Also

  (X'X)g = λg

  (XX')(Xg) = λ(Xg)

So with n >> p and n large, you may be able to fit X'X (p^2 entries) in memory but not XX' (n^2 entries), and still get the same eigenvalues and the related eigenvectors.
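
A quick NumPy check of that identity, on made-up data (n = 1000, p = 5):

  import numpy as np

  rng = np.random.default_rng(0)
  n, p = 1000, 5                          # n >> p
  X = rng.normal(size=(n, p))
  X -= X.mean(axis=0)

  # Work with the small p x p matrix X'X ...
  lam, G = np.linalg.eigh(X.T @ X)        # (X'X) g = lam * g

  # ... then (XX')(Xg) = lam * (Xg), so Xg is an eigenvector of the huge
  # n x n matrix XX' with the same eigenvalue, without ever forming XX'.
  g = G[:, -1]                            # eigenvector with the largest eigenvalue
  v = X @ g
  print(np.allclose(X @ (X.T @ v), lam[-1] * v))   # True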


It seems to me that explanations of technical topics in natural, everyday language are very valuable. People who would be turned off by a formalized explanation and dense symbolic manipulation can still get a lot out of this. Bravo.


Does anyone else have an extremely hard time trying to read it due to being blinded by the bright yellow?


Definitely. I don't think it would be that bad on a 1024x768 screen, but on a widescreen it's really bright.

EDIT: By the way, you can make it readable with:

  // paste into the browser console to swap the yellow background for white
  document.getElementById("wrapper").style.backgroundColor = 'white'


I couldn't read it at all. Need to modify the stylesheet.


I like the goal; thanks for the presentation. Personally, I'd have preferred an example on high-dimensional data like curve fitting, where this is actually most important, but perhaps that's because I'm a nerd and I like graphs.

FTIR data is a fantastic example for PCA -- each principal factor ends up (probably) being the spectrum of one of the major real physical components. But maybe this is too abstract?

A less abstract one might be a distribution of test scores. Your actual dataset is "number" versus "score", and you could show two Gaussians, one centered at a low score and one at a high score. Then you could show that across three exams you always see the same scores, but with different intensities. That would let you compute that the principal components are those two Gaussians. Then you can hypothesize that each group is a collection of students that study together, and so they get similar scores. Or something like that.

Anyway, no intent to be a wet blanket. It's a nice writeup, and it is nice of you to share.


I found Lindsay I. Smith's "A tutorial on Principal Components Analysis" [1] really useful because it covers the mathematics behind PCA but gives enough linear algebra background for it to be understandable by those with distant or weak math backgrounds (e.g., me).

[1] http://www.cs.otago.ac.nz/cosc453/student_tutorials/principa...


Thanks, that is incredibly interesting, especially the example at the end relating to facial recognition.


Unfortunately returns a 403...


PCA is cool, but I find it maddening when people convert sparse data (like counts of how many words are shared between documents) into dense distance data just to use it.

You can shortcut the whole process by finding the smallest non-zero eigenvalue/eigenvector pairs of the graph Laplacian (the Fiedler vectors). You need a sparse solver that can find the smallest values/vectors instead of the largest (like LOBPCG), but that is faster anyway.
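
For example, with SciPy's LOBPCG solver (a random sparse graph here as a stand-in for real document data):

  import numpy as np
  import scipy.sparse as sp
  from scipy.sparse.csgraph import laplacian
  from scipy.sparse.linalg import lobpcg

  rng = np.random.default_rng(0)

  # Hypothetical sparse symmetric adjacency matrix (e.g. shared-word counts
  # between documents); here just a random undirected graph for illustration.
  A = sp.random(500, 500, density=0.02, random_state=0)
  A = sp.triu(A, k=1)                    # drop the diagonal and lower triangle
  A = ((A + A.T) > 0).astype(float)

  L = laplacian(A, normed=False)

  # Ask LOBPCG for the smallest eigenpairs (largest=False); skip the trivial
  # ~0 eigenvalue and keep the next one, the Fiedler vector.
  k = 4
  X0 = rng.normal(size=(L.shape[0], k))
  vals, vecs = lobpcg(L, X0, largest=False, tol=1e-8, maxiter=500)
  fiedler = vecs[:, np.argsort(vals)[1]]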


Wow, this brings back memories. I remember using PCA for feature extraction from image data, to be fed into SVM-based image classification. Though as I recall, PCA added a huge tax on the processing time and provided only a small boost in accuracy in comparison. (IIRC we split the data 4:1 into training and classification sets.)
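
Roughly this kind of pipeline in scikit-learn (the built-in digits dataset as a stand-in for the original image data; not the actual setup described above):

  from sklearn.datasets import load_digits
  from sklearn.decomposition import PCA
  from sklearn.model_selection import train_test_split
  from sklearn.pipeline import make_pipeline
  from sklearn.svm import SVC

  X, y = load_digits(return_X_y=True)
  # Same 4:1 split as mentioned above.
  X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

  # Project the 64 pixel features onto the top principal components,
  # then classify in the reduced space.
  model = make_pipeline(PCA(n_components=20), SVC(kernel="rbf"))
  model.fit(X_tr, y_tr)
  print(model.score(X_te, y_te))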


Thank you for this primer, this is related to something I'm studying right now and is much easier to understand.


Doesn't the spread-out-ness then depend on the units? If your data have unit X on one axis and unit Y on another, how can you say that the "maximal spread-out-ness" is in any given direction, when you can merely adjust the scale on one axis and alter how numerically spread out it looks?


Yes. The author of "Information Theory, Inference and Learning Algorithms" had this to say about it:

http://www.amazon.com/review/R16RJ2PT63DZ3Q/ref=cm_cr_rev_de...


An interesting argument.

It seems like PCA is a method that would only mean something if it were applied to comparable dimensions anyway. What would transformed, dimensioned variables mean? Chart A = mass - 3*charge against B = mass + 2*charge. What could a correlation between those mean?


It's not dimensionally invalid if you remove the dimensions via normalization.


In most cases each variable is standardized to put them all on a comparable scale.


Normalizing to an arbitrary value still makes it arbitrary.


Each variable is standardized to mean = 0, standard deviation =1. If you reject this as arbitrary, you are rejecting correlation analysis as a whole - this is exactly the same standardization done to two variables in bivariate correlation, extended to a multivariate data set.

PCA is a form of (or at least related to) correlation. With standardization, the resulting transformation highlights the variables in the original data that are most highly correlated. Without standardization you're visualizing covariation. Unlike correlation, covariation is influenced by the magnitude of the variables.

By standardizing, you control for differences in the magnitude of the variables, and focus on their inherent variation instead.
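
A small NumPy illustration of the difference (made-up variables and units):

  import numpy as np

  rng = np.random.default_rng(0)
  # Two correlated variables on wildly different scales (hypothetical units).
  a = rng.normal(size=200)
  X = np.column_stack([1000 * a + rng.normal(size=200),
                       0.001 * a + 0.001 * rng.normal(size=200)])

  # Covariance PCA: the large-magnitude variable dominates the first component.
  Xc = X - X.mean(axis=0)
  print(np.linalg.svd(Xc, compute_uv=False) ** 2)

  # Correlation PCA: standardize to mean 0, sd 1 first; now only the
  # correlation structure matters, not the units.
  Z = Xc / X.std(axis=0, ddof=1)
  print(np.linalg.svd(Z, compute_uv=False) ** 2)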


1) this doesn't work for cases where your data are positive-definite.

However, let's set that aside. I apologize for being a bit obfuscatory. My point is: If this is the case, then the explanation in the OP is totally misleading, because your data shouldn't look like an ellipsoid, but rather a circle. PCA should only be used in situations where there is a reason to believe there is a mechanistically justifiable "hidden value" that underlies otherwise uncontrolled "independent variables", thus making a dimensional reduction reasonable.

This is not at all the situation that the OP goes over in the first part of the post.


The example was a little clunky, but I don't find it misleading. A biplot of two normalized variables is elliptical, if the variables are correlated. This particular hand-drawn example does indeed look a bit weird, but that doesn't detract from the main point. It clearly shows the relationship between the original data and the ordination; it's a rigid rotation.

This is easily grasped with a 2d example, despite the fact that PCA makes no sense with only two variables.


Yes, there are hundreds of these online. Though I must admit the images are well done at conveying the main point of dimensionality reduction: those who don't necessarily understand eigenvectors or covariance matrices will still be able to see what's happening between the lines.


That's brilliant. I studied the maths behind all this years ago, and only now do I find an intuitive explanation of it. Many thanks.


These intuitive guides are great for those of us who've built an intuitive understanding of higher math through practical application on computers instead of formal, academic means.



