What I find interesting is that it's much better to leverage a small amount of quality data than a large quantity of crappy data. To me, it's the difference between running a well-designed study on a small group and running a large study that's poorly controlled. Computational gymnastics can only do so much to clean up a multivariate mess. Improving data quality requires understanding user psychology (i.e., better design), whereas any engineer can build a massive database. The problem is deciding what's most important for the problem at hand. Collecting more data now, to figure it out later, only introduces more noise into the analysis.