Google Is All About Large Amounts of Data

bootload · on Dec 17, 2007

"I have always believed (well, at least for the past 15 years) that the way to get better understanding of text is through statistics rather than through hand-crafted grammars and lexicons. The statistical approach is cheaper, faster, more robust, easier to internationalize, and so far more effective."

That's how Google is attacking the "parsing text problem" to find meaning. [0] Not with regular expressions, rules, or clever AI hacks. Just plain old math.

[0] Attributed to Peter Norvig. Here's an example of what is suggested: a spell checker in about 25 lines of Python (old but good) ~ http://norvig.com/spell-correct.html
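
To make the statistical idea concrete, here is a minimal Python sketch in the spirit of the linked essay. It is not Norvig's code: the toy corpus, the names (CORPUS, WORDS, edits1, known, correct), and the limit of a single edit are all assumptions made for illustration. The point is only that word counts plus a simple candidate generator go a long way with no hand-written grammar.

```python
import re
from collections import Counter

# Toy corpus, invented for this sketch. A real corrector would count words
# from a large body of text.
CORPUS = """
the quick brown fox jumps over the lazy dog
the spelling of a word is something we can correct with statistics
correct spelling is mostly a matter of counting words
"""

# Word frequencies stand in for "how likely is this word at all".
WORDS = Counter(re.findall(r"[a-z]+", CORPUS.lower()))
LETTERS = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """All strings one delete, transpose, replace, or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in LETTERS]
    inserts = [a + c + b for a, b in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def known(words):
    """Keep only the candidates that were seen in the corpus."""
    return {w for w in words if w in WORDS}


def correct(word):
    """Most frequent known word within one edit, else the word unchanged."""
    candidates = known([word]) or known(edits1(word)) or {word}
    return max(candidates, key=lambda w: WORDS[w])
```

Calling correct("speling") with this toy corpus picks "spelling", because it is the only corpus word within one edit. With a tiny corpus the result is only as good as the counts, which is the thread's point about needing a lot of data.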

andreyf · on Dec 17, 2007

A spell checker in about 25 lines python (old but good)

21 lines of Python 2.5 code

robg · on Dec 18, 2007

What I find interesting is that it's much better to leverage quality data than a larger quantity of crappy data. To me, it's the difference between running a well-designed study on a small group versus a large study that's poorly controlled. Computational gymnastics can only do so much to clean up a multivariate mess. To improve data quality, you need to understand user psychology (i.e., better design). But any engineer can build a massive database. The problem is, how do you decide what's most important for the problem at hand? Collecting more data, to figure it out later, only introduces more noise into the analysis.

henning · on Dec 17, 2007

For a more technical take on this idea, search YouTube for a Peter Norvig talk called "theorizing from data".

dood · on Dec 17, 2007

[http://www.youtube.com/watch?v=nU8DcBF-qo4]

It's a good talk, though maybe it should be called "theorizing from massive amounts of data". Makes you wonder what Google are keeping under wraps for now, and how the Powerset approach can compete, except maybe for domain-specific stuff.

henning · on Dec 17, 2007

well, they do statistical learning of inflections and closely related terms.

Google "knows" (believes with high probability, I guess) that PWC is an abbreviation of PriceWaterhouseCoopers, for instance.

you can see this directly in how Google highlights terms in search results. you can certainly find instances where they "should" have gotten something but didn't, or made a mistake, but it works well enough most of the time.
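
As a rough illustration of "believes with high probability", here is a toy Python sketch that estimates P(expansion | abbreviation) by relative frequency. Nothing in this thread says how Google actually does it. The observation pairs and their counts are invented, and the names (OBSERVATIONS, expansion_probability, best_expansion) are hypothetical.

```python
from collections import Counter

# Hypothetical (abbreviation, expansion) pairs, as if mined from pages where
# both forms appear together. The counts are invented for illustration.
OBSERVATIONS = (
    [("pwc", "pricewaterhousecoopers")] * 8
    + [("pwc", "personal watercraft")] * 2
    + [("ibm", "international business machines")] * 5
)

pair_counts = Counter(OBSERVATIONS)
abbr_counts = Counter(abbr for abbr, _ in OBSERVATIONS)


def expansion_probability(abbr, expansion):
    """Relative-frequency estimate of P(expansion | abbr)."""
    if abbr_counts[abbr] == 0:
        return 0.0
    return pair_counts[(abbr, expansion)] / abbr_counts[abbr]


def best_expansion(abbr):
    """Most probable expansion and its probability, or None if abbr is unseen."""
    options = [(exp, n) for (a, exp), n in pair_counts.items() if a == abbr]
    if not options:
        return None
    exp, n = max(options, key=lambda item: item[1])
    return exp, n / abbr_counts[abbr]
```

With these made-up counts, best_expansion("pwc") prefers "pricewaterhousecoopers" at 0.8 but still leaves 0.2 for the alternative, which matches the "works well enough most of the time" behavior described above.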

cellis · on Dec 17, 2007

Very interesting to me. But I know very little about Artificial Intelligence or ML.