"I have always believed (well, at least for the past 15 years) that the way to get better understanding of text is through statistics rather than through hand-crafted grammars and lexicons. The statistical approach is cheaper, faster, more robust, easier to internationalize, and so far more effective."
That's how Google is attacking the "parsing text problem" to find meaning. [0] Not with regular expressions, rules, or clever AI hacks. Just plain old math.
[0] Attributed to Peter Norvig. Here's an example of what is suggested: a spell checker in about 25 lines of Python (old but good) ~ http://norvig.com/spell-correct.html
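For anyone who doesn't want to click through, here is a rough sketch of the idea. To be clear, this is not Norvig's code, just my own minimal version of the same approach: count words in some text, generate every string one edit away from the misspelling, and pick the candidate seen most often. The tiny CORPUS and the names edits1 and correct are my own assumptions for illustration; a real corrector would need far more text.

```python
import re
from collections import Counter

# Toy corpus (an assumption for illustration); real use needs far more text.
CORPUS = """the quick brown fox jumps over the lazy dog
the spelling of a word is checked against word counts
statistics beat hand crafted rules when the data is good"""

# Word frequencies stand in for a language model: more frequent = more likely.
WORDS = Counter(re.findall(r"[a-z]+", CORPUS.lower()))


def edits1(word):
    """Return every string one delete, transpose, replace, or insert away."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def correct(word):
    """Pick the most frequent known word within one edit, else give up."""
    word = word.lower()
    if word in WORDS:
        return word
    candidates = [w for w in edits1(word) if w in WORDS]
    if not candidates:
        return word  # nothing known nearby; return the input unchanged
    # Break frequency ties alphabetically so the result is deterministic.
    return max(candidates, key=lambda w: (WORDS[w], w))


print(correct("speling"))
print(correct("teh"))
```

No grammar, no lexicon beyond what the counts provide. The quality of the corrections depends entirely on the text you count.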
What I find interesting is that it's much better to leverage quality data than a larger quantity of crappy data. To me, it's the difference between running a well-designed study on a small group versus a large study that's poorly controlled. Computational gymnastics can only do so much to clean up a multivariate mess. To improve data quality, you need to understand user psychology (i.e., better design). But any engineer can build a massive database. The problem is, how do you decide what's most important for the problem at hand? Collecting more data, to figure it out later, only introduces more noise into the analysis.
It's a good talk, though maybe it should be called "theorizing from massive amounts of data". Makes you wonder what Google are keeping under wraps for now, and how the Powerset approach can compete, except maybe for domain-specific stuff.
Well, they do statistical learning of inflections and closely related terms.
Google "knows" (believes with high probability, I guess) that PWC is an abbreviation of PricewaterhouseCoopers, for instance.
You can see this directly in how Google highlights terms in search results. You can certainly find instances where they "should" have gotten something but didn't, or made a mistake, but it works well enough most of the time.
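To make "believes with high probability" concrete, here is a toy sketch. I have no idea how Google actually does it, so treat this purely as an illustration: the observation pairs below are made up, and expansion_probabilities is a hypothetical helper that just turns counts into relative frequencies.

```python
from collections import Counter, defaultdict

# Made-up (abbreviation, expansion) observations, purely for illustration.
OBSERVED = [
    ("pwc", "pricewaterhousecoopers"),
    ("pwc", "pricewaterhousecoopers"),
    ("pwc", "pricewaterhousecoopers"),
    ("pwc", "personal watercraft"),
]


def expansion_probabilities(pairs):
    """Estimate P(expansion | abbreviation) as a relative frequency."""
    counts = defaultdict(Counter)
    for abbreviation, expansion in pairs:
        counts[abbreviation][expansion] += 1
    return {
        abbreviation: {
            expansion: n / sum(seen.values()) for expansion, n in seen.items()
        }
        for abbreviation, seen in counts.items()
    }


probs = expansion_probabilities(OBSERVED)
best = max(probs["pwc"], key=probs["pwc"].get)
print(best, probs["pwc"][best])
```

The point is that nothing is hard-coded as "the" meaning of PWC. The system holds a belief that is only as strong as the evidence, which is also why it sometimes gets things wrong.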