Anon84, you have been churning out some good links lately. What is it? Hack mode? Beautiful, isn't it? :-)
[Sorry, had to make a private remark publicly, to someone I don't know, because HN doesn't have messaging. Just had to acknowledge that Anon84, and others like him, have been keeping the place more "hackish" and less TechCrunch-fluff.]
I've been spending a lot more time lately reading up on this type of stuff for a huge scraping/data mining project. My background is mostly in Physics so depending on what I happen to be working on at any given moment there's always some catching up to do.
Anything particularly interesting gets posted here for others to enjoy and comment on. :)
I'm doing the same right now. Massive scraping plus "gisting" or document summarization. You're pretty much on the right track; half of those papers are industry standards (my browser marked them as "visited" automatically :-)
Honestly, I don't think LSI is a good place to start. It'd be better to start a few steps back with basic IR algorithms and data structures like tries, inverted indexes, stemming, string search algorithms like KMP and Boyer-Moore.
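To make one of those concrete, here is a minimal sketch of an inverted index in Python. The names (`tokenize`, `build_inverted_index`, `search`) and the toy documents are made up for illustration, and the tokenizer is deliberately naive (lowercase, alphanumeric runs only, no stemming), so treat it as a starting point rather than how a real engine does it.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Naive tokenizer (an assumption): lowercase alphanumeric runs."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Boolean AND query: ids of documents containing every query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    # .get avoids creating empty entries for unseen terms in the defaultdict.
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Hypothetical toy corpus.
docs = {
    1: "Tries store strings by prefix",
    2: "Inverted indexes map terms to documents",
    3: "Stemming maps terms to a common root",
}
index = build_inverted_index(docs)
print(search(index, "terms to"))   # documents 2 and 3
print(search(index, "map terms"))  # only document 2
```

Note that the second query misses document 3 because "maps" and "map" are different tokens here, which is exactly the gap stemming is meant to close.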
When I got started with IR I used Baeza-Yates's "Modern Information Retrieval"; Manning's new "Introduction to Information Retrieval" looks great, though I haven't gotten my hands on a copy yet. Here are the Wikipedia pages on the concepts above:
Kleinberg, author of the first paper, is quite possibly my favorite computer scientist. He's been on the leading edge of a lot of largely disjoint areas of research on the information structure of the web and almost always drops in some really smart stuff. He seems to hit new fields and drop in some big ideas and move on, so a lot of the more concretely applicable stuff comes in refinements of his work, but sometimes I feel like every time I cross into a new problem domain he's already written three papers on it. Beyond that, he writes in a very readable style, which is frustratingly rare for top-notch computer scientists.