Hacker News
Information Retrieval papers you need to read (scienceforseo.com)
63 points by Anon84 on April 8, 2009 | 13 comments


Anon84, you have been churning out some good links lately. What is it? Hack mode? Beautiful, isn't it? :-)

[Sorry, had to make a private remark publicly, to someone I don't know, because HN doesn't have messaging. Just had to acknowledge that Anon84, and others like him, have been keeping the place more "hackish" and less TechCrunch-fluff.]


Thanks for the support.

I've been spending a lot more time lately reading up on this type of stuff for a huge scraping/data mining project. My background is mostly in Physics so depending on what I happen to be working on at any given moment there's always some catching up to do.

Anything particularly interesting gets posted here for others to enjoy and comment. :)


I'm doing the same right now. Massive scraping plus "gisting" or document summarization. You're pretty much on the right track; half of those papers are industry standards (my browser marked them as "visited" automatically :-)


So which of those would be a good starting point for someone who has no idea about IR?


Honestly, I don't think LSI is a good place to start. It'd be better to start a few steps back with basic IR algorithms and data structures like tries, inverted indexes, stemming, string search algorithms like KMP and Boyer-Moore.

When I got started with IR I used Baeza-Yates's "Modern Information Retrieval"; Manning's new "Introduction to Information Retrieval" looks great, though I haven't gotten my hands on a copy yet. Here are the Wikipedia pages on the concepts above:

http://en.wikipedia.org/wiki/Trie

http://en.wikipedia.org/wiki/Inverted_index

http://en.wikipedia.org/wiki/Stemming

http://en.wikipedia.org/wiki/Knuth–Morris–Pratt_algorithm

http://en.wikipedia.org/wiki/Boyer–Moore_string_search_algor...
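
To make one of those concrete, here is a minimal sketch of an inverted index with boolean AND search. It assumes whitespace tokenization and lowercasing only (no stemming or stop words), and the function names and sample documents are made up for illustration:

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased token to the sorted list of doc ids containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for token in text.lower().split():
            index[token].add(doc_id)
    return {token: sorted(ids) for token, ids in index.items()}


def search(index, query):
    """Return ids of documents that contain every query token (boolean AND)."""
    postings = [set(index.get(token, ())) for token in query.lower().split()]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


# Hypothetical sample documents, just to exercise the index.
docs = [
    "tries store strings by prefix",
    "an inverted index maps terms to documents",
    "stemming maps terms to a common root",
]
index = build_inverted_index(docs)
print(search(index, "maps terms"))
print(search(index, "inverted index"))
```

A real engine would add stemming, positional postings and ranking on top of this, but the core lookup structure is the same.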


Start with LSA (latent semantic analysis, also known as LSI, latent semantic indexing) if you want information gisting/summarization.

I'm not worried about IR because I use a good search-engine library: Montezuma, and it's in Common Lisp. Its Java clone is called Lucene :-)


A Google search gave me this: http://www.cs.utk.edu/~lsi/

Is that what you're talking about?


Bruno, we've got to talk. I go to IU and you seem like a particularly interesting individual. I'll shoot you an email soon.


Good choice. It is most definitely an underrated school. I miss the burrow.


A couple other seminal favorites:

Authoritative Sources in a Hyperlinked Environment

http://www.cs.cornell.edu/home/kleinber/auth.pdf

Probabilistic Latent Semantic Analysis

http://www.cs.brown.edu/~th/papers/Hofmann-UAI99.pdf

Kleinberg, author of the first paper, is quite possibly my favorite computer scientist. He's been on the leading edge of a lot of largely disjoint areas of research on the information structure of the web and almost always drops in some really smart stuff. He seems to hit new fields and drop in some big ideas and move on, so a lot of the more concretely applicable stuff comes in refinements of his work, but sometimes I feel like every time I cross into a new problem domain he's already written three papers on it. Beyond that, he writes in a very readable style, which is frustratingly rare for top-notch computer scientists.
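
For anyone who hasn't read the first paper, the hubs-and-authorities idea can be sketched in a few lines. This is only a rough illustration: it assumes the link graph is given as a plain dict, skips the query-focused subgraph construction the paper describes, and the function name and toy graph are invented for the example:

```python
def hits(links, iterations=50):
    """Iteratively score hubs and authorities.

    links: dict mapping each page to the list of pages it links to.
    Returns (hubs, auths), two dicts of L2-normalized scores.
    """
    pages = set(links) | {dst for targets in links.values() for dst in targets}
    hubs = {p: 1.0 for p in pages}
    auths = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking in.
        auths = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                auths[dst] += hubs[src]
        # Hub score: sum of authority scores of the pages linked to.
        hubs = {p: sum(auths[dst] for dst in links.get(p, ())) for p in pages}
        # Normalize so the scores do not grow without bound.
        a_norm = sum(v * v for v in auths.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hubs.values()) ** 0.5 or 1.0
        auths = {p: v / a_norm for p, v in auths.items()}
        hubs = {p: v / h_norm for p, v in hubs.items()}
    return hubs, auths


# Toy graph (hypothetical): "a" and "b" both point at "c"; "a" also points at "d".
links = {"a": ["c", "d"], "b": ["c"], "c": [], "d": []}
hubs, auths = hits(links)
print(max(auths, key=auths.get), max(hubs, key=hubs.get))
```

In the toy graph, "c" comes out as the strongest authority because both hubs point at it, and "a" as the strongest hub because it points at both authorities.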


The "rebel king" is also a phenomenal lecturer and a favorite among all Cornell CS students.


All good links. Thanks for sharing.

Another list of books for IR: http://researchonsearch.blogspot.com/2005/12/information-ret...


Also, you guys would surely like this book too: http://www-csli.stanford.edu/~hinrich/information-retrieval-... http://nlp.stanford.edu/IR-book/pdf/irbookprint.pdf

It is a very readable and comprehensive introduction to the field. Besides, it is recent and covers several current hot topics in depth.

Here is a review of the book: http://glinden.blogspot.com/2009/02/book-review-introduction...



