
Thanks for the support.

I've been spending a lot more time lately reading up on this type of stuff for a huge scraping/data mining project. My background is mostly in Physics so depending on what I happen to be working on at any given moment there's always some catching up to do.

Anything particularly interesting gets posted here for others to enjoy and comment. :)



I'm doing the same right now. Massive scraping plus "gisting" or document summarization. You're pretty much on the right track; half of those papers are industry standards (my browser marked them as "visited" automatically :-)


So which of those would be a good starting point for someone who has no idea about IR?


Honestly, I don't think LSI is a good place to start. It'd be better to start a few steps back with basic IR algorithms and data structures: tries, inverted indexes, stemming, and string search algorithms like KMP and Boyer-Moore.

When I got started with IR I used Baeza-Yates's "Modern Information Retrieval"; Manning's new "Introduction to Information Retrieval" looks great, though I haven't gotten my hands on a copy yet. Here are the Wikipedia pages on the concepts above:

http://en.wikipedia.org/wiki/Trie

http://en.wikipedia.org/wiki/Inverted_index

http://en.wikipedia.org/wiki/Stemming

http://en.wikipedia.org/wiki/Knuth–Morris–Pratt_algorithm

http://en.wikipedia.org/wiki/Boyer–Moore_string_search_algor...
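
To make the inverted index idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the function names (tokenize, build_inverted_index, search), the three toy documents, and the crude regex tokenizer, which does no stemming. A real engine would also store term positions and frequencies.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split on non-alphanumeric characters (no stemming)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search(index, query):
    """Return ids of documents containing every query term (boolean AND)."""
    terms = tokenize(query)
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


# Hypothetical toy corpus, keyed by document id.
docs = {
    1: "Tries store strings by shared prefix",
    2: "An inverted index maps terms to documents",
    3: "Stemming reduces terms to a common root",
}
index = build_inverted_index(docs)
```

Calling search(index, "terms root") intersects the posting lists for "terms" and "root", so only documents containing both words come back. Running a stemmer inside tokenize would let "term" and "terms" share one posting list.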


Start with LSA (latent semantic analysis, also known as latent semantic indexing, or LSI) if you want information gisting/summarization.

I'm not worried about IR because I use a good search-engine library, Montezuma, and it's in Common Lisp. Its Java clone is called Lucene :-)


A Google search gave me this: http://www.cs.utk.edu/~lsi/

Is that what you're talking about?


Bruno, we've got to talk. I go to IU and you seem like a particularly interesting individual. I'll shoot you an email soon.


Good choice. It is most definitely an underrated school. I miss the burrow.



