Hacker News

I can't believe nobody has mentioned naive Bayesian text classification yet. It sounds like it could work wonders for Twitter. I'm much more likely to be interested in tweets with words like "hylomorphism" than tweets with words like "omglol", and a text classification algorithm could learn that if you trained it up some. It doesn't have to be perfect; it just has to improve the signal-to-noise ratio significantly.
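For anyone who wants to play with the idea, here is a minimal sketch of a naive Bayes tweet scorer in Python. The training tweets, the "good"/"bad" labels, and the names `NaiveBayes`, `tokenize`, and `log_odds` are all made up for illustration; add-one smoothing is an assumption, not something anyone in the thread specified.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Two-class naive Bayes over word counts with add-one smoothing."""

    def __init__(self):
        self.word_counts = {"good": Counter(), "bad": Counter()}
        self.doc_counts = Counter()

    def train(self, text, label):
        self.doc_counts[label] += 1
        self.word_counts[label].update(tokenize(text))

    def log_odds(self, text):
        """Log odds that the text is 'good' rather than 'bad'."""
        vocab = set(self.word_counts["good"]) | set(self.word_counts["bad"])
        v = len(vocab)
        total_good = sum(self.word_counts["good"].values())
        total_bad = sum(self.word_counts["bad"].values())
        # Start from the prior odds, then add one term per word.
        score = math.log(self.doc_counts["good"] / self.doc_counts["bad"])
        for w in tokenize(text):
            p_good = (self.word_counts["good"][w] + 1) / (total_good + v)
            p_bad = (self.word_counts["bad"][w] + 1) / (total_bad + v)
            score += math.log(p_good / p_bad)
        return score


# Hypothetical training data, invented for this sketch.
nb = NaiveBayes()
nb.train("hylomorphism and catamorphism in haskell", "good")
nb.train("a neat proof about monads", "good")
nb.train("omglol best breakfast ever", "bad")
nb.train("omglol beer time", "bad")

print(nb.log_odds("reading about hylomorphism"))  # positive means "probably interesting"
print(nb.log_odds("omglol breakfast"))            # negative means "probably noise"
```

A positive score only says the tweet looks more like the "good" examples than the "bad" ones; a real filter would need far more training data than four tweets.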


I've looked into this a bit, albeit more in a spam-filtering context; tweets have very little text for naive Bayes to latch onto. 140 characters would be 20-30 words, tops. That is so few words that it is hard to move the prior very much, unless there are blockbuster words that almost always indicate a bad tweet; as the article suggested, "breakfast", "beer", etc.
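To put rough numbers on that, here is a small sketch of the arithmetic. The likelihood ratios are invented for illustration (they are not measured from any real tweets), and `posterior` is a hypothetical helper name. It compares twenty weakly informative words against a single "blockbuster" word.

```python
import math


def posterior(prior, likelihood_ratios):
    """Update a prior probability with one likelihood ratio per word.

    Each ratio is P(word | good) / P(word | bad); the naive Bayes
    independence assumption lets us simply add their logs.
    """
    log_odds = math.log(prior / (1 - prior))
    for r in likelihood_ratios:
        log_odds += math.log(r)
    return 1 / (1 + math.exp(-log_odds))


# Assumed values: 20 ordinary words, each only slightly informative,
# half leaning "good" (1.1) and half leaning "bad" (0.9).
weak_words = [1.1, 0.9] * 10
p_weak = posterior(0.5, weak_words)

# Assumed value: one word that is 20x more likely in bad tweets.
p_blockbuster = posterior(0.5, [0.05])

print(p_weak)         # stays close to the 0.5 prior
print(p_blockbuster)  # drops well below the prior
```

With these assumed ratios, a whole tweet of ordinary words barely shifts the estimate, while one strongly indicative word does nearly all the work.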


Here's a start. It's a simple Ruby app that runs on Heroku and proxies your Twitter account, dropping uninteresting things. Extremely raw code.

http://github.com/fizx/sometweets

http://sometweets.heroku.com/


Filter by account, then, instead of by post? With a sample body of, say, the ten or twenty of an account's most recent posts.


The article's whole premise was that tweet quality does not correlate well within an account; e.g., some marvelous Twitter streams include breakfast tweets.


As I said, I'm fine with a filter that has poor accuracy, as long as it makes Twitter marginally better.


Don't many classifiers pick the n most interesting words and use only those to decide? Where n is something like 15.


They usually have a lot more than 15 words to choose from, though.
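The top-n scheme being discussed can be sketched roughly as follows, in the style of common spam filters. Everything specific here is an assumption for illustration: the per-word probabilities in `word_probs` are invented, 0.4 as the score for unseen words is just one conventional choice, and "interesting" is taken to mean "furthest from 0.5".

```python
def most_interesting(word_probs, words, n=15, default=0.4):
    """Return the probabilities of the n words furthest from neutral (0.5).

    word_probs maps a word to an estimated P(tweet is bad | word).
    Unseen words get the assumed default score.
    """
    probs = [word_probs.get(w, default) for w in set(words)]
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    return probs[:n]


def combine(probs):
    """Combine per-word probabilities, treating the words as independent."""
    prod_p = 1.0
    prod_q = 1.0
    for p in probs:
        prod_p *= p
        prod_q *= 1 - p
    return prod_p / (prod_p + prod_q)


# Hypothetical per-word estimates, invented for this sketch.
word_probs = {"breakfast": 0.95, "beer": 0.9, "hylomorphism": 0.05, "the": 0.5}

tweet = "had beer with breakfast".split()
picked = most_interesting(word_probs, tweet, n=15)
print(len(picked))      # only as many as the tweet has distinct words
print(combine(picked))  # estimated probability the tweet is bad
```

Note that with n = 15 a short tweet rarely has enough distinct words for the cutoff to matter, so the selection step only starts doing real work on longer texts.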


Yeah, but I'm not sure it would make that much of a difference. I should throw together a classifier and see.


I'm currently solving this very problem with a much simpler algorithm and slowly letting folks into my beta over at http://slipstre.am/


I think statistical/learning classifiers could work, even with the brevity of tweets. However, as the author says, one person's gold is another's garbage. You'd need a platform/system that would allow for a personal classifier for every tweet consumer.
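A rough sketch of what "one classifier per reader" might look like, assuming each reader gives thumbs-up/thumbs-down feedback on tweets. The class `UserFilter`, the user names, and the feedback data are all hypothetical; the scoring is a simple smoothed log-ratio of word counts, not any particular product's method.

```python
import math
from collections import Counter, defaultdict


class UserFilter:
    """Per-reader word statistics learned from that reader's own feedback."""

    def __init__(self):
        self.liked = Counter()
        self.disliked = Counter()

    def feedback(self, tweet, liked):
        target = self.liked if liked else self.disliked
        target.update(tweet.lower().split())

    def score(self, tweet):
        """Positive if the tweet resembles what this reader liked."""
        total_l = sum(self.liked.values())
        total_d = sum(self.disliked.values())
        # "or 1" avoids dividing by zero for a reader with no feedback yet.
        v = len(set(self.liked) | set(self.disliked)) or 1
        s = 0.0
        for w in tweet.lower().split():
            p_like = (self.liked[w] + 1) / (total_l + v)
            p_dislike = (self.disliked[w] + 1) / (total_d + v)
            s += math.log(p_like / p_dislike)
        return s


# One independent filter per reader, created on first use.
filters = defaultdict(UserFilter)

# Hypothetical feedback: the two readers have opposite tastes.
filters["alice"].feedback("beer for breakfast", liked=True)
filters["alice"].feedback("hylomorphism explained", liked=False)
filters["bob"].feedback("beer for breakfast", liked=False)
filters["bob"].feedback("hylomorphism explained", liked=True)

print(filters["alice"].score("breakfast beer"))  # positive for alice
print(filters["bob"].score("breakfast beer"))    # negative for bob
```

The same tweet gets opposite scores for the two readers, which is the whole point: nothing is shared between filters, so one person's gold never counts against another's.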



