I can't believe nobody has mentioned naive Bayesian text classification yet. It ...

kmavm · on May 17, 2010

I've looked into this a bit, albeit more in a spam-filtering context; tweets have very little text for naive Bayes to latch onto. 140 characters is 20-30 words, tops. That is so few words that it is hard to move the posterior far from the prior, unless there are blockbuster words that almost always indicate a bad tweet, such as the ones the article suggested ("breakfast", "beer", etc.).
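
To make the "hard to move the prior" point concrete, here is a rough sketch of the naive Bayes update in log-odds form. The likelihood ratios are invented for illustration (they are not measured from any tweets), and `posterior_bad` is a hypothetical helper name.

```python
import math

def posterior_bad(prior_bad, likelihood_ratios):
    """Return P(bad | words) given a prior and per-word ratios P(w|bad) / P(w|good).

    Naive Bayes treats words as independent, so each word simply adds
    log(ratio) to the prior log-odds.
    """
    log_odds = math.log(prior_bad / (1.0 - prior_bad))
    for ratio in likelihood_ratios:
        log_odds += math.log(ratio)
    return 1.0 / (1.0 + math.exp(-log_odds))

# Assumed numbers: 20 weakly informative words, half leaning slightly
# "bad" (ratio 1.1) and half leaning slightly "good" (ratio 0.9).
weak = posterior_bad(0.5, [1.1, 0.9] * 10)

# The same 20 words plus one "blockbuster" word that is 50x more likely
# to appear in a bad tweet.
strong = posterior_bad(0.5, [1.1, 0.9] * 10 + [50.0])

print(f"weak evidence only:   {weak:.3f}")
print(f"with one strong word: {strong:.3f}")
```

With only weak words the result stays close to the 0.5 prior; a single strong word dominates the whole tweet.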

fizx · on May 17, 2010

Here's a start. It's a simple Ruby app that runs on Heroku and proxies your Twitter account, dropping uninteresting things. Extremely raw code.

http://github.com/fizx/sometweets

http://sometweets.heroku.com/

JCThoughtscream · on May 17, 2010

Filter by account, then, instead of by post? With a sample body of, say, the ten or twenty of an account's most recent posts.

kmavm · on May 17, 2010

The article's whole premise was that tweet quality does not correlate well within an account; e.g., some marvelous twitter streams include breakfast tweets.

sketerpot · on May 17, 2010

As I said, I'm fine with a filter that has poor accuracy, as long as it makes twitter marginally better.

jshen · on May 17, 2010

Don't many classifiers pick the n most interesting words and use only those to decide, where n is something like 15?
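
For reference, a minimal sketch of that "n most interesting words" idea, roughly in the style of Paul Graham's "A Plan for Spam": rank tokens by how far their probability is from a neutral 0.5, keep the top n, and combine only those. The word probabilities below are made up, and the function names and the 0.4 default for unseen words are assumptions for illustration.

```python
def most_interesting(tokens, word_probs, n=15, unknown=0.4):
    """Pick the n distinct tokens whose P(bad | token) is farthest from 0.5."""
    unique = set(tokens)
    ranked = sorted(
        unique,
        key=lambda t: abs(word_probs.get(t, unknown) - 0.5),
        reverse=True,
    )
    return ranked[:n]

def combine(probs):
    """Combine per-token probabilities, assuming independence and equal priors."""
    prod_p = 1.0
    prod_q = 1.0
    for p in probs:
        prod_p *= p
        prod_q *= 1.0 - p
    return prod_p / (prod_p + prod_q)

# Hypothetical per-word probabilities that a tweet containing the word is bad.
word_probs = {"breakfast": 0.95, "beer": 0.90, "paper": 0.10, "the": 0.50}

tweet = "had beer with breakfast again".split()
top = most_interesting(tweet, word_probs, n=2)
score = combine([word_probs.get(t, 0.4) for t in top])

print(top, round(score, 3))
```

A tweet only has 20-30 tokens to rank, so with n = 15 most of the tweet gets used anyway; the cutoff matters far more for long emails.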

sketerpot · on May 17, 2010

They usually have a lot more than 15 words to choose from, though.

jshen · on May 18, 2010

Yeah, but I'm not sure it would make that much of a difference. I should throw together a classifier and see.

avk · on May 17, 2010

I'm currently solving this very problem with a much simpler algorithm and slowly letting folks into my beta over at http://slipstre.am/

retube · on May 17, 2010

I think statistical/learning classifiers could work, even with the brevity of tweets. However, as the author says, one person's gold is another's garbage. You'd need a platform or system that allows a personal classifier for every tweet consumer.
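
A rough sketch of what per-consumer classifiers might look like: one small naive Bayes model per reader, each trained only on that reader's own good/bad votes. The class name, user names, and training tweets are all hypothetical, and this is a toy with add-one smoothing, not a tuned filter.

```python
import math
from collections import Counter, defaultdict

class TinyNaiveBayes:
    """Two-class ("good"/"bad") naive Bayes over whitespace-separated words."""

    def __init__(self):
        self.word_counts = {"good": Counter(), "bad": Counter()}
        self.doc_counts = Counter()

    def train(self, text, label):
        self.doc_counts[label] += 1
        self.word_counts[label].update(text.lower().split())

    def score_bad(self, text):
        """Return the log-odds that text is bad; positive means 'bad'."""
        vocab = set(self.word_counts["good"]) | set(self.word_counts["bad"])
        v = len(vocab) or 1
        total_bad = sum(self.word_counts["bad"].values())
        total_good = sum(self.word_counts["good"].values())
        # Add-one smoothed prior from the number of tweets seen per label.
        log_odds = math.log(
            (self.doc_counts["bad"] + 1) / (self.doc_counts["good"] + 1)
        )
        for word in text.lower().split():
            p_bad = (self.word_counts["bad"][word] + 1) / (total_bad + v)
            p_good = (self.word_counts["good"][word] + 1) / (total_good + v)
            log_odds += math.log(p_bad / p_good)
        return log_odds

# One independent classifier per reader, created on first use.
classifiers = defaultdict(TinyNaiveBayes)

classifiers["alice"].train("great beer at the pub", "good")
classifiers["alice"].train("new compiler release notes", "bad")
classifiers["bob"].train("great beer at the pub", "bad")
classifiers["bob"].train("new compiler release notes", "good")

tweet = "beer tonight"
alice_score = classifiers["alice"].score_bad(tweet)
bob_score = classifiers["bob"].score_bad(tweet)
print(f"alice: {alice_score:+.2f}  bob: {bob_score:+.2f}")
```

The same tweet comes out with opposite signs for the two readers, which is the "one person's gold" point: the model has to be keyed by consumer, not by tweet author.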