
If all you want to know is what fraction of all twitter accounts are spam accounts, it should be really easy:

1. Select 1000 accounts uniformly at random. Either from among all twitter accounts, or from active twitter accounts for whatever definition of "active".

2. Classify these 1000 by hand. Do as much investigation into them as you need to classify them accurately; no need to use heuristics here.

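You will (with very high probability) get an estimate accurate to within a few percentage points: for a sample of 1000, the 95% margin of error is at most about ±3 points, and closer to ±1.4 points if the true spam fraction is near 5%. If you do the statistics you can find the actual bounds.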
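
To put rough numbers on the accuracy of a 1000-account sample, here is a minimal sketch using the standard normal-approximation margin of error and the Wilson score interval. The function names and the example count of 50 spam accounts are illustrative assumptions, not figures from any real sample.

```python
import math


def margin_of_error(p_hat, n, z=1.96):
    """Half-width of the normal-approximation interval (z=1.96 is about 95%)."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval; behaves better than the normal one for small p."""
    p_hat = successes / n
    denom = 1 + z ** 2 / n
    center = (p_hat + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half, center + half


if __name__ == "__main__":
    n = 1000
    # Worst case for the margin of error is a true fraction of 0.5.
    print(f"worst case: +/- {margin_of_error(0.5, n):.3f}")
    # Hypothetical outcome: 50 of the 1000 sampled accounts classified as spam.
    low, high = wilson_interval(50, n)
    print(f"50/1000 spam: 95% interval [{low:.3f}, {high:.3f}]")
```

The interval only covers sampling error; it assumes the hand classification in step 2 is correct and the sample in step 1 really is uniform.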



How do you get 1000 accounts at random? Does twitter have an API for it?


The stream API can sample, but then you see currently active accounts only.

Users are identified by numerical ID, so you can sample using that.
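
One way to sample by ID is rejection sampling: draw IDs uniformly from the whole ID range and keep only those that belong to a real account. A minimal sketch follows, assuming you know an upper bound on IDs and have some way to check whether an ID exists. Both `max_id` and the `exists` callback are hypothetical stand-ins here, not part of any actual Twitter API.

```python
import random


def sample_accounts(max_id, exists, k, rng=None, max_attempts=1_000_000):
    """Return k distinct existing IDs, uniform over all existing IDs.

    `exists` is a caller-supplied (hypothetical) lookup that returns True
    when an ID belongs to a real account.
    """
    rng = rng or random.Random()
    chosen = set()
    attempts = 0
    while len(chosen) < k:
        if attempts >= max_attempts:
            raise RuntimeError("too few existing IDs found; ID space may be too sparse")
        attempts += 1
        candidate = rng.randint(1, max_id)  # inclusive on both ends
        # Reject unused IDs and duplicates; every existing ID stays equally likely.
        if candidate not in chosen and exists(candidate):
            chosen.add(candidate)
    return sorted(chosen)


if __name__ == "__main__":
    # Toy stand-in for the real account table: every 7th ID is in use.
    existing = set(range(1, 1001, 7))
    print(sample_accounts(1000, existing.__contains__, 5, random.Random(0)))
```

If only a tiny fraction of the ID range is in use, most draws are rejected and each rejection costs a lookup, so this is practical only when the ID space is reasonably dense.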



