- "bait URLs" (honeypot links hidden from real users and disallowed in robots.txt) that well-behaved crawlers won't touch
- trigger on request volume, after filtering out legitimate crawlers
- find something unique about the headers in their requests that lets you identify them
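The bait-URL idea can be sketched as a tiny bit of server-side logic. This is a minimal illustration, not a real framework handler: the paths, IPs, and `handle_request` function are all made up for the example. The key point is that only a client ignoring robots.txt (and invisible-to-humans links) will ever request a bait path, so hitting one is grounds to flag it.

```python
# Hypothetical bait paths: linked invisibly in page HTML and disallowed in
# robots.txt, so legitimate crawlers and real users never request them.
BAIT_PATHS = {"/internal/archive-1998.html", "/tmp/listing.cgi"}

banned_ips = set()

def handle_request(ip, path):
    """Return an HTTP status code for a request, flagging clients that take the bait."""
    if path in BAIT_PATHS:
        banned_ips.add(ip)  # this client ignored robots.txt: remember it
    if ip in banned_ips:
        return 403          # or, per the suggestion below, serve decoy content instead
    return 200

# A scraper that hits a bait URL gets flagged for all later requests:
print(handle_request("10.0.0.7", "/internal/archive-1998.html"))  # 403
print(handle_request("10.0.0.7", "/products/1"))                  # 403
print(handle_request("10.0.0.8", "/products/1"))                  # 200
```

In a real deployment you would key on something sturdier than the raw IP (scrapers rotate them), but the trap mechanism is the same.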
One additional suggestion is to not block them, but rather to serve them different content. Keep a big pool of fake pages and return one at random. If they get a 200 OK and some content, they're less likely to check that anything is wrong.
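A minimal sketch of that decoy-pool idea, with made-up page bodies and a hypothetical `serve` function: a flagged scraper always gets a normal 200 response, just with a randomly chosen fake page instead of the real one.

```python
import random

# Hypothetical pool of plausible-looking decoy pages.
FAKE_PAGES = [
    "<html><body><h1>Widget A</h1><p>$19.99</p></body></html>",
    "<html><body><h1>Widget B</h1><p>$24.99</p></body></html>",
    "<html><body><h1>Widget C</h1><p>$14.99</p></body></html>",
]

def serve(is_scraper, real_page):
    """Return (status, body): a random decoy for scrapers, the real page otherwise."""
    if is_scraper:
        return 200, random.choice(FAKE_PAGES)
    return 200, real_page

status, body = serve(True, "<html>the real page</html>")
print(status)  # 200 either way, so the scraper sees nothing obviously amiss
```

Generating the fake pages from templates with randomized product names and prices would make the pool harder to fingerprint than a small fixed list like this one.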
Another idea is to serve them something you can then report to Google as some type of violation, or something (think SafeSearch) that gets their site filtered.