
Look at your traffic logs and see if you can't fingerprint the scraper. Should be relatively easy since they're mirroring your entire site.

Then instead of blocking the fingerprint, poison the data. Introduce errors that are hard to detect. Maybe corrupt the URLs, or use the incorrect description or category. Be creative, but make it kind of shit.

It's easy to work around blocks. Working around poisoned data is much harder.
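A minimal sketch of the idea (Python/Flask), where looks_like_scraper() is a hypothetical stand-in for whatever fingerprint you pull out of the logs, and the poisoning rules are purely illustrative:

    # Minimal sketch (Flask). looks_like_scraper() is a placeholder for the
    # fingerprint you observed in your logs; the poisoning rules below are
    # illustrative, not a drop-in implementation.
    import random
    from flask import Flask, request

    app = Flask(__name__)

    def looks_like_scraper(req) -> bool:
        # Placeholder check: match the fingerprint you actually observed,
        # e.g. a fixed header order, an odd Accept value, or a known network range.
        return req.headers.get("Accept") == "*/*" and "Referer" not in req.headers

    def poison(item: dict) -> dict:
        # Serve the page normally, but with subtle, hard-to-detect errors.
        bad = dict(item)
        bad["url"] = bad["url"].replace("/item/", "/itm/")            # corrupt the URL
        bad["category"] = random.choice(["misc", "other", bad["category"]])
        return bad

    @app.route("/api/items/<int:item_id>")
    def get_item(item_id):
        item = {"url": f"/item/{item_id}", "title": f"Item {item_id}", "category": "tools"}
        return poison(item) if looks_like_scraper(request) else item

The scraper keeps getting 200s and plausible-looking data, so there's nothing obvious to work around.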



This... there are definitely aspects of the proxy that they aren't configuring or are unaware of.

e.g. ssl_cipher, http_x_requested_with, http_accept... the order of all headers supplied... the casing of all headers supplied... the TLS ClientHello.

It is relatively easy, if you have enough signals, to build a fingerprint whose workings they won't be able to figure out. Yet it will be effective at blocking them regardless of the IP.

Once you add enough of these together it will be hard for them to get around it without being obvious as they do so.

If you want to be super aggressive... those same fingerprints will reveal legit browser traffic and the fingerprints for things like Googlebot... so you could go towards a whitelist rather than a blocklist. But this is something you'd have to actively manage, as new variations arise constantly.
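A minimal sketch of folding several such signals into one fingerprint and checking it against a managed whitelist (Python). It assumes your proxy passes through the TLS cipher and the raw header list in original order and casing; how you surface those is deployment-specific, and the signal list here is just illustrative:

    # Minimal sketch: fold several request signals into one fingerprint and
    # check it against a whitelist of known-good values (real browsers,
    # Googlebot, etc.). KNOWN_GOOD is populated from observed legit traffic.
    import hashlib

    KNOWN_GOOD: dict[str, str] = {
        # fingerprint hash -> label, e.g. "3f5a...": "Chrome 120 / Windows"
    }

    def fingerprint(raw_headers: list[tuple[str, str]], tls_cipher: str) -> str:
        lower = {name.lower(): value for name, value in raw_headers}
        signals = [
            tls_cipher,                                 # e.g. nginx's $ssl_cipher
            "|".join(name for name, _ in raw_headers),  # header order and casing
            lower.get("accept", ""),
            lower.get("x-requested-with", ""),
        ]
        return hashlib.sha256("\n".join(signals).encode()).hexdigest()

    def allowed(raw_headers: list[tuple[str, str]], tls_cipher: str) -> bool:
        return fingerprint(raw_headers, tls_cipher) in KNOWN_GOOD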


This is some really cool anti-scraping inside baseball. Is it safe to say that Cloudflare uses these techniques for weeding out bots?


It's safe to say that if you have enough signals from every possible layer (of which the above are barely a few), it becomes trivial to build a model that can identify the majority of bots.

However, you're then left with the really hard problem of bots that drive real browsers. But hey, you've gone a long way before having to actually look at traffic patterns, and in the meantime you've significantly raised the costs for those operating the bots.

It's also worth noting that if you really do have enough signals, bot writers cannot control them all. Anyone can rewrite an HTTP header, but can you pick the right HTTP headers in the right order with the right TLS cipher and ClientHello to appear to be the same as Chrome on Windows? Good luck.
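For example, a sketch of that cross-layer consistency check (Python): compare the TLS-level fingerprint, e.g. a JA3-style hash computed at the load balancer, against the values the claimed User-Agent family actually produces. The hash values below are placeholders, not real Chrome or Firefox values:

    # Minimal sketch of the cross-layer consistency check. The TLS fingerprint
    # (a JA3-style hash) is assumed to be computed elsewhere and passed in;
    # the hash values below are placeholders, not real browser values.
    EXPECTED_TLS_FINGERPRINTS = {
        "chrome-windows": {"placeholder_ja3_hash_1", "placeholder_ja3_hash_2"},
        "firefox-windows": {"placeholder_ja3_hash_3"},
    }

    def ua_family(user_agent: str):
        if "Windows" in user_agent and "Firefox" in user_agent:
            return "firefox-windows"
        if "Windows" in user_agent and "Chrome" in user_agent and "Edg" not in user_agent:
            return "chrome-windows"
        return None

    def consistent(user_agent: str, tls_fingerprint: str) -> bool:
        # A client claiming to be Chrome on Windows but presenting a ClientHello
        # that no real Chrome build produces is almost certainly a bot.
        family = ua_family(user_agent)
        if family is None:
            return True  # unknown UA: defer to other signals
        return tls_fingerprint in EXPECTED_TLS_FINGERPRINTS[family]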



