
tl;dr the crawlers do not respect robots.txt or the user agent anymore, but you can drop big bucks on the enterprise HA offering to stop them through other means.


Should we webmasters just start blocking user agents wholesale?

I mean except known good actors.

I guess known actors would need a verifiable signature


Not viable. They are going to use user agents that look like those coming from completely normal human users.

"Verifiable signature"? That's a dangerous road to go down, and Google actually wanted to do it (the Web Environment Integrity API). Nobody supported them and they backed out.


Search engine crawlers do have verifiable signatures, if a client claims to be Googlebot or Bingbot you don't have to take their word for it.

https://developers.google.com/search/docs/crawling-indexing/...

https://www.bing.com/webmasters/help/how-to-verify-bingbot-3...
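For anyone who wants to try this, here is a minimal sketch of the reverse-then-forward DNS check that both pages describe. The function name, the injectable lookup parameters, and the exact suffix list are my own assumptions for illustration; confirm the current hostnames against the docs linked above before relying on it. Note that `socket.gethostbyname_ex` only returns IPv4 addresses.

```python
import socket

# Assumed hostname suffixes; verify against the search engines' own docs.
TRUSTED_SUFFIXES = (".googlebot.com", ".google.com", ".search.msn.com")


def is_verified_crawler(ip, reverse_lookup=None, forward_lookup=None):
    """Return True if `ip` reverse-resolves to a trusted crawler hostname
    and that hostname forward-resolves back to the same IP."""
    # The lookups are injectable so the logic can be tested without DNS.
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])

    try:
        host = reverse_lookup(ip)
    except OSError:  # covers socket.herror and socket.gaierror
        return False

    # The leading dot in each suffix rejects lookalikes such as "fakegooglebot.com".
    if not host.lower().rstrip(".").endswith(TRUSTED_SUFFIXES):
        return False

    try:
        # Forward confirmation: anyone can set a PTR record for their own IP,
        # so the hostname must resolve back to the address we started with.
        return ip in forward_lookup(host)
    except OSError:
        return False
```

In practice you would cache the result per IP, since doing two DNS lookups on every request is expensive.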


But the converse is not true? There is no guarantee the crawler is not amassing data for model training, or that a crawler (AI or otherwise) does not disguise itself as a normal user?


Yeah, but traffic appearing to come from normal users can be throttled and/or CAPTCHA'ed while still allowing Google and Bing to crawl to their heart's content so your SEO isn't affected.


I would think rate-limiting would be good. Crawlers are not patient enough to operate at the speed of a real human user.
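As a rough illustration of what per-client rate limiting could look like, here is a small token-bucket sketch. The class name, the `rate` and `burst` parameters, and keying by client IP are assumptions of mine, not something from a particular product; a real deployment would also need eviction of idle keys and shared state across servers.

```python
import time


class TokenBucket:
    """Per-client token bucket: `rate` tokens refill per second, up to `burst`."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = rate
        self.burst = burst
        self.clock = clock  # injectable so the limiter can be tested
        self.buckets = {}   # client key -> (tokens, time of last request)

    def allow(self, key):
        now = self.clock()
        tokens, last = self.buckets.get(key, (self.burst, now))
        # Refill for the time elapsed since the last request, capped at burst.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self.buckets[key] = (tokens - 1, now)
            return True
        self.buckets[key] = (tokens, now)
        return False


# Hypothetical usage: about one request per second, bursts of up to five.
limiter = TokenBucket(rate=1.0, burst=5)
if not limiter.allow("203.0.113.7"):
    pass  # respond with 429 and a Retry-After header
```

A human clicking through pages rarely exhausts the burst, while a crawler fetching as fast as it can hits the limit almost immediately.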


Greedy crawlers will use fake user-agent strings.


It’s relatively simple to detect crawlers. Writing a detector from scratch could take a few weeks if the infrastructure was in place.

With salaries, though, finding an externally managed solution might be cheaper.


[Shameless plug] I'm building a platform[1] that abides by robots.txt, the crawl-delay directive, 429s, the Retry-After response header, etc. out of the box. Polite crawling behavior as a default + centralized caching would decongest the network and be better for website owners.

[1] https://crawlspace.dev
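To make "polite crawling" concrete, here is a minimal sketch using only the Python standard library. It is not how the platform above is implemented; the function names are hypothetical and only show the pieces a crawler needs: a robots.txt check, the crawl-delay value, and a parser for Retry-After, which may be either a number of seconds or an HTTP date.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from urllib.robotparser import RobotFileParser


def robots_policy(robots_txt, agent, url):
    """Return (allowed, crawl_delay_seconds) for `agent` fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # crawl_delay() returns None when no directive applies to this agent.
    return parser.can_fetch(agent, url), parser.crawl_delay(agent) or 0


def retry_after_seconds(value, now=None):
    """Convert a Retry-After header value into seconds to wait."""
    value = value.strip()
    if value.isdigit():  # delay-seconds form, e.g. "120"
        return float(value)
    now = now or datetime.now(timezone.utc)
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return 0.0  # unparseable header: fall back to the crawler's own backoff
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    return max(0.0, (when - now).total_seconds())
```

A fetch loop would call `robots_policy` before each request, sleep for the returned delay between requests to the same host, and on a 429 sleep for `retry_after_seconds(response_header)` before retrying.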



