tl;dr the crawlers do not respect robots.txt or the user agent anymore, but you ...

dartos · on Oct 31, 2024

Should we webmasters just start blocking user agents wholesale?

I mean except known good actors.

I guess known actors would need a verifiable signature.

rty32 · on Oct 31, 2024

Not viable. They are going to use user agents that look like those coming from completely normal human users.

"Verifiable signature"? That's a dangerous road to go down, and Google actually wanted to do it (Web Environment Integrity API). Nobody supported them and they backed out.

jsheard · on Oct 31, 2024

Search engine crawlers do have verifiable signatures: if a client claims to be Googlebot or Bingbot, you don't have to take its word for it.

https://developers.google.com/search/docs/crawling-indexing/...

https://www.bing.com/webmasters/help/how-to-verify-bingbot-3...
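For anyone wondering what that verification looks like in practice: as far as I know, both pages describe a reverse DNS lookup on the client IP followed by a forward lookup that must resolve back to the same IP. Below is a rough sketch of that check, not an official implementation. The function name, the injectable lookup parameters, and the suffix list are my own assumptions; check the linked pages for the current hostnames. It only handles IPv4.

```python
import socket

# Hostname suffixes assumed for Googlebot and Bingbot; verify against the
# linked documentation before relying on this list.
TRUSTED_SUFFIXES = (".googlebot.com", ".google.com", ".search.msn.com")


def is_verified_crawler(ip, reverse_lookup=None, forward_lookup=None):
    """Return True if ip passes a reverse-then-forward DNS check.

    The lookup callables are hypothetical hooks so the logic can be
    exercised without network access; by default they use the socket module.
    """
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        host = reverse_lookup(ip).rstrip(".").lower()
        if not host.endswith(TRUSTED_SUFFIXES):
            return False
        # The hostname must resolve back to the same IP; otherwise anyone
        # who controls their own PTR record could claim a trusted name.
        return ip in forward_lookup(host)
    except OSError:
        # Covers socket.herror and socket.gaierror (no PTR record, etc.).
        return False
```

A request whose user agent says Googlebot but whose IP fails this check can be treated like any other unverified client.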

yazzku · on Oct 31, 2024

But the converse is not true? There is no guarantee the crawler is not amassing data for model training, or that a crawler (AI or otherwise) does not disguise itself as a normal user?

jsheard · on Oct 31, 2024

Yeah, but traffic appearing to come from normal users can be throttled and/or CAPTCHA'ed while still allowing Google and Bing to crawl to their hearts' content, so your SEO isn't affected.

SoftTalker · on Oct 31, 2024

I would think rate-limiting would be good. Crawlers are not patient enough to operate at the speed of a real human user.
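One common way to do this kind of rate limiting is a per-client token bucket. Here is a minimal in-memory sketch, assuming clients are keyed by something like IP address; the class name, parameters, and injectable clock are illustrative choices, not taken from any particular server or library.

```python
import time


class TokenBucket:
    """Per-client token bucket: `rate` tokens per second, up to `burst`."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = rate
        self.burst = burst
        self.clock = clock  # injectable so tests can control time
        self.buckets = {}   # client key -> (tokens, last_seen)

    def allow(self, key):
        now = self.clock()
        # A client seen for the first time starts with a full bucket.
        tokens, last = self.buckets.get(key, (self.burst, now))
        # Refill in proportion to elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self.buckets[key] = (tokens - 1, now)
            return True
        self.buckets[key] = (tokens, now)
        return False
```

With something like `TokenBucket(rate=1.0, burst=5)`, a human clicking through pages rarely runs dry, while a client firing dozens of requests per second gets refused (typically with a 429) after the first few. A real deployment would also need to evict idle keys and share state across workers.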

readyplayernull · on Oct 31, 2024

Greedy crawlers will use fake user-agent strings.

Narhem · on Oct 31, 2024

It’s relatively simple to detect crawlers; writing a detector from scratch could take a few weeks if the infrastructure were in place.

Given salaries, though, finding an externally managed solution might be cheaper.

andrethegiant · on Oct 31, 2024

[Shameless plug] I'm building a platform[1] that abides by robots.txt, the crawl-delay directive, 429s, the Retry-After response header, etc. out of the box. Polite crawling behavior as a default + centralized caching would decongest the network and be better for website owners.

[1] https://crawlspace.dev
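To make "polite crawling" concrete, here is a small sketch of the decision a well-behaved crawler makes before each request, using Python's standard `urllib.robotparser`. This is not how the platform above works internally (I have no knowledge of that); the function names, the `examplebot` agent, and the one-second default are assumptions for illustration. It only understands the seconds form of Retry-After, not the HTTP-date form.

```python
import urllib.robotparser


def load_rules(robots_txt):
    """Parse robots.txt content that has already been downloaded."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def wait_before_fetch(parser, agent, url, last_status=200,
                      retry_after=None, default_delay=1.0):
    """Return seconds to wait before fetching url, or None if disallowed."""
    if not parser.can_fetch(agent, url):
        return None
    # Honor Crawl-delay when the site sets one, else use a default pause.
    delay = parser.crawl_delay(agent) or default_delay
    # After a 429, back off for at least as long as the server asked.
    if last_status == 429 and retry_after is not None:
        try:
            delay = max(delay, float(retry_after))
        except ValueError:
            pass  # HTTP-date form of Retry-After is not handled here
    return delay


rules = load_rules("User-agent: *\nCrawl-delay: 5\nDisallow: /private/\n")
print(wait_before_fetch(rules, "examplebot", "https://example.com/private/x"))
print(wait_before_fetch(rules, "examplebot", "https://example.com/page"))
```

The first call reports the URL as disallowed and the second returns the site's crawl delay. None of this is enforceable from the server side, which is the point of the thread: it only helps when the crawler chooses to cooperate.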