Hacker News

The follow-on post explains:

> You don’t really need any bot detection: just linking to the garbage from your main website will do. Because each page links to five more garbage pages, the crawler’s queue will quickly fill up with an exponential amount of garbage until it has no time left to crawl your real site.

From: https://maurycyz.com/projects/trap_bots/
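The exponential-fill argument can be sketched with a toy simulation (my own sketch, not code from the post). Assumptions: a FIFO crawler, a real site of up to 1000 pages where each page links to 3 more real pages plus one entry into the garbage maze, and garbage pages that each link to 5 fresh garbage pages:

```python
from collections import deque

# Toy FIFO-crawl simulation. Page counts and link fan-outs are
# made-up parameters, chosen only to illustrate the queue dynamics.
REAL_LIMIT = 1000    # unique pages on the (hypothetical) real site
REAL_LINKS = 3       # new real links discovered per real page
GARBAGE_LINKS = 5    # garbage links per garbage page (as in the post)
BUDGET = 2000        # total pages the crawler will fetch

def simulate():
    queue = deque([("real", 0)])          # start at the real homepage
    next_real, next_garbage = 1, 0
    fetched = {"real": 0, "garbage": 0}
    for _ in range(BUDGET):
        kind, _ = queue.popleft()
        fetched[kind] += 1
        if kind == "real":
            # Each real page yields a few real links and one trap link.
            for _ in range(REAL_LINKS):
                if next_real < REAL_LIMIT:
                    queue.append(("real", next_real))
                    next_real += 1
            queue.append(("garbage", next_garbage))
            next_garbage += 1
        else:
            # Each garbage page yields five more garbage pages.
            for _ in range(GARBAGE_LINKS):
                queue.append(("garbage", next_garbage))
                next_garbage += 1
    return fetched

print(simulate())
```

Because the garbage branches faster than the real site, garbage fetches quickly dwarf real ones, and the crawler never finishes the real site within its budget.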



Thanks. I'd assumed real links are prioritized, so while the garbage links might fill up the queue, they'd only be visited after all real links, leaving the server load the same. But of course, not all (or even most) bots may be configured that way.

> If a link is posted somewhere, the bots will know it exists,


How would the links be prioritized? If the bots' goal is to crawl all content, would they have prioritization built in?


How would they prioritize things they haven't crawled yet?


It's not clear that they do. Web logs I've seen in other writing on this topic show them re-crawling the same pages at high rates, in addition to crawling new pages.


Actually, I've been informed otherwise: according to this person, they crawl known links first:

> Unfortunately, based on what I'm seeing in my logs, I do need the bot detection. The crawlers that visit me, have a list of URLs to crawl, they do not immediately visit newly discovered URLs, so it would take a very, very long time to fill their queue. I don't want to give them that much time.

https://lobste.rs/c/1pwq2g
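That behavior can be sketched as a two-tier frontier (my own sketch of what the comment describes, not the crawlers' actual code): a preloaded list of known URLs is worked through before anything discovered during the crawl, so freshly found trap links just sit in the back queue.

```python
from collections import deque

def crawl(known_urls, budget, discover):
    """FIFO crawl that prefers already-known URLs over new discoveries."""
    known = deque(known_urls)   # URLs the crawler arrived with
    discovered = deque()        # URLs found during this crawl
    fetched = []
    for _ in range(budget):
        if known:
            url = known.popleft()
        elif discovered:
            url = discovered.popleft()
        else:
            break
        fetched.append(url)
        discovered.extend(discover(url))
    return fetched

# Hypothetical link structure: trap pages each link to five more traps,
# real pages link to nothing new.
def discover(url):
    if url.startswith("/trap"):
        return [f"{url}/{i}" for i in range(5)]
    return []

# With 100 known real pages plus one trap entry, a budget of 50 fetches
# never even reaches the trap, let alone its exponential descendants.
fetched = crawl([f"/page{i}" for i in range(100)] + ["/trap"], 50, discover)
print(all(u.startswith("/page") for u in fetched))
```

Under this policy, the trap only starts consuming the crawler's budget once the known list is exhausted, which is why the commenter says it would take a very long time to fill the queue without bot detection.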



