I deployed this, instead of my usual honeypot script. It's not working very well...

MoonGhost · 2025-04-29T21:19:25 1745961565

Try content labyrinth. I.e. infinitely generated content with a bunch of references to other generated pages. It may help against simple wget and till bots adapt.

PS: I'm on the bots side, but don't mind helping.

palijer · 2025-04-30T00:27:27 1745972847

This doesn't work if you pay bandwidth and CPU usage for your servers though.

Twirrim · 2025-04-30T05:10:19 1745989819

The labyrinth doesn't have to be fast, and things like iocaine (https://iocaine.madhouse-project.org/) don't use much CPU if you don't go and give them something like the Complete Works of Ahakespeare as input (Mine is using Moby Dick), and can easily be constrained with cgroups if you're concerned about resource usage.

I've noticed that LLM scrapers tend to be incredibly patient. They'll wait for minutes for even small amounts of text.

MoonGhost · 2025-04-30T01:57:34 1745978254

That will be your contribution. If others join scrapping will become very pricey. Till bots become smarter. But then they will not download much of generated crap. Which makes it cheaper for you.

Anyway, from bots perspective labyrinths aren't the main problem. Internet is being flooded with quality LLM-generated content.

bugfix · 2025-04-30T01:00:21 1745974821

Wouldn't this just waste your own bandwidth/resources?

gwd · 2025-04-30T12:06:20 1746014780

Kinda wonder if a "content labyrinth" could be used to influence the ideas / attitudes of bots -- fill it with content pro/anti Communism, or Capitalism, or whatever your thing is, hope it tips the resulting LLM towards your ideas.

arctek · 2025-04-30T00:57:33 1745974653

Perhaps need to semi-randomize the file size? I'm guessing some of the bots have a hard limit to the size of the resource they will download.

Many of these are annoying LLM training/scraping bots (in my case anyway). So while it might not crash them if you spit out a 800KB zipbomb, at least it will waste computing resources on their end.

unnouinceput · 2025-04-29T21:30:32 1745962232

Do they comeback? If so then they detect it and avoid it. If not then they crashed and mission accomplished.

kazinator · 2025-04-29T22:32:00 1745965920

I currently cannot tell without making a little configuration change, because as soon as an IP address is logged as having visited the trap URL (honeypot, or zipbomb or whatever), a log monitoring script bans that client.

Secondly, I know that most of these bots do not come back. The attacks do not reuse addresses against the same server in order to evade almost any conceivable filter rule that is predicated on a prior visit.

jpsouth · 2025-04-30T09:46:46 1746006406

I may be asking a really silly question here, but

> as soon as an IP address is logged as having visited the trap URL (honeypot, or zipbomb or whatever), a log monitoring script bans that client.

Is this not why they aren’t getting the full file?

kazinator · 2025-04-30T13:37:27 1746020247

I believe Apache is logging complete requests. For instance, in the case of clients sent to a honeypot, I see a log entry appear when I pick a honeypot script from the process listing and kill it. That could be hours after the client connected. The timestamps logged are connection time not completion time. E.g. here is a pair of consecutive logs:

  124.243.178.242 - - [29/Apr/2025:00:16:52 -0700] "GET /cgit/[...]
  94.74.94.113 - - [29/Apr/2025:00:07:01 -0700] "GET /honeypot/[...]

Notice the second timestamp is almost ten minutes earlier.