It's a long .gif file which shows that Cloudflare's website loads just fine, but HIBP is unusable. Thanks Troy and Cloudflare for making this (free) service unusable. It's free, so I shouldn't expect it to work anyway. Chrome on Linux and no VPN, fwiw.
The "Checking if site connect is secure" message pisses me off so much. They are lying straight to my face. It is about checking if they want to serve me, nothing about the site or my connection.
I believe the implementation is buggy, because I almost never see a Cloudflare page elsewhere, but I can consistently trigger one here by double-clicking the button.
This looks like an obvious misconfiguration; after passing the first quick Cloudflare check, you should have been forwarded to a page with the right results. Sounds like Troy needs to fix his API calls.
Just out of interest, have you checked if installing Cloudflare's Privacy Pass bypasses the issue?
I’ve encountered this kind of loop on a number of different sites lately, so it may be more than a single misconfiguration. (US residential IP, using Firefox on Linux / Safari on iOS with Private Access Tokens disabled on principle).
I'd just figured it was a fully intentional tarpit, and that I scored too high on some bot heuristic.
It's possible that the loop is just a tarpit, but based on the contents of the blog post I don't think this is intentional on the website's end.
I can think of several ways this could go wrong (e.g. Cloudflare tries to forward the POST request and CSRF protection or a WAF rule blocks the unexpected Cloudflare origin).
This has nothing to do with "English and American bias"; it's simply the result of most traffic coming from your country being bots, because of the number of compromised machines there.
Cloudflare doesn’t like my VPN so I get a lot of their challenges. Now whenever I see one I just close the tab.
Let’s say you run a SaaS using Cloudflare. You may be extremely happy you block 10000 bot requests for every false positive that’s a real human. But let’s say that false positive was a potential customer that would only have paid if they weren’t blocked, and now you just saved less than a penny in server costs to lose hundreds of dollars of lifetime value from that customer.
Sure, if you run a free service, use Cloudflare. Give in to centralizing the web more, supporting more censorship, and annoying the hell out of a ton of people in the process. But if you're making money, I don't see why you wouldn't have authentication tied to paying users for anything of value, or think of bot traffic as a cost of doing business.
Depends on the application. I worked on one where each new customer cost us around $8 up front due to reporting requirements (similar to background checks). Letting 10k bot requests supply stolen identities was incredibly costly.
Similarly, say an e-commerce business releases a limited edition product. Many users won't end up getting it anyway so blocking a few users is usually a much better experience than letting bots buy the product for resale later.
On the other hand, it's absolutely infuriating when blogs/search engines come up with these.
> You may be extremely happy you block 10000 bot requests for every false positive that’s a real human. But let’s say that false positive was a potential customer that would only have paid if they weren’t blocked, and now you just saved less than a penny in server costs to lose hundreds of dollars of lifetime value from that customer.
The problem is that the economics work the other way around. We are not happy blocking 10000 bot requests; we are happy blocking one bot that would make millions of requests. This means it is sometimes OK to lose a few customers who would each pay hundreds of dollars per month, if it lets us block one bot that would cost us thousands.
Bots have the bad habit of targeting the most expensive parts of the system: if there is one query that is hard, expensive, or abusable, that's the one that will be hit first.
Cloudflare is the solution to VPN companies allowing bot accounts to taint their IP addresses.
It sucks as a real person using a VPN, but having your website overwhelmed by bots sucks more than a few VPN users having trouble using your website.
If Cloudflare didn't exist, websites would probably just block VPN IPs like streaming services do.
Cost of doing business. Depending on your SaaS, you could be saving more money by blocking those bot requests than you'd make by acquiring one customer. It's a trade-off you have to deal with when you get to a certain scale.
> That's a 91% hit rate of solved challenges which is great. That remaining 9% is either humans with a false positive or... bots getting rejected
If I meet a "human check", I quickly decide whether it is worth solving, or I just close the tab. I could imagine 9% of people simply giving up. Some of these CAPTCHAs require you to find 20 fire hydrants across 3 different rounds of tiles, just to fail you anyway. We have loads of data on websites keeping users' attention [1]; this also seems to apply to CAPTCHAs.
Besides, I think it is now well known that AI is fully capable of solving CAPTCHAs.
> Besides, I think it is now well known that AI is fully capable of solving CAPTCHAs.
That's the biggest downside of modern AI, and I fear the web will only get worse because of it. If we can't figure out how to patch CAPTCHAs against bots, remote attestation will become the norm.
I have also thought about this extensively, but haven't really come to any particularly useful insights. A few ideas I have considered:
1. Embrace the bots and get each request (with response) super lightweight. Anything you can pre-compute, pre-compress, pre-cache is great. I've used this successfully for a small service that can scale significantly.
2. Make the cost of interacting with your service computationally expensive. For example, you could send off a problem to be solved, the solution to which becomes a token for one interaction. There are several problems that are computationally expensive to solve but easy to verify (a rough sketch follows after this list).
3. Make the cost of interacting with your service require sending a significant payload - the idea being that if they launch many requests from a single network, they saturate their own network. If to watch a 100MB YouTube video you had to send 1MB of random data via UDP (used to fingerprint), I suspect people abusing your service would soon find they experience dropped packets. If they struggle to send 1MB of random data, there's a good chance they would have trouble downloading 100MB of data.
4. A lot of these AIs falsify information to appear plausible. You could abuse this by asking questions, some real and some nonsensical, and brief the user to answer randomly on the nonsensical ones. For example, "How many connections are there in a tripoduplex?" Something like ChatGPT may see tokens for "tri" and "du" and output 3 or 2. There would also be a way to do this with images, i.e. "Select all of the cats in the image and press done", where they are all some weird, trippy set of images.
These are just some ideas and there are obvious flaws in some of them.
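To make idea 2 concrete, here's the rough sketch I had in mind: the server hands out a challenge, the client burns CPU finding a matching counter, and the server verifies it with a single MAC check plus one hash. Everything here (difficulty, token format, the HMAC-signed challenge) is invented for illustration, not taken from any real service.

```python
import hashlib
import hmac
import os

SERVER_SECRET = os.urandom(32)   # kept server-side only
DIFFICULTY_BITS = 20             # tune per request; higher = more client CPU

def issue_challenge() -> str:
    """Hand out a random challenge, MAC'd so the server can stay stateless."""
    nonce = os.urandom(16).hex()
    tag = hmac.new(SERVER_SECRET, nonce.encode(), hashlib.sha256).hexdigest()
    return f"{nonce}.{tag}"

def solve(challenge: str) -> int:
    """Client side: brute-force a counter until the hash has enough leading zero bits."""
    counter = 0
    while True:
        digest = hashlib.sha256(f"{challenge}:{counter}".encode()).digest()
        if int.from_bytes(digest, "big") >> (256 - DIFFICULTY_BITS) == 0:
            return counter
        counter += 1

def verify(challenge: str, counter: int) -> bool:
    """Server side: one MAC comparison plus one hash, regardless of difficulty."""
    nonce, tag = challenge.split(".")
    expected = hmac.new(SERVER_SECRET, nonce.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(tag, expected):
        return False
    digest = hashlib.sha256(f"{challenge}:{counter}".encode()).digest()
    return int.from_bytes(digest, "big") >> (256 - DIFFICULTY_BITS) == 0

challenge = issue_challenge()
assert verify(challenge, solve(challenge))  # the solved counter becomes a one-use token
```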
I'm trying to buy a car, which involves going to a bunch of dealer websites and seeing what they have in stock. Usually after checking 2-3 dealers I'll get a Cloudflare error and can't access any car dealer websites for a day or two (or I can keep going if I use a different browser, or just Incognito mode).
I guess "Turnstile" might be what I'm running into?
This is a practical solution to a very real problem. HIBP is so valuable that both the no-API case and the case where bots scrape the API reduce people's online security.
However, relying on the near-universal behaviour tracking and fingerprinting of large corporations is extremely worrying. The better Turnstile works, the more like Google's proposed Web Environment Integrity it becomes.
"Invisible" assuming you have javascript enabled and use a mainstream browser. The failure mode on these is worse than regular captchas because cloudflare won't even give you a chance to prove that you're a human, you'll just be stuck in a refresh loop.
You don't even need to do anything unorthodox. I'm using Firefox with Javascript enabled, set to block fingerprinting and cross-site cookies, and Cloudflare's bot detector regularly puts me into infinite loops. It's especially frustrating when it's the login page for a paid service doing this to me. Why on earth would I abuse your site with a bot using credentials associated with my real name and payment info?
Surely people who disable javascript are used to the majority of the web being broken for them. JS is an integral part of the web, regardless of how people feel about it.
> Surely people who disable javascript are used to the majority of the web being broken for them.
Not really. I'd say >80% of sites I visit are accessible, in that I can read the content of the page I want to look at, with javascript disabled.
Of the remainder, I can temporarily switch javascript back on for a tab with two clicks, if I want to read the contents enough, and I think the odds of that site having especially malicious javascript on it are slim. (e.g. mastodon instances)
I can also create permanent exceptions for websites in a couple of clicks too, and there are some frequently-visited sites that I have done that for.
Don't think of those who disable javascript as browsing with no javascript. Think of it as browsing with javascript disabled by default.
It doesn't matter whether you enable it or not; you need to allow Clownflare's guerrilla fingerprinting to be granted access. Which is a huge downside, unless you have access to a bunch of proxy servers and the knowledge to randomize and spoof your fingerprint accurately.
I did a lot of work on this many years ago at Google. As the article says, it can work well and be minimally invasive for users (they need to run JS but that's a much lower bar than solving complicated CAPTCHAs).
There are several services like Turnstile. I'm an advisor to Ocule [1] which is a similar thing, except it's a standalone service you can use regardless of your serving setup rather than being integrated into a monolithic CDN. It's a smaller company too so you can get the red carpet treatment from them, and they aren't so aggressive about blocking VPNs and privacy modes because their anti-bot JS challenges are strong enough to not need it. They're ex-reverse engineers so know a lot about what works and what doesn't. Their tech may be worth looking at if you're concerned about over-blocking.
The mention of Turnstile using proof of work/space is a bit puzzling/disappointing. That stuff doesn't work so well. There are much better ways to create invisible JS challenges. The core idea is to verify you're in the intended execution environment, and then obfuscate and randomize so effectively that the adversaries give up trying to emulate your code and just run it, which can (a) lead to detection and (b) is very slow and resource intensive for them even if they aren't detected. Proof of work/space doesn't prove much about the real execution environment.
BTW, the author asks what proof of space is. It's where you allocate a huge amount of RAM and then ensure the allocation was actually real by filling it with stuff and doing computations on it. The goal is to try and restrict per-machine parallelism by causing OOM if many threads are running in parallel, something end users won't do. Obviously it's also a brute force technique that can break cheaper devices.
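For illustration only, a toy version of that idea; the buffer size, iteration count and verification story are all invented, and real schemes verify far more cheaply than redoing the whole walk. The point is just that the client has to genuinely hold a large allocation in memory.

```python
import hashlib
import os

BUFFER_MB = 64  # toy size; real schemes, devices and thresholds vary wildly

def prove(seed: bytes) -> str:
    size = BUFFER_MB * 1024 * 1024
    buf = bytearray(size)
    # Fill the allocation with seed-derived data so it can't stay lazily zero-mapped.
    block = hashlib.sha256(seed).digest()
    for offset in range(0, size, 32):
        buf[offset:offset + 32] = block
        block = hashlib.sha256(block).digest()
    # Read back pseudo-randomly chosen offsets so recomputing chunks on demand
    # (instead of actually holding the buffer) would be costly.
    h = hashlib.sha256(seed)
    index = int.from_bytes(block, "big") % size
    for _ in range(10_000):
        h.update(buf[index:index + 32])
        index = int.from_bytes(h.digest()[:8], "big") % size
    return h.hexdigest()

print(prove(os.urandom(16)))
```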
I think the idea is that they ramp up the difficulty of the proof of work when a user looks suspect to their other tests; once it's ramped up to bot levels, those requests become more costly even if they're not blocked.
It hardly works. PoW can be hyper-optimized in bot code compared to browser JS, and your PoW has to be tolerable even for slow old devices, so it can't be too demanding anyway.
It also assumes a very smooth gradient of suspiciousness. When using JS challenges, though, you are often working with binary signals (at least, that's what I was able to get). So there's not much "maybe so/maybe no" about it: either you detect a bot with 100% confidence, in which case you just drop the banhammer, or you don't spot it and have to let it through.
Interesting that both the client-side and server-side APIs for Cloudflare's Turnstile seem to match Google's reCAPTCHA nearly exactly. reCAPTCHA works in pretty much the same way, with the exception that you can't configure it to _never_ show a visual CAPTCHA (in rare cases the v2 Invisible reCAPTCHA will still show the "select all the X from the images below" dialog).
Even down to the API endpoint and JS API names.
https://www.google.com/recaptcha/api/siteverify
https://challenges.cloudflare.com/turnstile/v0/siteverify
grecaptcha.render({ callback: function (token) { ... } });
turnstile.render({ callback: function (token) { ... } });
As soon as I saw the examples I recognised the names; I guess it's designed to be a drop-in replacement?
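It really is close to a drop-in swap on the server side too: both siteverify endpoints accept a form-encoded POST with `secret` and `response` fields and return JSON containing a `success` flag. A minimal sketch (no error handling, optional fields like `remoteip` omitted):

```python
import json
import urllib.parse
import urllib.request

TURNSTILE_VERIFY = "https://challenges.cloudflare.com/turnstile/v0/siteverify"
RECAPTCHA_VERIFY = "https://www.google.com/recaptcha/api/siteverify"

def verify_token(endpoint: str, secret: str, token: str) -> bool:
    # Both services accept the same two form fields and reply with JSON
    # containing a boolean `success` field.
    data = urllib.parse.urlencode({"secret": secret, "response": token}).encode()
    with urllib.request.urlopen(endpoint, data=data, timeout=5) as resp:
        return json.load(resp).get("success", False)

# Swapping providers is a one-line change:
#   verify_token(TURNSTILE_VERIFY, "your-secret-key", token_from_widget)
#   verify_token(RECAPTCHA_VERIFY, "your-secret-key", token_from_widget)
```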
I think it was intended as a replacement; it made it easier for me to give my clients the option to choose between different CAPTCHA services while I only have to maintain one implementation (with some minor quirks).
That peak is around 400 req/s, right? I would expect the usual solution to be to just rate limit each user to some reasonable level, and then 429 them if they're over.
I feel like 400 req/s should be absorbable, especially since my 5€/month VPS can handle about 2x that, sustained. I might be missing something here, but that just doesn't seem like a peak big enough to warrant degrading the user experience.
Sadly I'm often categorized by websites as "probably a bot haha get fucked", and it's lost sites hundreds or thousands of $$$ worth of revenue over the years, just from me alone.
The service is unauthenticated so there aren't any "users" to rate limit on. You could try to guess based on IP but it's trivial to get access to massive amounts of IPs (there are proxy/scraping services that do this for you)
That's 400 req/s on a distributed cloud environment. 400 requests per second on a $5 VPS is practically free, 400 requests per second on Azure Lambda (or whatever they call it) can bankrupt you.
That $5 VPS will also be null routed within days with the kinds of DDOS traffic sites like HIBP get.
Where is your 5 euro/month VPS hosted, and does it use HTTPS?
I did a non-scientific test last year, in which I ran a basic Apache instance that returned a 204 response (No Content) on a $5 instance, and HTTPS alone made it drop around 500 requests/second (once again, no other processing happening other than returning headers for a 204 response) [1]. My understanding at the time was that in general, on lower-priced virtualization options, you don't get VMs with access to the CPU instructions that speed up encryption/decryption.
Any reasonably multithreaded HTTP server with nginx in front should easily serve 500+ requests per second on something like a 5€/month Hetzner VPS (x86; you get 2x the cores if you choose Arm, but that's a new offer so I'll pretend it doesn't exist).
Only beyond that will it cause significant delay. If your service drops connections at 500+ requests per second, something is off: you can configure quite a large accept backlog, like a few thousand, and you should only see an additional few ms of delay per queued request.
If you're bottlenecked anywhere, it shouldn't drop connections until things are very, very bad.
That's assuming you need to access a DB or do other checks -- static content servers should start struggling on a 2-4 core system at around 10k requests per second, but that doesn't really matter here.
Can you share your Hetzner VPS address, do you do your own TLS termination, and can I test your endpoint with Apache Bench? I ask because theoretical numbers, in my opinion, are thrown around too often, and when I hear thousands of responses/second on budget VPSes I want to see real-life applications that actually handle that load. In my experience the difference between Apache and nginx is negligible, and if your service running on a $5 (or euro) a month VPS handles thousands of requests/second, I'm going to be more than impressed when I see those numbers myself through a test.
I was actually going to test that myself, for you, with a $5 Linode. However, I spent about an hour fighting with siege/drill/hey and finally gave up. siege crashes and locks up all the time, as well as sometimes getting stuck for a minute or two between requests. hey reported 2 req/second for https, which I don't find believable either...
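For what it's worth, when the dedicated tools misbehave you can get ballpark numbers from a few lines of Python. This is a crude sketch: the hostname, worker count and duration are placeholders, it assumes the server keeps connections alive, and numbers from something this rough are only good for an order of magnitude.

```python
import http.client
import threading
import time

HOST = "example.com"   # placeholder: point it at your own VPS, not someone else's
PATH = "/"
WORKERS = 50
DURATION = 10          # seconds

count = 0
lock = threading.Lock()

def worker(deadline: float) -> None:
    global count
    conn = http.client.HTTPSConnection(HOST, timeout=10)  # one keep-alive connection per thread
    while time.monotonic() < deadline:
        conn.request("GET", PATH)
        conn.getresponse().read()  # drain the body before reusing the connection
        with lock:
            count += 1

deadline = time.monotonic() + DURATION
threads = [threading.Thread(target=worker, args=(deadline,)) for _ in range(WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"{count / DURATION:.0f} requests/second")
```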
Sure but there's still TBs of data to check? Ideally it uses a data structure like a tree or hash that's efficient for searching through that amount of data (which might just be an index in a relational DB).
That's like saying every S3 request searches through exabytes of data. No, each request only accesses the data it needs. Yes, the dataset is large, but each request only reads a few megabytes at most.
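As a purely hypothetical sketch of why lookups stay cheap (this is not how HIBP is actually implemented), a binary search over a sorted file of fixed-width hashes touches only O(log n) records no matter how big the file gets:

```python
import hashlib

RECORD = 20  # bytes per SHA-1 digest in a hypothetical sorted, fixed-width file

def contains(path: str, email: str) -> bool:
    target = hashlib.sha1(email.strip().lower().encode()).digest()
    with open(path, "rb") as f:
        f.seek(0, 2)                      # find file size
        lo, hi = 0, f.tell() // RECORD    # number of records
        while lo < hi:                    # classic binary search over records
            mid = (lo + hi) // 2
            f.seek(mid * RECORD)
            record = f.read(RECORD)
            if record == target:
                return True
            if record < target:
                lo = mid + 1
            else:
                hi = mid
    return False
```

Even for terabytes of records that's a few dozen seeks per query, which is why "each request" never scans the whole dataset.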
"Which poses an interesting question: how do you create an API that should only be consumed asynchronously from a web page and never programmatically via a script?"
Web developers use JavaScripts to make HTTP requests to API endpoints. The data is being consumed by a script, programmatically. Unless Javascripts are neither scripts nor programs. Good luck with that argument.
There is a W3C TAG Ethical Web Principle that states web users can consume and display data from the web any way they like. They do not need to use, for example, a particular piece of software or a particular web page design.
2.12 People should be able to render web content as they want
People must be able to change web pages according to their needs. For example, people should be able to install style sheets, assistive browser extensions, and blockers of unwanted content or scripts or auto-played videos. We will build features and write specifications that respect peoples' agency, and will create user agents to represent those preferences on the web user's behalf.
With respect to the JavaScripts authored by web developers and inlined/sourced in web pages, web users have no control over them short of blocking them outright. As such, arguably they are not ideal for making HTTP requests to API endpoints. Unfortunately these scripts can be, and are, used to deny web users' agency.
He wants to provide a free service, which he is not obligated to do and costs money for him to do, for individuals to check their email. He never intended for bots to check a billion emails, and obviously doesn't want to pay for that.
That should be respected, and people failing to respect it is why we see the destruction of the open web with things like remote attestation as the only way forward.
Complaints like "but the open web" or "but my exotic browser" are honestly worth nothing against a potential solution for a real, pressing issue like spam requests and bots - if we want to keep the nice things we better start coming up with alternative solutions to the real problem, because corporations will decide the future for us if we don't. Ignoring it will not work out for us.
OK, but remote attestation isn't really a thing on the web (right now). And you can script a browser UI, anyway.
The point is to not get hung up on the client as a security boundary (it isn't, can't be, won't be), but to focus on the actual harm -- excessive use of the limited resources provided.
And you have to frame your security posture as rooted in the server side throttling, heuristics, et cetera. Flipping out because the client isn't what you expected isn't going to help (determined attackers can look like "vanilla" clients), and is just going to harm the long tail of actual users who are not bots.
Private Access Tokens are built into Safari. Apple gives their devices a certain amount of tokens, and Cloudflare validates them. If Apple doesn't like your device, you won't get any more tokens. Cloudflare also hands out tokens if you install their browser addon.
The two companies are working together to make this an official web standard, but I haven't heard about it for a while. Maybe they're just laying low after seeing the blowback on Google's (worse) attempts at attesting devices.
I agree. Our public APIs are also massively queried; the number of queries is out of all proportion to the legitimate traffic. Rate limiting does not work because of the volume of different IPs they send against you in parallel. Our servers are not designed for such peaks. What other choice do we have but to block them?
As in "valid" but false data? Please don't. If you really don't want to indicate rate limiting explicitly, then perhaps return an invalid body, or reset the connection or similar. False positives detecting humans as bots are very common, and even rate limits are often set well within human interaction limits. E.g. more than once I've triggered 429s by opening several e-commerce product pages in new tabs for me to ctrl+tab through and filter down. I also tripped a LinkedIn anti-automation system since I was looking through quite a lot of profiles on my first day to add people - luckily they handled this well, with a clear message explaining what was going on and support reaching out to me proactively (and lifting the restriction after a few hours)
> if we want to keep the nice things we better start coming up with alternative solutions to the real problem
Indeed. To state the obvious: the real problem is bad actors - no matter whether they are nation states, cybercriminals or people running compromised devices - and their accomplices, such as ISPs not responding to abuse reports.
As long as we don't get that under control (say, by threatening to cut offender countries and ISPs from the Internet and SS7 phone networks) we'll have to continue whack-a-mole'ing.
> He wants to provide a free service, which he is not obligated to do and costs money for him to do, for individuals to check their email.
Not a good choice. Why not make this a hashed DB that could be distributed freely? Why are bots meant to be excluded? This is bad UX: I'd want to check my emails periodically in the background, but this "anti-bot" measure is meant to stop me from doing so and then demand payment. So this is openly a fight.
> are honestly worth nothing against a potential solution for a real, pressing issue like spam requests and bots
That is not an issue for me at all. OK, let's face it: maybe I'm just on the other side. For me, scrapers are very useful because they introduce competition and reduce the price of accessing the data. Price comparison sites, for example, are very useful.
Surely smart bots would call the API (https://haveibeenpwned.com/API/v2) rather than try to scrape the web form. HTTP calls to a documented endpoint are much easier to script than whatever the website frontend decides to call. All you need to do is parse the Retry-After header for the 429 error code and you can pretty much query away without worrying about CAPTCHAs.
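Something along these lines (a sketch; it assumes the endpoint really does answer 429 with a Retry-After value in seconds, as described above):

```python
import time
import urllib.error
import urllib.request

def query(url: str) -> bytes:
    while True:
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            if err.code != 429:
                raise
            # Assumes Retry-After carries a number of seconds (it may also be a date).
            wait = float(err.headers.get("Retry-After", "2"))
            time.sleep(wait)
```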
If the service even provides such a header or a 429 status code at all. They could return a 418 status code for fun. In this case - without checking the form implementation - I can assume it provides a JSON response rather than a text response meant to be thrown into the DOM. Grabbing data from JSON is naturally easier, but you could use a DOMParser on text content if it's sufficiently consistent.
The other thing about "waiting" is that bots may not want to do that, or maybe some sort of deadline is sooner than such waiting would allow.
To me, requiring a unique key to be submitted with the search - one that only becomes valid after X amount of time (both provided by the initial response, with X increasing exponentially for subsequent requests) - seems like it could be sufficient. If the next request comes sooner than X, block them for Y amount of time for attempting to bypass the allowed behavior. Allow maybe 5-10 emails before the wait-based functionality kicks in, so that most non-bots would be fine. After all, we're talking about protecting an endpoint only ever meant to be used via the website by actual users, who typically are not trying to check thousands of emails super fast.
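A rough sketch of that scheme; every threshold is invented, and the per-client request count is assumed to be tracked elsewhere:

```python
import hashlib
import hmac
import os
import time

SECRET = os.urandom(32)
FREE_SEARCHES = 5     # invented threshold
BASE_DELAY = 2.0      # seconds, doubled for every request past the free ones

def issue_key(client_id: str, count: int) -> tuple[str, float]:
    """Return (key, delay): the key only becomes valid once the delay has elapsed."""
    delay = 0.0 if count < FREE_SEARCHES else BASE_DELAY * 2 ** (count - FREE_SEARCHES)
    not_before = time.time() + delay
    msg = f"{client_id}:{count}:{not_before}".encode()
    mac = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return f"{count}:{not_before}:{mac}", delay

def check_key(client_id: str, key: str) -> bool:
    count, not_before, mac = key.split(":")
    msg = f"{client_id}:{count}:{not_before}".encode()
    if not hmac.compare_digest(mac, hmac.new(SECRET, msg, hashlib.sha256).hexdigest()):
        return False
    # Arriving before the key is valid is the "sooner than X" case -> reject or penalise.
    return time.time() >= float(not_before)
```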
It often costs far more than money. In this case, bots were gaining data to better target future attacks. In the common spam case, it costs real users' attention.
There definitely is an alternative solution here that preserves people's freedom to use whatever browser they want, but I'm not sure if anyone would like it.
Web services are intended for humans to use them, and all abuse also ultimately comes from humans directing computers to be abusive. Thus, rather than attaching a computer's temporary identity (an IP address) to the request, we should be attaching a human identity to it. Note here, that I don't care whether the request comes from John Smith in New York. I care about being able to ban you from the service if you are abusive, no matter how many computers you have at your disposal now or in the future.
There are lots of downsides I haven't got solutions for, such as: how do we stop sites cross-referencing to reliably discover all users' real identities (Google Analytics would love this!), and so much more. But we'll have to give up something; if we do nothing we give up everything, so maybe giving up only absolute anonymity would be preferable.
IP reputation doesn't seem to be much of a thing right now, but it might be the only way forward for a spyware-free web.
You can have a group of people sharing a block and collectively responding to abuse reports to slightly improve privacy. That at least shields you somewhat from the big tech firms.
Essentially a large number of virtual private micro-ISPs.
> IP reputation doesn't seem to be much of a thing right now
Disclaimer: an acquaintance works in the spam business - someone I tried to steer clear of, but I was fascinated by what they told me.
IP reputation is a huge thing in the spam world. They pay top dollar for residential US/UK/etc. IPs which they can then use for spamming others. And we're not talking one or two IPs; we're talking hundreds of thousands of IPs being transacted daily, globally, all for spam.
For anyone interested in seeing how low we've come, they have a huge convention in Las Vegas. One should visit it to better understand a field that has been growing like crazy while so few know about it. Apparently everyone is preparing for a big boom next year, with tens of billions of dollars ready to be deployed.
IP reputation is absolutely a thing; as I understand it, it's a large factor in Cloudflare deciding to put you in an infinite loop (or to trigger the protection at all).
I still don't get why they don't just implement some basic usage rate limiting. Go twofold: limit the rate to 2x or 3x the fastest a human could manually use the service, and only let users burst like that for a short while.
You're going to have zero horrid scraping once that's in place.
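For reference, the kind of limiter being proposed is only a few lines; the numbers here are invented, and it does nothing against attackers spreading load across many IPs:

```python
import time
from collections import defaultdict

RATE = 0.5    # sustained requests/second (~2s between manual searches, made up)
BURST = 10    # back-to-back requests allowed before throttling kicks in

buckets = defaultdict(lambda: (float(BURST), time.monotonic()))

def allow(ip: str) -> bool:
    tokens, last = buckets[ip]
    now = time.monotonic()
    tokens = min(BURST, tokens + (now - last) * RATE)  # refill since last request
    if tokens < 1.0:
        buckets[ip] = (tokens, now)
        return False                                   # over the limit -> 429
    buckets[ip] = (tokens - 1.0, now)
    return True
```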
Because it doesn't work. Malicious users can circumvent the rate limiting by using botnets. If requests were somehow tied to the identity of the human operator rather than the particular computer they used, then yes, rate limiting would be all we need.
Offer a "free tier" with a global limit of N requests / second, and allow people to get an API key through email (for free also) which bypasses this, then rate-limit each API key.
Dedicated scrapers have been using IP pools for more than a decade. Search for (residential) proxy pools and you’ll find a million vendors selling access. IP-based rate limiting stops cheap skids and screws over people behind CGNAT, that’s all.
That doesn't solve the problem with bots in games. They are not hammering the API, they are automating things that shouldn't be automated and drives away non-bot players.
I asked an online shop I frequent if they have an API, as they discount some inventory sometimes, to which they responded there is no public API. It's easy enough to find (using Inspect->Network->XHR) how they serve data, and with just five minutes of work, their raw JSON can be loaded in Python and polled for specific brands/categories/individual products, including anything flagged for a discount.
Programmatic? Yes. Against their wishes? Also, yes.
The guilt or bad feelings subsided after realizing I'd use more of their bandwidth searching the way I did before, via the browser. Using the API, I see when the last inventory update occurred, and only then do I search their discount inventory.
(In any case, if they are on EC2, I cost them 3 to 8 cents per month, estimating high, which they have certainly recouped.)
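For anyone curious, the polling loop described above amounts to something like this; the endpoint and JSON field names are invented, and the real ones come from watching Inspect->Network->XHR:

```python
import json
import time
import urllib.request

INVENTORY_URL = "https://shop.example/api/products?category=widgets"  # hypothetical
CHECK_EVERY = 3600   # hourly is far gentler than clicking around in a browser

last_update = None
while True:
    with urllib.request.urlopen(INVENTORY_URL, timeout=10) as resp:
        data = json.load(resp)
    if data.get("lastUpdated") != last_update:          # hypothetical field names below too
        last_update = data.get("lastUpdated")
        for item in data.get("products", []):
            if item.get("discounted"):
                print(item.get("name"), item.get("price"))
    time.sleep(CHECK_EVERY)
```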
They could (incorrectly) call you a "bot" but that's simply not true. You are human.
The amount of server resources you deplete is actually less than a modern graphical browser from an ad-sponsored team of software developers.
However perhaps you are not looking at some ads or submitting yourself to tracking or telemetry. Under the W3C web ethics principles you can consume the data from the web in any way you like.
It is your right to decide what HTTP requests you make. XHR/fetch in someone else's Javascript could be well-intentioned but too often it tries to take away some of that agency and transfer it to a web developer.
It's not so clear-cut. I have a GitLab job that checks for the update every hour and then emails me when a match is found, so I have technically employed a bot. It saves me time and saves them bandwidth.
For tracking, they'll have to rely on old-fashioned "order history" and "inventory period". By these metrics, I would be baffled if they decided to block my access or close my account, even if I'm bypassing Google Tags/their AB test suite/social media analytics/ads for their other services.
Ignoring their nonconsent feels impolite, but they can dry their tears with my spent money.
What about botting in games? It's something that almost everyone is against so it's clear that the general opinion is not aligned with the W3C TAG Ethical Web Principle there.
I've seen many games die due to bots. If developers just straight-up allow them all the normal players quit.
Just put a multi-second delay on _all_ requests to that API; humans will wait, but this costs bots and slows their aggregate progress.
Add HTTP headers documenting the API you want them to use, since a bot author will look at them while wondering why it's performing so badly.
The difference between human speed and computer speed is so noticeable that it can be leveraged... No high bar or complex adversarial solution needed, and the side benefit is that if a bot does persist it spreads the load.
Not sure why this was downvoted without much comment. I can see a possible problem in that the bots would just establish multiple parallel connections and wait.
This will harm the experience for the intended user and will barely affect the bots because they will just make more requests in parallel. The user is going to be much more speed sensitive than a scraper.
The bots send out half a million requests, who cares if they need to wait ten seconds per request. You're adding a total of ten seconds to hours of scraping.
I am most likely misunderstanding this, but why can't you just use browser automation to then generate the turnstile token? For example, just use Playwright/Selenium/Phantom to generate the token and then use that in an API call?
Cloudflare's code attempts to detect that browser automation is happening.
For example, a desktop computer, but clicking things without moving the mouse? Suspicious. Saying you're a phone, but with desktop fonts installed? Suspicious. And so on; the precise methods are the result of a cat-and-mouse game.
If these heuristics identify your browser as suspicious, they either show you an interactive captcha, or they just refuse your request.
Yup, some of these are pretty invasive, like opening WebSocket connections to localhost to try to find daemons running on the client's machine (I think eBay and maybe others were doing this).
And in addition to the implicit proof-of-resources of forcing attackers to run a bunch of Chrome slaves, there are also explicit PoW/PoS challenges in the code, according to the article. It's quite an old idea [1]: add a cost which is trivial for users but a significant overhead for spammers.
> the unsolved challenges are when the Turnstile widget is loaded but not solved (hopefully due to it being a bot rather than a false positive)
So people employ these measures and have no clue whom they filter out. Reminds me of the online shops that block you because you click on products too fast. Congratulations, you lost a customer to keep CPU utilization at 20%.
HIBP receives a lot of commercial support from Microsoft and Cloudflare, and in turn provides their API to organizations like government agencies and Mozilla. Half of the articles about it are "this company gave me this free/cheap thing, here's how I implemented it" and I can't feel too bad about it.
What's weird is that the current implementation is broken. Once you do get a CAPTCHA to fill out, the redirect will fail and you end up starting over, only to get a new CAPTCHA page. That's rather unfortunate.
Really interesting article and details. We are in a similar situation; I would definitely consider such an implementation, and we'll also look at alternatives to Cloudflare.
Thank you, Troy, for writing this up and sharing your experience.
For this style of abuse mitigation, I'm always surprised that HashCash [1] or similar simple, locally implemented proof-of-work mechanisms aren't more common.
This can be implemented in a way that remains transparent (albeit via JS), has little impact on 'good' users, but protects against a lot of undesirable traffic patterns. The cost can be scaled to match infra capability, and the challenge can be a combination of the request data and the time. Valid windows for that time can then be synced with cache validity, which removes the need to keep tabs on any state.
For those deeper in this space: what am I missing here that prevents this from being the norm?
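For reference, a minimal hashcash-style version of what's described above: the stamp covers the request data plus the current time window, so the server keeps no state and stamps expire together with the cache window. The bit difficulty and window length are invented, and a real deployment would also remember stamps seen within a window to stop replays.

```python
import hashlib
import time

BITS = 18       # client-cost knob (invented)
WINDOW = 300    # seconds; align with the cache TTL so stamps and cache expire together

def _ok(stamp: bytes) -> bool:
    digest = hashlib.sha256(stamp).digest()
    return int.from_bytes(digest, "big") >> (256 - BITS) == 0

def mint(resource: str) -> str:
    """Client: find a counter that makes the hash small enough for this time window."""
    window = int(time.time()) // WINDOW
    counter = 0
    while not _ok(f"{resource}:{window}:{counter}".encode()):
        counter += 1
    return f"{window}:{counter}"

def accept(resource: str, stamp: str) -> bool:
    """Server: one hash plus a freshness check, no stored state."""
    window, counter = stamp.split(":")
    current = int(time.time()) // WINDOW
    if int(window) not in (current, current - 1):   # tolerate one window of skew
        return False
    return _ok(f"{resource}:{window}:{counter}".encode())

stamp = mint("GET /search?email=foo@example.com")
assert accept("GET /search?email=foo@example.com", stamp)
```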
It turns out some of the abusers are using 'botnets' of thousands of virus-infected home PCs. So they've got thousands of CPU cores available for proof-of-work challenges, legitimate residential IP addresses, and so on.
Meanwhile, plenty of the legitimate users are using 5 year old budget android devices, so you'd better not make that challenge too hard.
Yeah, there are lots of these floating around, sometimes called "scraper services" or "residential proxies". Not sure if it's still around, but one of them enlisted machines by paying users to install a browser extension.
There was one famous free VPN service that worked like this. You install the addon, get a free VPN for a certain amount of traffic, and while your browser is open other people will be able to browse from your IP (and access your home network, of course!)
Making the browser deal with PoW challenges is only a small price to pay for what is practically a free VPN. It works great, until your entire home IP starts getting CAPTCHAs all the time, and because users don't know any better, they start blaming that darn Google/Cloudflare/Microsoft for claiming they're a bot.
I'd be a lot less concerned about more unnecessary captchas and more concerned about what kind of traffic VPN randos are piping in and out of my home ip address.
This could maybe work as a legit region blocking workaround service if the VPN only allowed connecting to popular streaming sites somehow. I don't think I'd trust it though.
That's the thing, most users didn't know they joined a botnet, they just thought they cheated the system by finding a free VPN.
I think there's some merit to the idea, especially for free VPNs, but you'd need a whitelist for things like streaming services and something to prevent abuse. Of course these companies were just interested in building out botnets, but it could be done somewhat okay if the right groups of pirates and streaming customers banded together.
The quote in the article says Turnstile does have proof-of-work (and space) challenges. But yes, I similarly wondered years ago why people weren't more aware of this idea for spam control. Instead, people now invariably associate the term with cryptocurrency.
The concept makes sense to me; it's reminiscent of HMAC, but with the additional constraint that the "secret" key is in the open. And the server-side verification is opaque, so you have to hope it works. But I like that it's not a CAPTCHA, which has made web browsing worse over the years.
I'm curious to know if there are similar open source equivalents of Turnstile where website operators have found ways of limiting an API call to just a browser, without a CAPTCHA. Does anyone know of any?
Recently I was fighting bots in a different situation. I discovered you can return code 444 in nginx, a special non-standard code that makes nginx close the connection without sending any response.
Like a Slow Loris attack, but from the server side? I like it!
I've been using a mostly-Apache setup for ages, but thinking about how it might be fun to implement something lightweight for my VPS, that includes a variety of ways to mess with those sending unwanted requests. I suppose ModSecurity could get me most of the way there without having to reinvent everything.
If you're still on iptables, you can TARPIT traffic using firewall rules that will essentially do that. nftables doesn't have tarpitting just yet, I believe.
If you want to annoy SSH brute forcing bots, endlessh is a dedicated tool for SSH connections. There are other tools for other dedicated protocols as well.
Cool, thanks! I do use fail2ban on my VPSs fairly liberally, so filling any one log with too much noise will trigger an hours-long ban for the IP.
What I liked about the application-level interference is that you can do something more subtle than a block, while still feeding them nonsense, slowly.
My second thought was using a Node.js Express reverse proxy with some kind of rate-limiting slowdown, but the attack stopped and I moved on to something else.
I don't want to victim blame here, but if your "ISP" uses CGNAT, it's not an ISP; it's a web service provider. You should probably get a real ISP.
That said, I have a real ISP, Comcast, but Comcast does MITM attacks on its users, so I have to tunnel everything through various VPSes I rent. Which of course means I get hit with the same Cloudflare blocks. And the invisible JavaScript ones just go in loops no matter how many times I complete the visible side. I just close the tabs of Cloudflare-hidden sites now. The problem is that more and more of the web is hidden behind their computational paywall... even academic journals now.
> I don't want to victim blame here, but if your "ISP" uses CGNAT, it's not an ISP; it's a web service provider. You should probably get a real ISP.
Okay, so that... that is victim blaming. If you don't want to do that, stop doing it. Besides, do you really think someone in this situation can just get a real ISP? I mean, maybe they're in a competitive market and just managed to pick a bad option, but it's unlikely.
>Besides, do you really think someone in this situation can just get a real ISP?
Yes, I know at least 2 such people who chose to have a wireless ISP despite having a wired option available and affordable. One of them is even slightly technical. They usually don't run into the limits, but when they do it might be the right time to give them a nudge to get a real ISP. I was hoping that'd be the case here.
I think it's been this way for much longer. Back in the day, IP addresses would just get blocked; fail2ban was a recommended tool for any web server. Getting flagged as a bot sucks (I've experienced it for a few days myself), but the modern CAPTCHA solutions are a lot better than the "server did not respond" days of before.
ISPs still failing to implement modern networks and sticking with broken workarounds like CGNAT are as much to blame as the bots tainting their CGNAT IP addresses.
No doubt, but at least in the UK nobody is forcing ISPs to modernise their infrastructure because competition seems to be entirely on price rather than service.
"And then, unsurprisingly in retrospect, it started to be abused so I had to put a rate limit on it. Problem is, that was a very rudimentary IP-based rate limit and it could be circumvented by someone with enough IPs, so fast forward a bit further and I put auth on the API which required a nominal payment to access it."
If one does not know how to limit based on the number of HTTP requests, then is one really qualified to set up a "public API"?
It is well-known that IP-based limits, i.e., blocklists, do not work. (While allowlists are common, e.g., academic journals, Verisign zone files, etc.)
We cannot blame the public because someone does not know how to configure a proxy to limit the number of HTTP requests per IP, e.g., in a 24-hour period.
Here, the public is penalised by being asked for credit card numbers because the site operator does not know how to count HTTP requests.
One would think people who do not want their details in a data breach that random people can download probably would not want to give their details to some random person whose website becomes popular. But it seems they do.
HIBP never made any sense. "Send me your private info and I will check and make sure it has not been leaked. Too late. You just leaked it to me." This sort of obvious blunder had to be fixed. Still too late for anyone who used it before HIBP was "fixed". Why place any confidence in someone who cannot spot these issues? Here, he struggles to implement an API.
Stupid websites can become popular. It happens. Many websites have sought to exploit data breaches. What better way to collect working email addresses that people care about than to let people submit them to you to check against a dump of a data breach? Unless they download the dump themselves, they have no way to confirm you actually checked anything. And if they did download the dump(s), then there is no reason to submit anything to an "HIBP" website.
If people are getting charged for API access, and submitted personal info to HIBP, then they should get some enforceable terms in return. If you collect people's information, then you are liable for the damage that may result if HIBP is breached. It's doubtful the API customers get any such protections.
Did I miss something? I don't understand what the "non-interactive challenge" is here:
> The widget takes responsibility for running the non-interactive challenge and returning a token
I get that a widget can return a token, and then you trust the token and do the rest. But what determines whether a caller receives a token or not? That is, what's the actual challenge? How is the distinction between bot and real human user actually decided?
If you're just pinging the endpoint without providing the necessary header then you are effectively a bot in this case. If you go through the motions to fill out the form and then submit it you're at least not explicitly telling the service you are going against its wishes by pinging the API directly.
As for the challenge, that seems to live within Cloudflare's implementation, which then returns the token to submit with the form. HIBP then verifies the token to make sure it's valid before checking for the data and sending a response.
Probably a form of proof of work which is costly for bots but fair on normal users (combined with the usual fingerprinting and IP checks, which determine how hard the challenge should be for a given request).
For something like this service, simple rate limiting per IP / netblock, along with TCP/IP fingerprinting for VPN endpoints and such, could be very effective.
I run histograms for connections per netblock on my email servers, and even removing only the most egregious attempts at abuse almost empties my logs.
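The histogram itself is trivial to produce; roughly something like this, assuming a log with one client IP per line (the /24 and /48 bucketing is just a crude notion of "netblock"):

```python
import ipaddress
import sys
from collections import Counter

counts = Counter()
for line in sys.stdin:                      # e.g. a connection log, one client IP per line
    try:
        ip = ipaddress.ip_address(line.split()[0])
    except (ValueError, IndexError):
        continue
    prefix = 24 if ip.version == 4 else 48  # crude "netblock" definition
    counts[ipaddress.ip_network(f"{ip}/{prefix}", strict=False)] += 1

for net, n in counts.most_common(20):
    print(f"{n:8d}  {net}")
```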
On the other hand, Cloudflare has issues with tons of less popular networks, with VPNs, with less affluent countries, with non-mainstream OSes and browsers, et cetera, all of which ends up punishing many people in ways that are completely disproportionate to the amount of abuse avoided.
It reminds me of the quote from fortune(6):
As far as we know, our computer has never had an undetected error.
-- Weisert
You don't know how many people Cloudflare has marginalized because you don't see their visits.
https://imgur.com/a/K5z1X2R