I think rate limiting is the wrong idea. Say, for example, a client wants to re-fetch everything it has cached: it may send a burst of requests in a short amount of time, and some of those requests may be wrongly rejected due to rate limits. This is what happens when a browser refreshes a page, for example.
A better approach, I think, is delaying request handling based on resource allocation. If one client is disproportionately using more time than others, that client's requests get queued up to process later, while well-behaved clients get their requests handled quickly. I think this is a more realistic approach and imposes fewer arbitrary restrictions.
This is why I've suggested, and used, fair queuing for rate limiting.
Any source can make lots of requests, but they queue up by IP address. (In practice, you have queues organized by hashed IP address, which you service round-robin. That's how routers do fair queuing.) If one queue gets very long, then you start sending 429 errors for new adds to that queue. Otherwise, the requests from the heavy load source just take a little longer.
Unlike most rate limiters, this requires little or no tuning. Each source IP address gets an equal amount of service. Sources which generate heavy load compete with themselves, not others.
The main tuning parameter is how long a queue can get. It's useful to have both length and time limits, but only hostile sources should hit those, and those limits don't change with capacity or number of users. Brief load transients are handled normally, if slowly.
This won't stop a distributed denial of service attack from many IP addresses. But anything coming from a small number of sources will be handled without much trouble.
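A minimal sketch of the fair-queuing scheme described above. The bucket count, queue limit, and class names are my own assumptions, not taken from any particular router implementation:

```python
import hashlib
from collections import deque

NUM_QUEUES = 256      # buckets of hashed source IPs (assumed value)
MAX_QUEUE_LEN = 50    # beyond this, new requests get a 429 (assumed value)

class FairQueue:
    """Requests bucket by hashed source IP; buckets are serviced round-robin,
    so a heavy source competes with itself rather than with everyone else."""
    def __init__(self):
        self.queues = [deque() for _ in range(NUM_QUEUES)]
        self.cursor = 0

    def _bucket(self, ip: str) -> int:
        return int(hashlib.sha256(ip.encode()).hexdigest(), 16) % NUM_QUEUES

    def enqueue(self, ip: str, request) -> bool:
        """Returns False when the source's queue is full;
        the caller should respond with a 429."""
        q = self.queues[self._bucket(ip)]
        if len(q) >= MAX_QUEUE_LEN:
            return False
        q.append(request)
        return True

    def next_request(self):
        """Round-robin: take one request from the next non-empty bucket."""
        for _ in range(NUM_QUEUES):
            q = self.queues[self.cursor]
            self.cursor = (self.cursor + 1) % NUM_QUEUES
            if q:
                return q.popleft()
        return None
```

A real deployment would add the time limit mentioned above (drop queued requests older than some threshold) and service queues from worker threads rather than a polling loop.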
Once you start getting into quality-of-service, basic rate limiting like the one discussed is not enough. I think Stripe did a better job of covering such concerns in their rate limiter post. They specifically talk about being lenient with "bursty" traffic, as that was a legitimate use case for clients.
Yes, this is why you want to use a token bucket rate limiter (which for some reason was considered and rejected by the original post). We wrote it up here and have an open-source impl on github that's in production serving hundreds of millions of rate limits: https://medium.com/smyte/rate-limiter-df3408325846
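For illustration, a minimal token bucket along those lines (this is my own sketch, not the Smyte implementation; the capacity and refill numbers are arbitrary). The capacity bounds the burst while the refill rate sets sustained throughput, which is what lets it be lenient with bursty clients:

```python
import time

class TokenBucket:
    """capacity = max burst size; refill_rate = tokens added per second."""
    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = capacity          # start full: allow an initial burst
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        # Lazily refill based on elapsed time, capped at capacity.
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

For example, `TokenBucket(capacity=10, refill_rate=1.0)` gives each client a burst of 10 but a sustained rate of 60/minute. A production version would keep one bucket per client key in a shared store rather than in process memory.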
I particularly liked a comment about a fairly elegant way to rate limit with a caching proxy[1] in that HN post. When your main application returns a rate limit reached request, cache that in your proxy for a certain amount of time, and once it's timed out of the cache, they'll hit your main app again.
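The same idea can be sketched in application code (the class name and TTL here are hypothetical; a real deployment would use the proxy's own response caching on the 429 instead):

```python
import time

class RateLimitCache:
    """Caches 'rate limited' verdicts at the proxy layer so the backend
    isn't hit again by that client until the TTL expires."""
    def __init__(self, ttl: float):
        self.ttl = ttl
        self.blocked = {}   # client key -> expiry timestamp

    def is_blocked(self, client: str) -> bool:
        expiry = self.blocked.get(client)
        if expiry is None:
            return False
        if time.monotonic() >= expiry:
            del self.blocked[client]    # entry timed out of the cache
            return False
        return True

    def block(self, client: str):
        """Call when the backend returns a 429 for this client."""
        self.blocked[client] = time.monotonic() + self.ttl
```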
Where do you queue those requests? If you do it anywhere under your own control, you will accumulate memory and open connections.
If you issue a 429, the client knows it needs to wait and retry, and you've pushed the backpressure all the way past the demarcation line of your own infrastructure.
One could limit concurrent connections per address. The idea is that an OPTIONS request which doesn't take much resources at all could be treated with a different weight than a POST which costs the server time to process.
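As a sketch of that weighting idea, assuming some per-method costs (the numbers and names here are made up for illustration):

```python
# Hypothetical per-method weights: cheap requests drain the budget slowly.
METHOD_COST = {"OPTIONS": 0.1, "GET": 1.0, "POST": 5.0}

class WeightedLimiter:
    """Charges each client against a per-window budget; expensive
    methods use it up faster than cheap ones."""
    def __init__(self, budget: float):
        self.budget = budget
        self.spent = {}     # client key -> cost consumed this window

    def allow(self, client: str, method: str) -> bool:
        cost = METHOD_COST.get(method, 1.0)   # unknown methods cost 1.0
        used = self.spent.get(client, 0.0)
        if used + cost > self.budget:
            return False
        self.spent[client] = used + cost
        return True

    def reset_window(self):
        """Call on a timer to start a new accounting window."""
        self.spent.clear()
```

With a budget of 10, a client could make two POSTs or a hundred OPTIONS requests per window, which captures the asymmetry described above.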
depending on the resources being thrown at you, just limiting concurrent connections per address could help. (there's the question of how to keep the per-address connection counts distributed across your load-balancing layer and consistent, or at least consistent enough to mostly do the right thing most of the time.)
maybe that doesn't help with ipv6, though. you'd run out of memory if you tried to keep track of every /128, and different ISPs hand out different blocks to customers (some give out a /64, maybe some give each customer their own /72, etc).
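one hedge (my assumption, not something established in this thread) is to key limits on a fixed prefix such as the /64, accepting that some ISPs allocate larger or smaller blocks:

```python
import ipaddress

def limiter_key(addr: str) -> str:
    """Collapse IPv6 sources to their /64 so one customer can't spread
    load across 2**64 addresses; IPv4 stays per-address. The /64 cutoff
    is an assumption about typical ISP allocations."""
    ip = ipaddress.ip_address(addr)
    if ip.version == 6:
        net = ipaddress.ip_network((addr, 64), strict=False)
        return str(net)
    return addr
```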
> The idea is that an OPTIONS request which doesn't take much resources at all could be treated with a different weight than a POST which costs the server time to process.
I'm curious, what frameworks/libraries/whatever have you seen that use HTTP OPTIONS?
More sophisticated ways to reject connections rely on heuristics and may result in denying legitimate requests. I don't have all the answers, I'm just suggesting a way that doesn't consider all requests equal.
Every CORS pre-flight request uses the OPTIONS method. It is also used to advertise which methods are available on a route.
This depends on why you want to rate limit. If it's the classic case of stopping one customer consuming all resources because of an overenthusiastic script, it's fine. If you're trying to avoid paying for lots of calls to an expensive API, or competitors scraping data, you want to reject the requests entirely rather than just delaying them.
fwiw TokenBuckets are very happy to separate "burst" from "refill rate". ie It's easy to give each client 10 tokens, but then have their bucket refill at 60/minute.
Queueing sounds nice, but the web is synchronous. If somebody crushes you, do you really want to queue their requests and then process them 5 minutes later? After the connection has already terminated?