
I have worked on event-driven systems which read from queues for my entire career: messaging first, then Kafka.

Surely in most business scenarios letting data back up on queues for a while suffices?

In the situations where it doesn’t, add some more capacity.

Back pressure is something I've never had to deal with, even in HFT and electronic trading situations.



Without backpressure, queues only delay the inevitable. At best. Queuing delays will increase latency. Extra memory pressure can even make things worse by decreasing throughput. If latency gets high enough, clients will start to error out or misbehave in all sorts of unpredictable ways.
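A back-of-the-envelope illustration of that point (the rates here are made up for the sake of arithmetic): if producers outrun consumers by even 10%, an unbounded queue turns that into latency that grows without limit.

```python
# Hypothetical rates: producers slightly outpace consumers.
arrival_rate = 1_100   # messages/sec enqueued
service_rate = 1_000   # messages/sec the consumers can drain

# The backlog grows by the difference, every second, forever.
backlog_growth = arrival_rate - service_rate   # 100 msgs/sec

# After one hour of sustained overload:
backlog = backlog_growth * 3600                # 360,000 queued messages
queueing_delay = backlog / service_rate        # ~360 s before a new message is even looked at
```

Nothing crashed, nothing errored, yet every client is now waiting six minutes and the wait keeps climbing.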

> In the situations where it doesn’t, add some more capacity.

That obviously has its limits too. Some people actually care about cost. Also, added capacity might be totally wasted if there's any sort of affinity between servers and data. Maybe it works if you're in a stateless middle tier, but not e.g. in a storage system. You can add capacity to those, sure, but rebalancing the data to take advantage of the new capacity can be a lengthy process and compete with user traffic (making user-perceived latency even worse) if you're not careful. I've also seen too many cases where rebalancing undid the careful work done during initial data placement to maximize data safety. Oops.

> Back pressure is something I’ve never had to deal with

Count your blessings.


Letting data back up in unbounded queues means (a) you have giant-ass queues sitting around hogging resources, (b) upstream components are going to see service times go through the roof as work sits around doing nothing in these queues, and (c) upstream clients have no way to predict or avoid these service-time spikes due to queueing (versus, say, operations that just take a while), which prevents them from making intelligent decisions about their own work that could avoid the spikes.

I've worked in the embedded space now for about a decade (network, storage, databases). Everywhere I've been where the product has no backpressure, the product inevitably suffers from the above issues, which translate directly into customer dissatisfaction. And it's never easy to add backpressure to a system which doesn't have it.

It's also not always an option to just "add more capacity" -- there are many situations where that's not possible, or requires time to enact. Meanwhile, your queues are growing without bound.

Build backpressure in from the start.


Queues are fine, but:

1) Queues should be of finite size (exactly what size is entirely application dependent, but the point is that it's almost never ok to let a queue grow indefinitely).

2) When a queue is full, producers should often be blocked (backpressure).

Obviously it depends a lot on the particular use case, but IME it's usually risky to think that a 'sufficiently large' queue is going to be an acceptable solution.
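Points (1) and (2) can be sketched with Python's standard `queue.Queue` (the bound of 10 is arbitrary and illustrative): a blocking `put()` on a full, finite queue is the backpressure.

```python
import queue
import threading

# Point (1): a finite queue. Point (2): put() blocks when it's full,
# so a slow consumer slows the producer instead of growing memory.
work = queue.Queue(maxsize=10)
results = []

def consumer():
    while True:
        item = work.get()
        if item is None:          # sentinel: shut down
            break
        results.append(item * 2)  # stand-in for real processing
        work.task_done()

t = threading.Thread(target=consumer)
t.start()

for i in range(100):
    work.put(i)   # blocks whenever the consumer falls 10 items behind
work.put(None)
t.join()
```

The producer never needs to know how overloaded the consumer is; the bounded queue communicates it implicitly by refusing to accept more work until there's room.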


I've worked with systems that have petabytes flowing through them every day. If your data starts backing up, it costs lots of real money in terms of hard disk space. Backpressure crosses organizational boundaries--the team running the system which consumes data is not the same team running the system which produces data. We exposed backpressure through our APIs... if our system was overloaded, we returned a specific error code, and clients were expected to retry sensibly.

Sometimes "add more capacity" just costs too much.


Failure codes themselves aren't backpressure, though; it's the client-side "retry later" behavior which creates that effect. But if you have that behavior, it's better just to signal it directly, because a simple fail, back off, retry system actually increases overall load due to duplicate requests retrying. And if it gets bad enough, it potentially results in livelock-like behavior. Simply slowing down ACKs, or explicit "item queued, please wait before sending another" return codes, are far more effective. See buffer crediting and flow control.
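A minimal sketch of the buffer-crediting idea mentioned above (all class and method names here are hypothetical): the client holds a fixed pool of credits, spends one per in-flight request, and gets it back with each ACK. A slow server then throttles the sender directly, with no failed requests to retry.

```python
import threading

class CreditWindow:
    """Client-side credit pool: at most `credits` requests in flight.
    acquire() blocks instead of failing, so an overloaded server slows
    the sender down rather than triggering retry storms."""
    def __init__(self, credits):
        self._sem = threading.Semaphore(credits)

    def send(self, transport, msg):
        self._sem.acquire()       # spend a credit; blocks at zero
        transport.send(msg)

    def on_ack(self):
        self._sem.release()       # the server's ACK returns the credit

# Toy in-process transport to show the mechanics; a real one would be
# a socket, with ACKs arriving asynchronously.
class LoopbackTransport:
    def __init__(self, window):
        self.window = window
        self.delivered = []
    def send(self, msg):
        self.delivered.append(msg)
        self.window.on_ack()      # immediate ACK in this toy version

window = CreditWindow(credits=4)
transport = LoopbackTransport(window)
for i in range(10):
    window.send(transport, i)
```

This is the same shape as TCP's sliding window: the receiver advertises capacity, and the sender simply cannot exceed it.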

OTOH, I'm sort of amazed by the people who reach for queues to solve loading issues without first ensuring that the system works properly at full load without any queuing. Queues only serve to smooth out bursty behavior, and they come with their own problems (bufferbloat-related latency).


> ... a simple fail, back off, retry system actually increases overall load due to duplicate requests retrying.

Well, two things.

First, our clients didn't have a "simple" back-off-and-retry system. They tended to have more sophisticated back-off and retry systems that didn't increase pressure on services that were overloaded. There are a number of techniques you can use to accomplish this, and you can wrap several of them up in a library, use that library in your client code, and have the client code (a fat client) be the official API for your service.
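One of the standard techniques alluded to here is capped exponential backoff with "full" jitter, which spreads retries out in time so that clients which failed together don't all retry together (the constants below are illustrative, not from the system being described):

```python
import random

def backoff_delays(base=0.1, cap=30.0, attempts=6):
    """Capped exponential backoff with full jitter: each retry waits a
    random amount between 0 and min(cap, base * 2**attempt), so a
    synchronized wave of failures doesn't become a synchronized wave
    of retries hammering the overloaded service."""
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(random.uniform(0, ceiling))
    return delays
```

A client would sleep for each delay in turn, retrying only on the service's explicit "overloaded, try again later" code and giving up (or alerting) once the attempts are exhausted.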

Second, this also assumes that front-end load is significant to begin with. In our system, it definitely wasn't: requests were super cheap. You could hammer the front end extremely hard and it would just respond with "try again later" until some capacity became available, and then it would let in some work. An actual request was quite small; it was the work represented by that request that was large.

> OTOH, I'm sort of amazed by the people who reach for queues to solve loading issues without first ensuring that the system works properly at full load without any queuing. Queues only serve to smooth out bursty behavior and come with their own problems (bufferbloat-related latency).

Queues in our system were a necessary part of the design. The system would not operate efficiently without them, because work items needed to be batched in order to be processed efficiently. I don't think there's a way that you can indict a system for using queues unless you know something about the design requirements.
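The batching point can be sketched like this (the helper and its batch size are hypothetical, not from the system described): a consumer drains up to N items per pass so that per-batch fixed costs, like syscalls, fsyncs, or network round trips, are amortized across many work items.

```python
import queue

def drain_batch(q, max_batch=64):
    """Pull up to max_batch items, blocking only for the first one.
    Processing the batch as a unit amortizes fixed per-operation
    overhead across every item in it."""
    batch = [q.get()]                  # block until at least one item exists
    while len(batch) < max_batch:
        try:
            batch.append(q.get_nowait())
        except queue.Empty:
            break                      # queue drained; process what we have
    return batch

q = queue.Queue()
for i in range(150):
    q.put(i)
first = drain_batch(q)   # 64 items in one pass, not 64 separate round trips
```

Under light load this degrades gracefully to batches of one; under heavy load it naturally produces full batches, which is exactly when the efficiency matters.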

There are absolutely other reasons to use queues other than smoothing out bursty behavior.


Obviously you've got to pay for something. You can increase your compute capacity to process requests more quickly, pay for storage on a queue to regulate flow, or do some kind of internal code optimization to decrease resource demands; all of these cost money one way or another. Alternatively, you go up the food chain, determine the ROI on all the data you're receiving, and maybe stop collecting some things that aren't valuable. If they are valuable, then it's a no-brainer to pay what you have to.



