And you will still be exposed to being blindsided by something outside your control. It's really only within your control if you can think of it and plan for it ahead of time, and there will certainly be things we don't consider. You can call that a failure, but it happens all the time; it's reality.

What if a political event impacts you, for instance? A pandemic? A storm taking out a major data center? A weird Linux kernel edge case that only happens beyond a certain point in time? That only sounds ridiculous because it hasn't happened, but weird things like that happen all the time. There are so many unseen possibilities.

I understand that might sound unreasonable or facetious or like I'm expanding the scope.

The point is, the more confident you are that you've built something with no SPOF, the more exposed you are to the risk of one, because one probably does exist.



Honestly, you are not making any sense. This is not how engineering works. If you design for resilience, you get more resilience, and you build confidence as you see evidence of how the system works in the real world. Furthermore, with resilience you always cover all risks; it's just that you don't immediately reach the fine granularity of decisions that avoid needlessly failing over to servers in different countries. You improve that granularity as you learn from actual operations and modify your designs accordingly.

I remember when I first deployed a DNS-routed system, it was too reactive, constantly jumping between servers: the monitoring was too sensitive, it didn't wait for servers to stabilize before returning them to the mix, and the exponential backoff was taking servers out for far too long. But even with all that, it was still able to avoid outages caused by data center failures and connectivity problems.
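To make that tuning concrete, here is a minimal sketch (in Python) of the two fixes described: require a few consecutive passing health checks before a server returns to the mix, and cap the exponential backoff so a failed server isn't benched for too long. The names and thresholds are illustrative assumptions, not the actual system.

    import time

    STABLE_CHECKS_REQUIRED = 3   # consecutive passes before re-adding a server
    BASE_BACKOFF_S = 30          # first backoff interval after a failure
    MAX_BACKOFF_S = 600          # cap so backoff never benches a server for hours

    class ServerHealth:
        def __init__(self, address):
            self.address = address
            self.consecutive_passes = 0
            self.failures = 0
            self.next_check_at = 0.0
            self.in_pool = False

        def record_check(self, passed, now=None):
            now = now if now is not None else time.time()
            if passed:
                self.consecutive_passes += 1
                self.failures = 0
                # Only return the server to the pool once it has stabilized.
                if self.consecutive_passes >= STABLE_CHECKS_REQUIRED:
                    self.in_pool = True
            else:
                self.consecutive_passes = 0
                self.failures += 1
                self.in_pool = False
                # Exponential backoff, capped, before probing again.
                delay = min(BASE_BACKOFF_S * 2 ** (self.failures - 1), MAX_BACKOFF_S)
                self.next_check_at = now + delay

The stabilization counter is what stops the constant jumping between servers, and the cap on the backoff keeps servers from being taken out of rotation for far too long.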


It does make sense, and it's paradoxical, I know.

> If you design for resilience, you get more resilience, and you build confidence as you see evidence of how the system works in the real world.

You simply can't foresee or eliminate all risk. This is referred to as "the turkey problem." It's not my idea, but one I certainly subscribe to.

https://www.convexresearch.com.br/en/insights/the-turkey-pro...


The whole idea behind resilience is to cover unforeseeable risks; the turkey problem just doesn't apply here. I would even say that if a system doesn't solve the turkey problem, it cannot be called resilient. And high availability without resilience is not practically possible.


> The whole idea behind resilience is to cover unforeseeable risks

Speaking of things that don't make sense... if it's unforeseeable, one will have a difficult time adequately preparing for it.


It's not difficult, it's just different. It's the difference between predicting that a truck might crash into a data center and building a concrete wall around it, versus designing the system in such a way that users only ever resolve to servers that are currently available, regardless of what happened to the ones in the data center the truck crashed into.


... and after you've solved for the truck problem, you have a potentially infinite list of other things to plan for, some of which you will not foresee. And of course, there's probably an upper bound on the time you can spend preparing for such things.

Famous to the point of being a cliché, the Titanic was thought to be unsinkable, and I would have had a similarly hard time convincing the engineers behind the ship's design to believe otherwise.

The level of confidence you're displaying in predicting the unforeseeable is something you may want to take a deeper look at.


You are missing the point. Solving the truck problem is exactly what you shouldn't do, at least not until your system is resilient, because the next incident could be something entirely different: it could be law enforcement raiding the data center, and your wall around it won't protect it from them. So instead you approach the system in terms of what it has to rely on and all the possible states of the things it relies on, which maps to a very small number of decisions, like whether a server is available or not. If it's not available, it really doesn't matter which of the infinite things that could happen to it, or to the data center it's in, actually did: you simply don't return it to users, and you keep enough independent servers in enough independent data centers to return to users to achieve a specific availability. It's really not difficult.
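For illustration, here is a minimal sketch of that one decision, assuming a hypothetical inventory of servers grouped by data center (all names and addresses are made up): whatever took a server out, the answer handed to users only ever contains servers that are currently available, preferably drawn from independent data centers.

    import random

    # Hypothetical inventory: servers grouped by the data center they live in.
    servers_by_dc = {
        "dc-east": ["203.0.113.10", "203.0.113.11"],
        "dc-west": ["198.51.100.20", "198.51.100.21"],
        "dc-eu":   ["192.0.2.30"],
    }

    def answer_for_query(healthy, want=2):
        """Return up to `want` healthy addresses, preferring distinct data centers.

        `healthy` is the set of addresses currently passing health checks; why an
        address left that set (truck, raid, kernel bug) is irrelevant here.
        """
        picks = []
        # First pass: at most one healthy server per data center, for independence.
        for dc, addrs in servers_by_dc.items():
            candidates = [a for a in addrs if a in healthy]
            if candidates:
                picks.append(random.choice(candidates))
            if len(picks) >= want:
                return picks
        # Second pass: top up from any remaining healthy servers if needed.
        remaining = [a for addrs in servers_by_dc.values() for a in addrs
                     if a in healthy and a not in picks]
        picks.extend(remaining[: want - len(picks)])
        return picks

For example, answer_for_query({"203.0.113.10", "198.51.100.20"}) returns one address from each of two data centers, and a data center that just lost power or had a truck crash into it simply contributes nothing to the answer.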

I understand that most of those leetcode corporations don't care much about resilience, and are likely even incapable of producing highly reliable systems, which may give you the false impression that reliability is some kind of unachievable fantasy. But it's not: it's something we have done enough research on and can do really well today if needed. We are not in the Titanic era anymore.

I have high confidence in these things (not in "predicting the unforeseeable") because I've done them myself. My edge infrastructure has had maybe half an hour of downtime total over many years, almost a decade already.



