And you will still be exposed to being blindsided by something outside your control. It's really only within your control if you can think of it and plan for it ahead of time, and there will certainly be things we don't consider. You can call that a failure, but it happens all the time; it's reality.

What if a political event impacts you, for instance? A pandemic? A storm taking out a major data center? A weird Linux kernel edge case that only happens beyond a certain point in time? That only sounds ridiculous because it hasn't happened, but weird things like that happen all the time. There are so many unseen possibilities.

I understand that might sound unreasonable or facetious or like I'm expanding the scope.

The point is, the more confident you are that you've built something with no SPOF, the more exposed you are to the risk of one, because one probably does exist.



Honestly, you are not making any sense. This is not how engineering works. If you design for resilience, you get more resilience, and you build confidence as you see evidence of how the system works in the real world. Furthermore, with resilience you always cover all risks; it's just that you don't immediately reach the fine granularity of decisions that avoid needlessly failing over to servers in different countries. You improve that granularity as you learn from actual operations and modify your designs accordingly.

I remember when I first deployed a DNS-routed system, it was too reactive, constantly jumping between servers: the monitoring was too sensitive, it didn't wait for servers to stabilize before returning them to the mix, and the exponential backoff was taking servers out for far too long. But even with all that, it was still able to avoid outages caused by data center failures and connectivity problems.
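To make that tuning concrete, here is a minimal sketch (in Python) of the two fixes described: require a few consecutive passing health checks before a server returns to the mix, and cap the exponential backoff so a failed server isn't benched for too long. The names and thresholds are illustrative assumptions, not the actual system.

    import time

    STABLE_CHECKS_REQUIRED = 3   # consecutive passes before re-adding a server
    BASE_BACKOFF_S = 30          # first backoff interval after a failure
    MAX_BACKOFF_S = 600          # cap so backoff never benches a server for hours

    class ServerHealth:
        def __init__(self, address):
            self.address = address
            self.consecutive_passes = 0
            self.failures = 0
            self.next_check_at = 0.0
            self.in_pool = False

        def record_check(self, passed, now=None):
            now = now if now is not None else time.time()
            if passed:
                self.consecutive_passes += 1
                self.failures = 0
                # Only return the server to the pool once it has stabilized.
                if self.consecutive_passes >= STABLE_CHECKS_REQUIRED:
                    self.in_pool = True
            else:
                self.consecutive_passes = 0
                self.failures += 1
                self.in_pool = False
                # Exponential backoff, capped, before probing again.
                delay = min(BASE_BACKOFF_S * 2 ** (self.failures - 1), MAX_BACKOFF_S)
                self.next_check_at = now + delay

The stabilization counter is what stops the constant jumping between servers, and the cap on the backoff keeps servers from being taken out of rotation for far too long.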


It does make sense, and it's paradoxical, I know.

> If you design for resilience, you get more resilience, and you build confidence as you see evidence of how the system works in the real world.

You simply can't foresee or eliminate all risk. This is referred to as "the turkey problem." It's not my idea, but one I certainly subscribe to.

https://www.convexresearch.com.br/en/insights/the-turkey-pro...


The whole idea behind resilience is to cover unforeseeable risks; the turkey problem just doesn't apply here. I would even say that if a system doesn't solve the turkey problem, it cannot be called resilient. And high availability without resilience is not practically possible.


> The whole idea behind resilience is to cover unforeseeable risks

Speaking of things that don't make sense... if it's unforeseeable, one will have a difficult time adequately preparing for it.


It's not difficult, it's just different. It's the difference between predicting that a truck might crash into a data center and building a concrete wall around it, versus designing the system in such a way that users only ever resolve to servers that are currently available, regardless of what happened to the ones in the data center the truck crashed into.


... and after you've solved for the truck problem, you have a potentially infinite list of other things to plan for, some of which you will not foresee. And of course, there's probably an upper bound on the time you can spend preparing for such things.

Famous to the point of being a cliché, the Titanic was thought to be unsinkable, and I would have had a similarly hard time convincing the engineers behind the ship's design to believe otherwise.

The level of confidence you're displaying in predicting the unforeseeable is something you may want to take a deeper look at.


You are missing the point. Solving the truck problem is exactly what you shouldn't do, at least not until your system is resilient, because the next incident could be something entirely different: it could be law enforcement raiding the data center, and your wall around it won't protect it from them. So instead you approach the system in terms of what it has to rely on and all the possible states of the things it relies on, which maps to a very small number of decisions, like whether a server is available or not. If it's not available, it really doesn't matter which of the infinite things that could happen to it, or to the data center it's in, actually did: you simply don't return it to users, and you keep enough independent servers in enough independent data centers to return to users to achieve a specific availability. It's really not difficult.
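For illustration, here is a minimal sketch of that one decision, assuming a hypothetical inventory of servers grouped by data center (all names and addresses are made up): whatever took a server out, the answer handed to users only ever contains servers that are currently available, preferably drawn from independent data centers.

    import random

    # Hypothetical inventory: servers grouped by the data center they live in.
    servers_by_dc = {
        "dc-east": ["203.0.113.10", "203.0.113.11"],
        "dc-west": ["198.51.100.20", "198.51.100.21"],
        "dc-eu":   ["192.0.2.30"],
    }

    def answer_for_query(healthy, want=2):
        """Return up to `want` healthy addresses, preferring distinct data centers.

        `healthy` is the set of addresses currently passing health checks; why an
        address left that set (truck, raid, kernel bug) is irrelevant here.
        """
        picks = []
        # First pass: at most one healthy server per data center, for independence.
        for dc, addrs in servers_by_dc.items():
            candidates = [a for a in addrs if a in healthy]
            if candidates:
                picks.append(random.choice(candidates))
            if len(picks) >= want:
                return picks
        # Second pass: top up from any remaining healthy servers if needed.
        remaining = [a for addrs in servers_by_dc.values() for a in addrs
                     if a in healthy and a not in picks]
        picks.extend(remaining[: want - len(picks)])
        return picks

For example, answer_for_query({"203.0.113.10", "198.51.100.20"}) returns one address from each of two data centers, and a data center that just lost power or had a truck crash into it simply contributes nothing to the answer.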

I understand that most of those leetcode corporations don't care much about resilience, and are likely even incapable of producing highly reliable systems, which may give you the false impression that reliability is some kind of unachievable fantasy. But it's not: it's something we have done enough research on and can do really well today if needed. We are not in the Titanic era anymore.

I have high confidence in these things (not in "predicting the unforeseeable") because I've done them myself. My edge infrastructure has had maybe half an hour of downtime total over many years, almost a decade already.



