It just goes to show the gap between best practices in cloud computing and what everyone actually does in reality, including well-known industry names.
Eh, the "best practices" that would've prevented this aren't trivial to implement and are definitely far beyond what most engineering teams are capable of, in my experience. It depends on your risk profile. When we had cloud outages at the freemium game company I worked at, we just shrugged and waited for the systems to come back online - nobody dying because they couldn't play a word puzzle. But I've also had management come down and ask what it would take to prevent issues like that from happening again, and then pretend they never asked once it was clear how much engineering effort it would take. I've yet to meet a product manager that would shred their entire roadmap for 6-18 months just to get at an extra 9 of reliability, but I also don't work in industries where that's super important.
I'm sure that, like any company more than a handful of years old, they have super old, super critical systems running that they dare not touch for fear of torching the entire business. For all we know, they were trying to update one of those systems to be more resilient last night and things went south.