This is the real problem. Even if you don't run anything in AWS directly, something you integrate with will. And when us-east-1 is down, it doesn't matter if those services are in other availability zones. AWS's own internal services rely heavily on us-east-1, and most third-party services live in us-east-1.
It really is a single point of failure for the majority of the Internet.
This becomes the reason to run in us-east-1 if you're going to be single region. When it's down, nobody is surprised that your service is affected. If you're all-in on some other region and it goes down, you look like you don't know what you're doing.
> Even if you don't run anything in AWS directly, something you integrate with will.
Why would a third party be in your product's critical path? It's like the old business school thing about "don't build your business on the back of another."
It's easy to say this, but in the real world, most of the critical path is heavily dependent on third-party integrations: user auth, storage, logging, etc. Even if you're somewhat resilient against failures (e.g. you can live without logging and your app doesn't hard fail), an outage is still potentially going to cripple your service. And even if your entire app is resilient and doesn't fail, there are still bound to be tons of integrations that will limit functionality, or make the app appear broken in some way to users.
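To make the "live without logging" case concrete, here is a minimal sketch of treating a non-critical dependency as best effort. Everything in it is hypothetical: `ship_log`, `handle_request`, and the `send_log` callable stand in for whatever vendor client you actually use.

```python
import logging

logger = logging.getLogger("app")


def ship_log(event, send):
    """Try to ship a log event to a remote service (hypothetical `send`).

    Any failure is swallowed so it never reaches the caller.
    Returns True if the event was sent, False if it was dropped.
    """
    try:
        send(event)
        return True
    except Exception:
        # The vendor is down or slow: note it locally and move on.
        logger.warning("remote logging unavailable, dropping event")
        return False


def handle_request(payload, send_log):
    """Do the real work first, then log on a best-effort basis."""
    result = {"status": "ok", "echo": payload}
    ship_log({"event": "request", "payload": payload}, send_log)
    return result
```

The request still succeeds when `send_log` raises. What you lose is the log line, which is exactly the "limited functionality" trade-off: the app stays up, but something is missing.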
Third-party things are in the critical path because, most of the time, they are still more reliable than self-hosting everything; because they're cheaper than anything you can engineer in-house; because no app is an island.
It's been decades since I worked on something that was completely isolated from external integrations. We do the best we can with redundancy, fault tolerance, auto-recovery, and balance that with cost and engineering time.
If you think this is bad, take a look at the uptime of complicated systems that are 100% self-hosted. Without a Fortune 500 level IT staff, you can't beat AWS's uptime.
Clearly these are non-trivial trade-offs, but I think using third parties is not an either-or question. Depending on the app and the type of third-party service, you may be able to make design choices that allow your systems to survive a third-party outage for a while.
E.g., a hospital could keep recent patient data on-site and sync it up with the central cloud service as and when that service becomes available. Not all systems need to be linked in real time. Sometimes it makes sense to create buffers.
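A rough sketch of that kind of buffer, assuming an `upload` callable that raises when the central service is unreachable. The `StoreAndForward` class is made up for illustration, and the queue here lives in memory only; a real on-site system would need durable storage.

```python
from collections import deque


class StoreAndForward:
    """Keep records locally and push them upstream when possible."""

    def __init__(self, upload):
        self._upload = upload      # hypothetical callable; raises on failure
        self._pending = deque()    # oldest record first

    def record(self, item):
        """Accept a record locally, then try to sync right away."""
        self._pending.append(item)
        self.flush()

    def flush(self):
        """Send pending records in order; stop at the first failure.

        Returns the number of records sent during this call.
        """
        sent = 0
        while self._pending:
            try:
                self._upload(self._pending[0])
            except Exception:
                break  # service still down; keep the backlog for later
            self._pending.popleft()
            sent += 1
        return sent

    @property
    def backlog(self):
        return len(self._pending)
```

A record is only removed after `upload` returns, so a crash or timeout mid-upload can deliver the same record twice. The receiving side has to tolerate duplicates.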
But the downside is that syncing things asynchronously creates complexity that can itself be the cause of outages or, worse, data corruption.
I guess it's a decision that can only be made on a case-by-case basis.
Not necessarily our critical path, but today CircleCI was greatly affected, which also affected our capacity to deploy. Luckily it was a Monday morning, so we didn't even have to deploy a hotfix.
Good luck naming a large company, bank, even utility that doesn't have some kind of dependency like this somewhere, even if they have mostly on-prem services.
The only ones I can really think of are the cloud providers themselves. I was at Microsoft, and absolutely everything was in-house (often to our detriment).
I think you missed the "critical path" part. Why would your product stop functioning if your admins can't log in with IAM or VPN in? Do you really need hands-on maintenance constantly? Why would your product stop functioning if Office is down? Are you managing your ops in Excel or something?
"Some kind of dependency" is fine and unavoidable, but well-architected systems don't have hard downtime just because someone somewhere you have no control over fucked up.
Since 2020, for some reason, a lot of companies have had fully remote workforces. If the VPN or auth goes down and workers can't log in, that's a problem. Think banks, call center work, customer service.