> the occasional outage isn't worth the cost and effort of moving out. And looke...

jedberg · 2025-10-20T21:50:07 1760997007

You're suffering from survivorship bias. You know that old adage about the bullet holes in the planes, and someone pointed out that you should reinforce that parts without bullet holes, because these are the planes that came back.

It's the same thing here. Do you think other providers are better? If people moved to other providers, things would still go down, more likely than not it would be more downtime in aggregate, just spread out so you wouldn't notice as much.

At least this way, everyone knows why it's down, our industry has developed best practices for dealing with these kinds of outages, and AWS can apply their expertise to keeping all their customers running as long as possible.

Perseids · 2025-10-21T06:29:58 1761028198

> If people moved to other providers, things would still go down, more likely than not it would be more downtime in aggregate, just spread out so you wouldn't notice as much.

That is the point, though: Correlated outages are worse than uncorrelated outages. If one payment provider has an outage, chose another card or another store and you can still buy your goods. If all are down, no one can shop anything[1]. If a small region has a power blackout, all surrounding regions can provide emergency support. If the whole country has a blackout, all emergency responders are bound locally.

[1] Except with cash – might be worth to keep a stash handy for such purposes.

bartread · 2025-10-22T17:10:22 1761153022

Yeah, exactly this. I don’t know why the person who responded to me is talking about survivorship bias… and I suppose I don’t really care because there’s a bigger point.

The internet was originally intended to be decentralised. That decentralisation begets resilience.

That’s exactly the opposite of what we saw with this outage. AWS has give or take 30% of the infra market, including many nationally or globally well known companies… which meant the outage caused huge global disruption of services that many, many people and organisations use on a day to day basis.

Choosing AWS, squinted at through a somewhat particular pair of operational and financial spectacles, can often make sense. Certainly it’s a default cloud option in many orgs, and always in contention to be considered by everyone else.

But my contention is that at a higher level than individual orgs - at a societal level - that does not make sense. And it’s just not OK for government and business to be disrupted on a global scale because one provider had a problem. Hence my comment on legislators.

It is super weird to me that, apparently, that’s an unorthodox and unreasonable viewpoint.

But you’ve described it very elegantly: 99.99% (or pick the number of 9s you want) uptime with uncorrelated outages is way better than that same uptime with correlated, and particularly heavily correlated, outages.

bartread · 2025-10-20T23:38:54 1761003534

That’s a pretty bold claim. Where’s your data to back it up?

More importantly you appear to have misunderstood the scenario I’m trying to avoid, which is the precise situation we’ve seen in the past 24 hours where a very large proportion of internet services go down all at the same time precisely because they’re all using the same provider.

And then finally the usual outcome of increased competition is to improve the quality of products and services.

I am very aware of the WWII bomber story, because it’s very heavily cited in corporate circles nowadays, but I don’t see that it has anything to do with what I was talking about.

AWS is chosen because it’s an acceptable default that’s unlikely to be heavily challenged either by corporate leadership or by those on the production side because it’s good CV fodder. It’s the “nobody gets fired for buying IBM” of the early mid-21st century. That doesn’t make it the best choice though: just the easiest.

And viewed at a level above the individual organisation - or, perhaps from the view of users who were faced with failures across multiple or many products and services from diverse companies and organisations - as with today (yesterday!) we can see it’s not the best choice.

mk89 · 2025-10-21T04:00:58 1761019258

This is an assumption.

Reality is, though, that you shouldn't put all your eggs in the same basket. And it was indeed the case before the cloud. One service going down would have never had this cascade effect.

I am not even saying "build your own DC", but we barely have resiliency if we all rely on the same DC. That's just dumb.

ytpete · 2025-10-21T00:33:55 1761006835

From the standpoint of nearly every individual company, it's still better to go with a well-known high-9s service like AWS than smaller competitors though. The fact that it means your outages will happen at the same time as many others is almost like a bonus to that decision — your customers probably won't fault you for an outage if everyone else is down too.

That homogeneity is a systemic risk that we all bear, of course. It feels like systemic risks often arise that way, as an emergent result from many individual decisions each choosing a path that truly is in their own best interests.

bartread · 2025-10-22T17:22:12 1761153732

Yeah, but this is exactly not what the internet is supposed to be. It’s supposed to be decentralised. It’s supposed to be resilient.

And at this point I’m looking at the problem and thinking, “how do we do that other than by legislating?”

Because left to their own devices a concerningly large number of people across many, many organisations simply follow the herd.

In the midst of a degrading global security situation I would have thought it would be obvious why that’s a bad idea.