
>Yes it does happen but very rarely to the tune of a few hours every 5-10 years.

It is rare, but it happens at LEAST 2-3x a year. AWS us-east-1 has a major incident affecting multiple services (which in turn affect most downstream AWS services) multiple times a year, and it's rarely the same root cause twice.

Not very many people realize that there are some services that still run only in us-east-1.
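To put the two frequency claims side by side, here is a rough sketch that converts outage frequency and duration into annual availability. The 3-hour duration and the midpoints chosen for "5-10 years" and "2-3x a year" are assumptions for illustration, not measured AWS figures, and outages are assumed never to overlap.

```python
# Rough availability arithmetic. Durations are illustrative assumptions,
# not measured AWS figures; outages are assumed not to overlap.
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def availability(outages_per_year: float, hours_per_outage: float) -> float:
    """Fraction of the year the service is up."""
    downtime_hours = outages_per_year * hours_per_outage
    return 1.0 - downtime_hours / HOURS_PER_YEAR


# "A few hours every 5-10 years": assume one 3-hour outage every 7.5 years.
rare = availability(1 / 7.5, 3.0)
# "2-3x a year": assume 2.5 incidents a year of 3 hours each.
frequent = availability(2.5, 3.0)

print(f"rare:     {rare:.4%} up")
print(f"frequent: {frequent:.4%} up")
```

Under these assumptions the gap is roughly 0.4 versus 7.5 hours of downtime per year; both readings still clear 99.9%, so the disagreement is about which "nines" you are actually getting.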



Call it the aws holiday. Most other companies will be down anyway. It's very likely that your company can afford to be down for a few hours, too.


Imagine if the electricity supplier took that stance.


That's the wrong analogy though. We're not talking about the supplier - I'm sure Amazon is doing its damnedest to make sure that AWS isn't going down.

The right analogy is to imagine if businesses that used electricity took that stance, and they basically all do. If you're a hospital or some other business where a power outage is life or death, you plan by having backup generators. But if you're the overwhelming majority of businesses, you do absolutely nothing to ensure that you have power during a power outage, and it's fine.


> But if you're the overwhelming majority of businesses, you do absolutely nothing to ensure that you have power during a power outage, and it's fine.

It is fine because the electricity supplier is so good today that people don't see it going down as a risk.

Look at South Africa's electricity supplier for a different scenario.


But that is the stance for a lot of electrical utilities. Sometimes weather or a car wreck takes out power, and since it's too expensive to have spares everywhere, sometimes you have to wait a few hours for a spare to be brought in.


No, that's not the stance for electrical utilities (at least in most developed countries, including the US): the vast majority of weather events cause localized outages. The grid as a whole has redundancies built in; distribution to residential (and some industrial) customers does not. The grid expects failures of some power plants, transmission lines, etc. and can adapt with reserve power or, in very rare cases, by partial degradation (i.e. rolling blackouts). It doesn't go down fully.


Spain and Portugal had a massive power outage this spring, no?


Yeah, and it has a 30-page Wikipedia article with 161 sources (https://en.wikipedia.org/wiki/2025_Iberian_Peninsula_blackou...). Does that seem like a common occurrence?


> Sometimes weather or a car wreck takes out power

Not really? Most of the infrastructure is quite resilient and the rare outage is usually limited to a street or two, with restoration time mainly determined by the time it takes the electricians to reach the incident site. For any given address that's maybe a few hours per decade - with the most likely cause being planned maintenance. That's not a "spares are too expensive" issue, that's a "giving every home two fully independent power feeds is silly" issue.

Anything on a metro-sized level is pretty much unheard of, and will be treated as seriously as a plane crash. Such outages can essentially only be caused by systemic failure on multiple levels, as the grid is configured to survive multiple independent failures at the same time.

Comparing that to the AWS world: individual servers going down is inevitable and shouldn't come as a surprise. Everyone has redundancies, and an engineer accidentally yanking the power cables of an entire rack shouldn't even be noticeable to any customers. But an entire service going down across an entire availability zone? That should be virtually impossible, and having it happen regularly is a bit of a red flag.
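The "a pulled rack shouldn't be noticeable" intuition can be made concrete with a toy k-of-n model: if replicas fail independently, the chance of losing enough of them to take the service down shrinks geometrically with each replica added. The 1% failure probability and replica counts below are made-up numbers, and the independence assumption is exactly what a region-wide incident violates.

```python
from math import comb


def p_unavailable(n: int, k_needed: int, p_fail: float) -> float:
    """Probability that fewer than k_needed of n replicas are up.

    Assumes each replica fails independently with probability p_fail,
    which is exactly the assumption a shared-dependency outage breaks.
    """
    p_up = 1.0 - p_fail
    # Sum the binomial probabilities of having 0 .. k_needed-1 replicas up.
    return sum(
        comb(n, up) * p_up**up * p_fail ** (n - up)
        for up in range(k_needed)
    )


# Hypothetical: each rack is down 1% of the time, the service needs 1 of 3 racks.
print(p_unavailable(n=3, k_needed=1, p_fail=0.01))  # about 1e-06
# Same racks, but the service needs 2 of 3 up (e.g. a quorum).
print(p_unavailable(n=3, k_needed=2, p_fail=0.01))
```

In practice a fault in a shared dependency (a regional control plane, say) takes out every replica at once, so real-world downtime tends to be dominated by correlated failures rather than by this formula.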


I think this is right, but depending on where you live, local weather-related outages can still not-infrequently look like entire towns going dark for a couple days, not streets for hours.

(Of course, that's still not the same as a big-boy grid failure, Texas-ice-storm-sized, which is the kind of thing utilities are meant to actively prevent from ever happening.)


The electric grid is more important than most private-sector software projects by an order of magnitude.

Catastrophic data loss or lack of disaster recovery kills companies. AWS outages do not.


What if the electricity grid depends on some AWS service?


That would be a circular dependency.


The grid actually already has a fair number of (non-software) circular dependencies. This is why they have black start [1] procedures and run drills of those procedures. Or should, at least; there have been high profile outages recently that have exposed holes in these plans [2].

1. https://en.wikipedia.org/wiki/Black_start 2. https://en.wikipedia.org/wiki/2025_Iberian_Peninsula_blackou...
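A circular dependency is easy to spot mechanically once dependencies are written down as a graph. Below is a minimal depth-first-search sketch that returns one cycle if any exists; the component names in `example` are hypothetical and don't describe any real utility's or cloud provider's setup.

```python
from typing import Dict, List, Optional

WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished


def find_cycle(deps: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle (first node repeated at the end), or None."""
    color = {node: WHITE for node in deps}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        color[node] = GREY
        path.append(node)
        for dep in deps.get(node, []):
            state = color.get(dep, WHITE)
            if state == GREY:
                # dep is already on the current path, so we've closed a loop.
                return path[path.index(dep):] + [dep]
            if state == WHITE:
                found = visit(dep)
                if found is not None:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for node in list(deps):
        if color[node] == WHITE:
            found = visit(node)
            if found is not None:
                return found
    return None


# Hypothetical components: grid dispatch software hosted in a cloud region
# whose datacenter in turn draws power from that same grid.
example = {
    "grid_dispatch": ["cloud_region"],
    "cloud_region": ["datacenter_power"],
    "datacenter_power": ["grid_dispatch"],
}
print(find_cycle(example))
```

Roughly speaking, black-start planning is about making sure such a loop has an edge that can be broken, for instance generation that can start without the grid already being up.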


And?


It doesn't, though? Weird what-if.


You'd be surprised. GP asks a very interesting question. Some grid infra does indeed rely on AWS; definitely not all of it, but some aspects of it are hosted on AWS.


I worked for an energy tech startup that did consulting for big utility companies. It absolutely does.


Do you know for sure? And if it doesn't yet, you can bet someone will propose it in the future. So not a weird what-if at all.


This is already happening. I have looked at quite a few companies in the energy space this year; two of them had AWS as a critical dependency in their primary business processes, and that could definitely have an impact on the grid. In their defense: AWS presumably tests its fallback options (generators) with some regularity. But this isn't a far-fetched idea at all.


Isn't that basically Texas?


Texas is like if you ran your cloud entirely in SharePoint.


Let's not insult SharePoint like that.

It's like if you ran your cloud on an old Dell box in your closet while your parent company is offering to directly host it in AWS for free.


Also, every time your cloud went down, the parent company begged you to reconsider, explaining that all they need you to do is remove the disturbingly large cobwebs so they can migrate it. You tell them that to do so would violate your strongly-held beliefs, and when they stare at you in bewilderment, you yell “FREEDOM!” while rolling armadillos at them like they’re bowling balls.


Fortunately nearly all services running on AWS aren't as important as the electric utility, so this argument is not particularly relevant.

And regardless, electric service all over the world goes down for minutes or hours all the time.


Utility companies do not have redundancy for every part of their infrastructure either. Hence why severe weather or other unexpected failures can cause loss of power, internet or even running water.

Texas has had statewide power outages. Spain and Portugal suffered near-nationwide power outages earlier this year. Many US states are heavily reliant on the same single source for water. And remember the discussions on here about Europe's reliance on Russian gas?

Then you have the XKCD sketch about how most software products are reliant on at least one piece of open source software that is maintained by a single person as a hobby.

Nobody likes a single point of failure but often the costs associated with mitigating that are much greater than the risks of having that point of failure.

This is why "risk assessments" are a thing.


> Hence why severe weather or other unexpected failures can cause loss of power, internet or even running water.

Not all utility companies have the same policies, but all have a resiliency plan to avoid blackouts that is a bit more serious than "Just run it on AWS".


> Not all utility companies have the same policies, but all have a resiliency plan to avoid blackout that is a bit more serious than "Just run it on AWS".

You're arguing as if "run it on AWS" was a decision that didn't undergo the same kinds of risk assessment. As someone who's had to complete such processes (and in some companies, even define them), I can assure you that nobody of any competency runs stuff on AWS complacently.

In fact, running stuff with resilience in AWS isn't even as simple as "just running it in AWS". There's a whole plethora of things to consider, each with its own costs attached. As the meme goes, "one does not simply just run something on AWS".


> nobody of any competency runs stuff on AWS complacently.

I agree with this. My point is simply that we, as an industry, are not a very competent bunch when it comes to risk management; and that's especially true when compared to TSOs.

That doesn't mean nobody in our industry knows what they're doing, or that shit never hits the fan elsewhere, but I would argue that competent risk management is outlier behaviour here, whereas it's the norm in more safety-critical industries.

> As the meme goes "one does not simply just run something on AWS"

The meme has currency for a reason, unfortunately.

---

That being said, my original point was that utilities losing clients after a storm isn't the consequence of bad (or no) risk assessment; it's the consequence of them setting up acceptable loss thresholds depending on the likelihood of an event happening, and making sure that the network as a whole can respect these SLOs while strictly respecting safety criteria.


Nobody was suggesting that loss of utilities is a result of bad risk management. We are saying that all competent businesses run risk management and for most businesses, the cost of AWS being down is less than the cost of going multi cloud.

This is particularly true when Amazon hands out credits like candy. So you just need to moan to your AWS account manager about the service interruption and you'll be covered.
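The "cost of AWS being down vs. cost of going multi-cloud" comparison is, at its simplest, an expected-value calculation. Here is a sketch of that arithmetic; every figure (mitigation spend, downtime hours, loss per hour) is a hypothetical placeholder, and real risk assessments weigh much more than this.

```python
from dataclasses import dataclass


@dataclass
class Option:
    name: str
    annual_cost: float            # extra yearly spend to run this setup
    outage_hours_per_year: float  # expected downtime that still hits customers


def expected_annual_cost(opt: Option, loss_per_hour: float) -> float:
    """Mitigation spend plus expected outage loss (simple expected-value model)."""
    return opt.annual_cost + opt.outage_hours_per_year * loss_per_hour


def break_even_loss_per_hour(cheap: Option, resilient: Option) -> float:
    """Hourly outage cost above which the resilient option becomes cheaper overall."""
    extra_spend = resilient.annual_cost - cheap.annual_cost
    hours_saved = cheap.outage_hours_per_year - resilient.outage_hours_per_year
    return extra_spend / hours_saved


# Every number here is a hypothetical placeholder.
single_region = Option("single region", annual_cost=0.0, outage_hours_per_year=7.5)
multi_cloud = Option("multi-cloud", annual_cost=400_000.0, outage_hours_per_year=0.5)
loss_per_hour = 20_000.0

for opt in (single_region, multi_cloud):
    print(f"{opt.name}: {expected_annual_cost(opt, loss_per_hour):,.0f}")
print(f"break-even: {break_even_loss_per_hour(single_region, multi_cloud):,.0f} per hour")
```

With these invented inputs, single-region comes out at 150,000 a year against 410,000 for multi-cloud, and multi-cloud only pays off once an hour of downtime costs more than roughly 57,000. Service credits would, if anything, push the single-region figure lower.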


> Imagine if the electricity supplier took that stance.

Imagine if the cloud supplier was actually as important as the electricity supplier.

But since you mention it, there are instances of this and provisions for getting back up and running:

* https://en.wikipedia.org/wiki/2025_Iberian_Peninsula_blackou...

* https://en.wikipedia.org/wiki/Northeast_blackout_of_2003


And how many times has AWS gone down majorly like that? I think you'd lose count.


* https://en.wikipedia.org/wiki/Timeline_of_Amazon_Web_Service...

As someone who lives in Ontario, Canada, I got hit by the 2003 grid outage, which is once in >20 years. Seems like a fairly good uptime to me.

(Each electrical grid can perhaps be considered analogous to a separate cloud provider. Or perhaps, in US/CA, regions:

* https://en.wikipedia.org/wiki/North_American_Electric_Reliab...

)


It happens 2-3x a year during peacetime. Tail events are not homogeneously distributed across time.


Well technically AWS has never failed in wartime.


I don't understand. Peacetime?


Peacetime = When not actively under a sustained attack by a nation-state actor. The implication being, if you expect there to be a “wartime”, you should also expect AWS cloud outages to be more frequent during a wartime.


Don't forget stuff like natural disasters and power failures...or just a very adventurous squirrel.

AWS (over-)reliance is insane...


What about being actively attacked by a multinational state or an empire? Does it count or not?

Why people keep using "nation-state" term incorrectly in HN comments is beyond me...


I think people generally mean "state", but in the US-centric HN community that word is ambiguous and will generally be interpreted the wrong way. Maybe "sovereign state" would work?


Speaking as someone with a political science degree whose secondary focus was international relations: "nation-state" has a number of different definitions, and (despite the fact that dictionaries often don't include it) one of the most commonly encountered for a very long time has been "one of the principal subjects of international law, held to possess what is popularly, but somewhat inaccurately, referred to as Westphalian sovereignty". There is a historical connection between this use and the "state roughly correlating with a single nation" sense, which relates to the evolution of "Westphalian sovereignty" as a norm, but that's really neither here nor there, because the meaning would be the meaning regardless of its connection to the other meaning.

You almost never see the definition you are referring to used except in the context of explicit comparison of different bases and compositions of states. In practice there is very close to zero ambiguity about which sense is meant, and complaining about it is the same kind of misguided prescriptivism as (also popular on HN) complaining about the transitive use of "begs the question" because it has a different sense than the intransitive use.


It sounds more technical than “country” and is therefore better


To me it sounds more like saying "regime" instead of "government": it gives off a sense of distance and danger.


Not really. A nation-state-level actor is a hacker group funded by a country, not necessarily directly part of that country's government but at the same time kept at arm's length for deniability purposes. For instance, hacking groups operating from China, North Korea, Iran and Russia often do this with the tacit approval, and often the funding, of the countries they operate in, but are not part of the 'official' government. Obviously the various secret services, insofar as they have personnel engaged in targeted hacks, are also nation-state-level actors.


It could be a multinational state actor, but the term nation-state is the most commonly used, regardless of accuracy. You can argue over whether or not the term itself is accurate, but you still understood the meaning.


It makes a lot more sense if it was a typo for "peak".


It's a different kind of outage when the government disconnects you from the internet. Happens all the time, just not yet in the US.


> Not very many people realize that there are some services that still run only in us-east-1.

The only ones that you're likely to encounter are IAM, Route53, and the billing console. The billing console outage for a few hours is hardly a problem. IAM and Route53 are statically stable and designed to be mostly stand-alone. They are working fine right now, btw.

During this outage, my infrastructure on AWS is working just fine, simply because it's outside of us-east-1.

Ironically, our observability provider went down.
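AWS uses "static stability" to mean that a system keeps working on its existing state when a dependency, typically a control plane, is impaired. One common client-side analogue is to remember the last known-good answer from a control plane and keep using it when a refresh fails. The sketch below illustrates that idea only; `fetch` is a hypothetical stand-in for any control-plane lookup, not an AWS SDK API.

```python
import time
from typing import Callable, Dict, Optional


class StaticallyStableConfig:
    """Serve the last successfully fetched config when the control plane is unreachable.

    `fetch` is a hypothetical callable standing in for a control-plane call;
    it should return the config or raise on failure.
    """

    def __init__(self, fetch: Callable[[], Dict[str, str]], refresh_seconds: float = 60.0):
        self._fetch = fetch
        self._refresh_seconds = refresh_seconds
        self._cached: Optional[Dict[str, str]] = None
        self._fetched_at = float("-inf")  # forces a fetch on first use

    def get(self) -> Dict[str, str]:
        now = time.monotonic()
        if now - self._fetched_at >= self._refresh_seconds:
            try:
                self._cached = self._fetch()
                self._fetched_at = now
            except Exception:
                # Control plane is down: keep using the last known-good value.
                if self._cached is None:
                    raise  # nothing cached yet, so we genuinely can't proceed
        return self._cached
```

The design choice is that a failed refresh never discards working state; the only hard failure is having never fetched anything at all, which is why statically stable designs tend to pre-provision or pre-fetch what they need before an incident rather than during one.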


> there are some services that still run only in us-east-1.

What are those?



