
US-East-1 is more than just a normal region. It also provides the backbone for other services, including those in other regions. Thus simply being in another region doesn’t protect you from the consistent us-east-1 shenanigans.

AWS doesn’t talk about that much publicly, but if you press them they will admit in private that there are some pretty nasty single points of failure in the design of AWS that can materialize if us-east-1 has an issue. Most people would say that means AWS isn’t truly multi-region in some areas.

Not entirely clear yet if those single points of failure were at play here, but risk mitigation isn’t as simple as just “don’t use us-east-1” or “deploy in multiple regions with load balancing failover.”



Call me crazy, because this is crazy, but perhaps it's their "Room 641a". The purpose of a system is what it does; no point arguing 'should' against reality, etc.

They've been charging a premium for, and marketing, "Availability" for decades at this point. I worked for a competitor and made a better product: it could endure any of the zones failing.


> perhaps it's their "Room 641a".

For the uninitiated: https://en.wikipedia.org/wiki/Room_641A


It's possible that you really could endure any zone failure. But I take these claims people make all the time with a grain of salt: unless you're working at AWS scale (basically just 3 companies) and have actually run for years and seen every kind of failure mode, a claim of higher availability isn't something that can be accurately evaluated.

(I'm assuming by zone you mean the equivalent of an AWS region, with multiple connected datacenters)


Yes, equivalent. Did endure, repeatedly. Demonstrated to auditors to maintain compliance. They would pick the zone to cut off. We couldn't bias the test. Literal clockwork.

I'll let people guess for the sport of it, here's the hint: there were at least 30 of them comprised of Real Datacenters. Thanks for the doubt, though. Implied or otherwise.


Just letting you know how this response looks to other people -- Anon1096 raises legitimate objections, and their post seems very measured in their concerns, not even directly criticizing you. But your response here is very defensive, and a bit snarky. Really I don't think you even respond directly to their concerns: they say they'd want to see scale equivalent to AWS because that's the best way to see the wide variety of failure modes, but you mostly emphasize the auditors, which is good but not a replacement for the massive real load and the issues that come along with it. It feels miscalibrated to Anon's comment. As a result, I actually trust you less. If you could respond to Anon's comment without being quite as sassy, I think you'd convince more people.


I appreciate the feedback, truly. Defensive and snarky are both fair, though I'm not trying to convince. The business and practices exist, today.

At risk of more snark [well-intentioned]: Clouds aren't the Death Star, they don't have to have an exhaust port. It's fair the first one does... for a while.


Ya, I totally believe that cloud platforms don't need a single point of failure. In fact, seeing the vulnerability makes me excited, because I realize there is _still_ potential for innovation in this area! To be fair it's not my area of expertise, so I'm very unlikely to be involved, but it's still exciting to see more change on the horizon :)


Others have raised good points, like: they've already won, why bother? We did it because we weren't first!


What company did you do it with, can you say? Definitely, they may have been an early mover, but they can (and I'll say will!) still be displaced eventually, that's how business goes.


It's fine if someone guesses the well-known company, but I can't confirm/deny; like privacy a bit too much/post a bit too spicy. This wasn't a darling VC thing, to be fair. Overstated my involvement with 'made' for effect. A lot of us did the building and testing.


Definitely, that makes sense. Ya no worries at all, I think we all know these kinds of things involve 100+ human work-years, so at best we all just have some contribution to them.


> think we all know these kinds of things involve 100+ human work-years

No kidding! The customers differ, business/finance/governments, but the volume [systems/time/effort] was comparable to Amazon. The people involved in audits were consumed practically for a whole quarter, if memory serves. Not necessarily for testing itself: first, planning, sharing the plan, then dreading the plan.

Anyway, I don't miss doing this at all. Didn't mean to imply mitigation is trivial, just feasible :) 'AWS scale' is all the more reason to do business continuity/disaster recovery testing! I guess I find it being surprising, surprising.

Competitors have an easier time avoiding the creation of a Gordian Knot with their services... when they aren't making a new one every week. There are significant degrees to PaaS, a little focus [not bound to a promotion packet] goes a long way.


You were in a position to actually cut off production zones with live traffic at Amazon scale and test the recovery?


Yes, it was something we would do to maintain certain contracts. Sounds crazy, isn't: they used a significant portion of the capacity, anyway. They brought the auditors.

Real People would notice/care, but financially, it didn't matter. Contract said the edge had to be lost for a moment/restored. I've played both Incident Manager and SRE in this routine.

edit: Less often we'd do a more thorough test: power loss/full recovery. We'd disconnect more regularly given the simplicity.


There are shared resources in different regions. Electricity. Cables. Common systems for coordination.

Your experiment proves nothing. Anyone can pull it off.


The sites were chosen specifically to be more than 50 miles apart, it proved plenty.


I am the CEO of your company. I forgot to pay the electricity bill. How is the multi-region resilience going?


If you go far up enough the pyramid, there is always a single point of failure. Also, it's unlikely that 1) all regions have the same power company, 2) all of them are on the same payment schedule, 3) all of them would actually shut off a major customer at the same time without warning, so, in your specific example, things are probably fine.


I suspect 'whatever1' can't be satisfied, there are no silver bullets. There's always a bigger fish/thing to fail.

The goal posts were fine: bomb the AZ of your choice, I don't care. The Cloud [that isn't AWS, in the case of 'us-east-1'] will still work.


No. It’s just that in my entire career when anyone claims that they have the perfect solution to a tough problem, it means either that they are selling something, or that they haven’t done their homework. Sometimes it’s both.


For what's left of your career: sometimes it's neither. You're confused, perfection? Where? A past employer, who I've deliberately not named, is selling something: I've moved on. Their cloud was designed with multiple-zone regions, and importantly, realizes the benefit: respects the boundaries. Amazon, and you, apparently have not.

Yes, everything has a weakness. Not every weakness is comparable to 'us-east-1'. Ours was billing/IAM. Guess what? They lived in several places with effective and routinely exercised redundancy. No single zone held this much influence. Service? Yes, that's why they span zones.

Said in the absolute kindest way: please fuck off. I have nothing to prove or, worse, sell. The businesses have done enough.


This is not what the resilience expert stated.


If your accounts payable can’t pay the electric bill on time, you’ve got bigger problems.


Yea, let's play along. Our CEO is personally choosing to not pay any entire class of partners across the planet. Are we even still in business? I'm so much more worried about being paid than this line of questioning.

A Cloud with multiple regions, or zones for that matter, that depend on one is a poorly designed Cloud; mine didn't, AWS does. So, let's revisit what brought 'whatever1', here:

> Your experiment proves nothing. Anyone can pull it off.

Amazon didn't, we did. Hmm.


Fine, our overseas offices are different companies and bills are paid for by different people.

Not that "forgot to pay" is going to result in a cut off - that doesn't happen with the multi-megawatt supplies from multiple suppliers that go into a dedicated data centre. It's far more likely that the receivers will have taken over and will pay the bill by that point.


Fine, the tab increments. Get back to hyping or something, this is not your job.


I doubt it should be yours if this is how you think about resilience.


Your vote has been tallied


Same failure mode of anything else.

How’s not paying your AWS bill going for you?


if the ceo of your company is personally paying the electric bill, go work for another company :)


Interesting. Langley isn’t that far away


Was that competitor priced competitively with AWS? I think of the project management triangle here - good, fast, or cheap - pick two. AWS would be fast and cheap.


Yes, good point. Pricing is a bit higher. As another reply pointed out: there are ~three that work at the same scale. This was one; another hint, I guess: it's mostly B2B. Normal people don't typically go there.


I'm guessing Azure which may technically have greater resilience but has dogshit support and UX.


Azure, from my experience with it, has stuff go down a lot and degrades even more. It seems to either not admit the degradation happened or rely on 1000 pages of fine-print SLA docs to prove you don't get any credits for it. I suppose that isn't the same as "lose a region" resiliency, so it could still be them, given the poster said it is B2B focused and Azure is subject to a lot of exercises like this from its huge enterprise customers. FWIW I worked as an IaC / devops engineer with the largest tenant in one of the non-public Azure clouds.


AWS is not cheap. AWS is one to two orders of magnitude more expensive than DIY.


My $3/mo AWS instance is far cheaper than any DIY solution I could come up with, especially when I have to buy the hardware and supply the power/network/storage/physical space. Not to mention it's not worth my time to DIY something like that in the first place.

There can be other valid usecases than your own.


Small things are cheap, yes, news at 11. But did you compare what your $3-$5 gets at Amazon vs a more traditional provider?


False equivalence/moving goalposts IMO... I was only refuting your claim of "AWS is not cheap", as if it's somehow impossible for it to be cheap... which I'm saying isn't the case.


Sorry to jump in y'alls convo :) AWS is cheaper than the Cloud we built... I just don't think it's significant. Ours cost more because businesses/governments would pay it, not because it was optimal.

Price is beside my original point: Amazon has enjoyed decades for arbitrage. This sounds more accusatory than intended: the 'us-east-1' problem exists because it's allowed/chosen. Created in 2006!

Now, to retract that a bit: I could see technical debt/culture making this state of affairs practical, if not inevitable. Correct? No, if I was Papa Bezos I'd be incredibly upset my Supercomputer is so hamstrung. I think even the warehouses were impacted!

The real differentiator was policy/procedure. Nobody was allowed to create a service or integration with this kind of blast area. Design principles, to say the least. Fault zones and availability zones exist for a reason beyond capacity, after all.


It's really not that nefarious.

IAD datacenters have forever been the place where Amazon software developers implement services first (well before AWS was a thing).

Multi-AZ support often comes second (more than you think; Amazon is a pragmatic company), and not every service is easy to make TRULY multi-AZ.

And then other services depend on those services, and may also fall into the same trap.

...and so much of the tech/architectural debt gets concentrated into a single region.


Right, like I said: crazy. Anything production with certain other clouds must be multi-AZ. Both reinforced by culture and technical constraints. Sometimes BCDR/contract audits [zones chosen by a third party at random].


It sure is a blast when they decide to cut off (or simulate the loss of) a whole DC just to see what breaks, I bet :)


The disconnect case was simple: breakage was as expected. The island was lost until we drew it on the map again. Things got really interesting when it was a full power-down and back on.

Were the docs/tooling up to date? Tough bet. Much easier to fix BGP or whatever.


This set of facts comes to light every 3-5 years when US-East-1 has another failure. Clearly they could have architected their way out of this blast radius problem by now, but they do not.

Why would they keep a large set of centralized, core traffic services in Virginia for decades despite it being a bad design?


It’s probably because there is a lot of tech debt, plus look at where it is - Virginia. It shouldn’t take much imagination to figure out why that is strategic.


They could put a failover site in Colorado or Seattle or Atlanta, handling just their infrastructure. It's not like the NSA wouldn't be able to backhaul from those places.


You mean the surveillance angle as reason for it being in Virginia?


AWS _had_ architected away from single-region failure modes. There are only a few services that are us-east-1 only in AWS (IAM and Route53, mostly), and even they are designed with static stability so that their control plane failure doesn't take down systems.

It's the rest of the world that has not. For a long time companies just ran everything in us-east-1 (e.g. Heroku), without even having an option to switch to another region.


So the control plane for DNS and the identity management system are tied to us-east-1 and we’re supposed to think that’s OK? Those seem like exactly the sorts of things that should NOT be reliant on only one region.


It's worse than that. The entire DNS ultimately depends on literally one box with the signing key for the root zone.

You eventually get services that need to be global. IAM and DNS are such examples, they have to have a global endpoint because they apply to the global entities. AWS users are not regionalized, an AWS user can use the same key/role to access resources in multiple regions.
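To make that concrete, here's a minimal sketch (assuming boto3; the region name is just an example): the IAM API only exposes the single global endpoint, while STS can at least be pinned to a regional endpoint so runtime credential calls don't have to route through the global one.

    import boto3

    # IAM only has one global endpoint (iam.amazonaws.com), so user/role
    # *management* is inherently a global, non-regionalized service.
    iam = boto3.client("iam")

    # STS does offer regional endpoints; pinning one keeps the runtime
    # credential path inside the region you choose.
    sts = boto3.client(
        "sts",
        region_name="eu-west-1",
        endpoint_url="https://sts.eu-west-1.amazonaws.com",
    )
    print(sts.get_caller_identity()["Account"])

The credentials themselves are still global objects, which is exactly the point: the entities exist once, for every region.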


not quite true - there are some regions that have a different set of AWS users / credentials. I can't remember what this is called off the top of my head.


These are different AWS partitions. They are completely separate from each other, requiring separate accounts and credentials.

There's one for China, one for the AWS government cloud, and there are also various private clouds (like the one hosting the CIA data). You can check their list in the JSON metadata that is used to build the AWS clients (e.g. https://github.com/aws/aws-sdk-go-v2/blob/1a7301b01cbf7e74e4... ).
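If you just want to see the list locally, the same metadata ships with the Python SDK too; a rough sketch (assuming a current botocore that still bundles endpoints.json):

    import botocore.loaders

    # Load the bundled endpoint/partition metadata and list the partitions.
    endpoints = botocore.loaders.create_loader().load_data("endpoints")
    for p in endpoints["partitions"]:
        print(p["partition"], "-", p.get("partitionName", ""))
    # Typically: aws, aws-cn, aws-us-gov, plus the isolated (aws-iso*) ones.

The partition also shows up as the second field of every ARN (arn:aws:..., arn:aws-cn:..., arn:aws-us-gov:...), which is the quickest tell of which one you're in.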


The parent seems to be implying there is something in us-east-1 that could take down all the various regions?


What is the motivation of an effective Monopoly to do anything?

I mean look at their console. Their console application is pretty subpar.


My contention for a long time has been that cloud is full of single points of failure (and nightmarish security hazards) that are just hidden from the customer.

"We can't run things on just a box! That's a single point of failure. We're moving to cloud!"

The difference is that when the cloud goes down you can shift the blame to them, not you, and fixing it is their problem.

The corporate world is full of stuff like this. A huge role of consultants like McKinsey is to provide complicated reports and presentations backing the ideas that the CEO or other board members want to pursue. That way if things don't work out they can blame McKinsey.


You act as if that is a bug, not a feature. As hypothetically someone who is responsible for my site staying up, I would much rather blame AWS than myself. Besides, none of your customers are going to blame you if every other major site is down.


> As hypothetically someone who is responsible for my site staying up, I would much rather blame AWS than myself.

That's a very human sentiment, and I share it. That's why I don't swap my car wheels myself, I don't want to feel responsible if one comes loose on the highway and I cause an accident.

But at the same time it's also appalling how low the bar has gotten. We're still the ones deciding that one cloud is enough. The downtime being "their" fault really shouldn't excuse that fact. Most services aren't important enough to need redundancy. But if yours is, and it goes down because you decided that one provider is enough, then your provider isn't solely at fault here, and as a profession I wish we'd take more accountability.


How many businesses can’t afford to suffer any downtime though?

But I’ve led enough cloud implementations where I discuss the cost and complexity between multi-AZ (it’s almost free, so why not), multi-region, and theoretically multi-cloud (never came up in my experience), and then cold, warm, and hot standby, RTO and RPO, etc.

And for the most part, most businesses are fine with just multi-AZ as long as their data can survive catastrophe.
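For the multi-region option, the piece I usually sketch first is DNS failover to a warm standby; roughly (boto3, with hypothetical zone IDs and hostnames):

    import boto3

    route53 = boto3.client("route53")

    # Primary in one region, standby in another; Route53's data plane keeps
    # answering and failing over even if its control plane has a bad day.
    route53.change_resource_record_sets(
        HostedZoneId="Z_EXAMPLE_ZONE",  # hypothetical
        ChangeBatch={"Changes": [
            {"Action": "UPSERT", "ResourceRecordSet": {
                "Name": "app.example.com", "Type": "A",
                "SetIdentifier": "primary", "Failover": "PRIMARY",
                "AliasTarget": {
                    "HostedZoneId": "Z_ALB_PRIMARY",  # hypothetical
                    "DNSName": "primary-alb.us-east-2.elb.amazonaws.com",
                    "EvaluateTargetHealth": True}}},
            {"Action": "UPSERT", "ResourceRecordSet": {
                "Name": "app.example.com", "Type": "A",
                "SetIdentifier": "standby", "Failover": "SECONDARY",
                "AliasTarget": {
                    "HostedZoneId": "Z_ALB_STANDBY",  # hypothetical
                    "DNSName": "standby-alb.eu-west-1.elb.amazonaws.com",
                    "EvaluateTargetHealth": True}}},
        ]},
    )

The data replication side (RPO) is the part that actually costs money; the DNS bit is the cheap part.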


As someone who hypothetically runs a critical service, I would rather my service be up than down.


And you have never had downtime? If your data center went down - then what?


I'm saying the importance is on uptime, not on who to blame, when services are critical.

You don't have one data center with critical services. You know lots of companies are still not in the cloud, and they manage their own datacenters, and they have 2-3 of them. There are cost, support, availability and regulatory reasons not to be in the cloud for many parties.


Or it is a matter of efficiency. If 1 million companies design and maintain their servers, there would be 1 million (or more) incidents like these. Same issues. Same fixes. Not so efficient.


It might be worse in terms of total downtime, but it likely would be much less noticeable as it would be scattered individual outages, not everyone at the same time.


Total downtime would likely be the same or more.


Been a while since I last suffered from AWS's arbitrary complexity, but afaik you can only associate certificates with CloudFront if they are issued in us-east-1, so it's undoubtedly a single point of failure for the whole CDN if this is still the case.
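To be concrete, the workaround is just to issue that one cert in us-east-1, no matter where everything else runs; roughly (boto3, hypothetical domain):

    import boto3

    # CloudFront only accepts ACM certificates issued in us-east-1,
    # regardless of which region the rest of the stack lives in.
    acm = boto3.client("acm", region_name="us-east-1")
    resp = acm.request_certificate(
        DomainName="cdn.example.com",  # hypothetical
        ValidationMethod="DNS",
    )
    print(resp["CertificateArn"])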


I worked at AMZN for a bit and the complexity is not exactly arbitrary; it's political. Engineers and managers are highly incentivized to make technical decisions based on how they affect inter-team dependencies and the related corporate dynamics. It's all about review time.


I have seen one promo docket get rejected for doing work that is not complex enough... I thought the problem was challenging, and the simple solution brilliant, but the tech assessor disagreed. I mean once you see there is a simple solution to a problem, it looks like the problem is simple...


I had a job interview like this recently: "what's the most technically complex problem you've ever worked on?"

The stuff I'm proudest of solved a problem and made money but it wasn't complicated for the sake of being complicated. It's like asking a mechanical engineer "what's the thing you've designed with the most parts"


I think this could still be a very useful question for an interviewer. If I were hiring for a position working on a complex system, I would want to know what level of complexity a prospect was comfortable dealing with.


I was once very unpopular with a team of developers when I pointed out a complete solution to what they had decided was an "interesting" problem - my solution didn't involve any code being written.


I suppose it depends on what you are interviewing for but questions like that I assume are asked more to see how you answer than the specifics of what you say.

Most web jobs are not technically complex. They use standard software stacks in standard ways. If they didn't, average developers (or LLMs) would not be able to write code for them.


Yeah, I think this. I've asked this in interviews before, and it's less about who has done the most complicated thing and more about the candidate's ability to a) identify complexity, and b) avoid unnecessary complexity.

I.e. a complicated but required system is fine (I had to implement a consensus algorithm for a good reason).

A complicated but unrequired system is bad (I built a docs platform for us that requires a 30-step build process, but yeah, MkDocs would do the same thing).

I really like it when people can pick out hidden complexity, though. "DNS" or "network routing" or "Kubernetes" or etc are great answers to me, assuming they've done something meaningful with them. The value is self-evident, and they're almost certainly more complex than anything most of us have worked on. I think there's a lot of value to being able to pick out that a task was simple because of leveraging something complex.


That's what arbitrary means to me, but sure, I see no problem calling it political too


Forced attrition rears its head again


>US-East-1 is more than just a normal region. It also provides the backbone for other services, including those in other regions

I thought that if us-east-1 goes down you might not be able to administer (or bring up new services) in other zones, but if you have services running that can take over from us-east-1, you can maintain your app/website etc.

I haven’t had to do this for several years but that was my experience a few years ago on an outage - obviously it depends on the services you’re using.

You can’t start cloning things to other zones after us-east-1 is down - you’ve left it too late


It depends on the outage. There was one a year or two ago (I think? They run together) that impacted EC2 such that as long as you weren’t trying to scale, or issue any commands, your service would continue to operate. The EKS clusters at my job at the time kept chugging along, but had Karpenter tried to schedule more nodes, we’d have had a bad time.


Static stability is a very valuable infra attribute. You should definitely consider how statically stable your services are in architecting them
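A toy sketch of the idea (boto3; the SSM parameter name is hypothetical): keep the control plane off the hot path, and if it's unreachable, serve the last value you successfully fetched rather than failing.

    import boto3
    from botocore.exceptions import ClientError, EndpointConnectionError

    ssm = boto3.client("ssm", region_name="eu-west-1")
    _last_known = {}  # last successfully fetched values

    def get_param(name):
        # Try the control plane, but never let its outage break the data
        # path: fall back to the last known value ("stale beats down").
        try:
            value = ssm.get_parameter(Name=name)["Parameter"]["Value"]
            _last_known[name] = value
            return value
        except (ClientError, EndpointConnectionError):
            if name in _last_known:
                return _last_known[name]
            raise

    feature_flag = get_param("/app/feature-flag")  # hypothetical name

Same principle as the EC2 incident above: running instances kept running because nothing on their hot path needed the control plane.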


Meanwhile, AWS has always marketed itself as "elastic". Not being able to start new VMs in the morning to handle the daytime load will wreck many sites.


Well that sounds like exactly the sort of thing that shouldn’t happen when there’s an issue given the usual response is to spin things up elsewhere, especially on lower priority services where instant failover isn’t needed.


Yeah because Amazon engineers are hypocrites. They want you to spend extra money for region failover and multi-az deploys but they don't do it themselves.


That's a good point, but I'd just s/Amazon engineers/AWS leadership/, as I'm pretty sure there are a few layers of management between the engineers on the ground at AWS, those who deprioritise any longer-term resilience work needed (which is a very strategic decision), and those who are in charge of external comms/education about best practices for AWS customers.


Luckily, those people are the ones that will be getting all the phone calls from angry customers here. If you're selling resilience and selling twice the service (so your company can still run if one location fails), and it still failed, well... phones will be ringing.


They absolutely do do it themselves..


What do you mean? Obviously, as TFA shows and as others here pointed-out, AWS relies globally on services that are fully-dependent on us-east-1, so they aren't fully multi-region.


The claim was that they're total hypocrites and aren't multi-region at all. That's totally false; the amount of redundancy in AWS is staggering. But there are foundational parts which, I guess, have been too difficult to do that for (or perhaps they are redundant but the redundancy failed in this case? I dunno).


There's multiple single points of failure for their entire cloud in us-east-1.

I think it's hypocritical for them to push customers to double or triple their spend in AWS when they themselves have single points of failure on a single region.


That's absurd. It's hypocritical to describe best practices as best practices because you haven't perfectly implemented them? Either they're best practice or they aren't. The customers have the option of risking non-redundancy also, you know.


Yes, it's hypocritical to push customers to pay you more money for uptime best practices when you don't follow them yourself, and when your choice not to follow them means the best practices you charged your customers for don't fully work.

Hey! Pay us more money so when us-east-1 goes down you're not down (actually you'll still go down, because us-east-1 is a single point of failure even for our other regions).


They can't even bother to enable billing services in GovCloud regions.


Amazon are planning to launch the EU Sovereign Cloud by the end of the year. They claim it will be completely independent. It may be possible then to have genuine resiliency on AWS. We'll see.


This is the difference between “partitions” and “regions”. Partitions have fully separate IAM, DNS names, etc. This is how there are things like US Gov Cloud, the Chinese AWS cloud, and now the EU sovereign cloud


Yes, although unfortunately it’s not how AWS sold regions to customers. AWS folks consistently told customers that regions were independent and customers architected on that belief.

It was only when stuff started breaking that all this crap about “well actually stuff still relies on us-east-1” starts coming out.


Yes - they told me quite specifically that until they launch their sovereign cloud, the mothership will be around.


Which are lies btw - Amazon has admitted the "EU sovereign cloud" is still susceptible to US government whims.


Then it will be eu-east-1 taking down the EU


gov, iso*, cn are also already separate (unless you need to mess with your bill, or certain kinds of support tickets)


There are hints at this in their documentation. For example, ACM certs for CloudFront and KMS keys for Route53 DNSSEC have to be in the us-east-1 region.


FWIW, I tried creating a DNSSEC entry for one of my domains during the outage, and it worked just fine.


However these services don't need high write uptime.


It also doesn't help that most companies using AWS aren't remotely close to multi-region support, and that us-east-1 is likely the most populated region.


> Thus simply being in another region doesn’t protect you from the consistent us-east-1 shenanigans.

Well, it did for me today... I don't use us-east-1 explicitly, just other regions, and I had no outage today... (I get the point about the skeletons in the closet of us-east-1... maybe the power plug goes via Bezos' wood desk?)


Even if us-east-1 were a normal region, there is not enough spare capacity in other regions to take up all the workloads from us-east-1, so it's a moot point.


The Internet was supposed to be a communication network that would survive even if the East Coast was nuked.

What it turned into was Daedalus from Deus Ex lol.


It sounds like they want to avoid split-brain scenarios as much as possible while sacrificing resilience. For things like DNS, this is probably unavoidable. So, not all the responsibility can be placed on AWS. If my application relies on receipts (such as an airline ticket), I should make sure I have an offline version stored on my phone so that I can still check in for my flight. But I can accept not to be able to access Reddit or order at McDonalds with my phone. And always having cash at hand is a given, although I almost always pay with my phone nowadays.

I hope they release a good root cause analysis report.


It's not unavoidable for DNS. DNS is inherently eventually consistent anyway, due to time-based caching.
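E.g. (a tiny sketch, assuming dnspython): every answer carries a TTL, and resolvers are allowed to keep serving the cached answer until it expires, so consumers of DNS already tolerate propagation delay by design.

    import dns.resolver

    answer = dns.resolver.resolve("example.com", "A")
    # Caches may serve this answer until the TTL runs out, so a change
    # made "right now" was never going to be visible instantly anyway.
    print(answer.rrset.ttl, [r.address for r in answer])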


Sure, but you want to make sure that changes propagate as soon as possible from the central authority. And for AWS, the control plane for that authority happens to be placed in US-EAST-1. Maybe Blockchain technology can decentralize the control plane?


Or Paxos or Raft...



