Their words, not mine; the first header: "It's already on your machine". We can belabor it, but the domain is 'justuse'. No room for 'except' [unless you're reasonable, of course].
The 'egregious' things are charging to share what fits very well in SCM (preventing real automation)... and breaking because of an online-first/only design. It makes sense to require that the endpoint I'm actually talking to is up. Why would Postman need AWS/us-east-1 [0] for a completely unrelated API? Joyful rent-seeking.
cURL, your suggestion (hurl), or HTTPie all make far more sense. Store what they need in the place where you already store stuff. Profit, for free: never go down. Automate/take people out of the button-pushing pattern for a gold star.
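To make that concrete, here's a minimal sketch; the endpoint, path, and API_TOKEN variable are all made up for illustration, and the whole 'collection' is just a script that lives in the repo:

    #!/usr/bin/env bash
    # get-order.sh -- a hypothetical request, kept in git next to everything else
    # Usage: API_TOKEN=... ./get-order.sh 42
    set -euo pipefail

    BASE_URL="${BASE_URL:-https://api.example.com}"   # assumption: your API's base URL
    ORDER_ID="${1:?usage: get-order.sh <order-id>}"

    curl --fail --silent --show-error \
      --header "Authorization: Bearer ${API_TOKEN:?set API_TOKEN}" \
      --header "Accept: application/json" \
      "${BASE_URL}/v1/orders/${ORDER_ID}"

Nothing to install beyond what's already on the machine, and cron/CI can run it without a human pushing the button.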
Ditto, enjoy the catharsis. 'Don't take it personally' is good advice; I'll try to give a less aggressive point of view. All of this has come to mind [but not repeated, out of kindness or laziness, whichever].
So, to start: someone wants me to install Postman/similar and pay real money to share and make a request? Absolutely not. I can read the spec from Swagger, or whatever, too... and write down what was useful [for others]. We all have cURL or some version of Python.
Surely a few phrases of text that are worth planning to save, and paying for [at least twice: you to research them, them to store them], are worth putting into source control. It's free, and even pays dividends. How? Automation that works faster than a human pushing a button. Or creates more buttons!
The wisdom of pipes! I'd share these workflows the exact same way we share the others [i.e.: BASH, Ansible]: Git. It needs nothing more than a directory, though an SSH daemon is quite nice.
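A sketch of the 'nothing more than a directory' bit; the paths, hostname, and repo name are hypothetical:

    # On any box you can SSH into:
    git init --bare /srv/git/api-requests.git

    # On each workstation:
    git clone user@server:/srv/git/api-requests.git
    cd api-requests
    git add get-order.sh
    git commit -m "Add order lookup request"
    git push origin main   # or master, whatever your default branch is

No registry, no SaaS, no account tier. Anyone with SSH access has the 'collection'.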
Those of us who can survive without desperate monetization plays are worth quite a lot, actually. They say 'jury rig', we say 'engineer'.
Call me crazy, because this is, but perhaps it's their "Room 641A". The purpose of a system is what it does; no point arguing 'should' against reality, etc.
They've been charging a premium for, and marketing, "Availability" for decades at this point. I worked for a competitor and made a better product: it could endure any of the zones failing.
It's possible that you really could endure any zone failure. But I take these claims, which people make all the time, with a grain of salt: unless you're working at AWS scale (basically just 3 companies) and have actually run for years and seen every kind of failure mode, a claim of higher availability isn't something that can be accurately evaluated.
(I'm assuming by zone you mean the equivalent of an AWS region, with multiple connected datacenters)
Yes, equivalent. Did endure, repeatedly. Demonstrated to auditors to maintain compliance. They would pick the zone to cut off. We couldn't bias the test. Literal clockwork.
I'll let people guess for the sport of it; here's the hint: there were at least 30 of them, composed of Real Datacenters. Thanks for the doubt, though. Implied or otherwise.
Just letting you know how this response looks to other people: Anon1096 raises legitimate objections, and their post seems very measured in its concerns, not even directly criticizing you. But your response here is very defensive, and a bit snarky. Really, I don't think you even respond directly to their concerns: they say they'd want to see scale equivalent to AWS because that's the best way to see the wide variety of failure modes, but you mostly emphasize the auditors, which is good but not a replacement for the massive real load and the issues that come along with it. It feels miscalibrated to Anon's comment. As a result, I actually trust you less. If you can respond to Anon's comment without being quite as sassy, I think you'd convince more people.
I appreciate the feedback, truly. Defensive and snarky are both fair, though I'm not trying to convince. The business and practices exist, today.
At risk of more snark [well-intentioned]: Clouds aren't the Death Star, they don't have to have an exhaust port. It's fair the first one does... for a while.
Ya, I totally believe that cloud platforms don't need a single point of failure. In fact, seeing the vulnerability makes me excited, because I realize there is _still_ potential for innovation in this area! To be fair it's not my area of expertise, so I'm very unlikely to be involved, but it's still exciting to see more change on the horizon :)
What company did you do it with, can you say? Definitely, they may have been an early mover, but they can (and I'll say will!) still be displaced eventually; that's how business goes.
It's fine if someone guesses the well-known company, but I can't confirm/deny; like privacy a bit too much/post a bit too spicy. This wasn't a darling VC thing, to be fair. Overstated my involvement with 'made' for effect. A lot of us did the building and testing.
Definitely, that makes sense. Ya no worries at all, I think we all know these kinds of things involve 100+ human work-years, so at best we all just have some contribution to them.
> think we all know these kinds of things involve 100+ human work-years
No kidding! The customers differ, business/finance/governments, but the volume [systems/time/effort] was comparable to Amazon's. The people involved in the audits were practically consumed for a whole quarter, if memory serves. Not necessarily by the testing itself: first planning, then sharing the plan, then dreading the plan.
Anyway, I don't miss doing this at all. I didn't mean to imply mitigation is trivial, just feasible :) 'AWS scale' is all the more reason to do business continuity/disaster recovery testing! I guess I find the fact that it's surprising, surprising.
Competitors have an easier time avoiding the creation of a Gordian Knot with their services... when they aren't making a new one every week. There are significant degrees to PaaS, a little focus [not bound to a promotion packet] goes a long way.
Yes, it was something we would do to maintain certain contracts. Sounds crazy, isn't: they used a significant portion of the capacity, anyway. They brought the auditors.
Real People would notice/care, but financially, it didn't matter. Contract said the edge had to be lost for a moment/restored. I've played both Incident Manager and SRE in this routine.
edit: Less often we'd do a more thorough test: power loss/full recovery. We'd disconnect more regularly given the simplicity.
If you go far up enough the pyramid, there is always a single point of failure. Also, it's unlikely that 1) all regions have the same power company, 2) all of them are on the same payment schedule, 3) all of them would actually shut off a major customer at the same time without warning, so, in your specific example, things are probably fine.
No. It’s just that in my entire career when anyone claims that they have the perfect solution to a tough problem, it means either that they are selling something, or that they haven’t done their homework. Sometimes it’s both.
For what's left of your career: sometimes it's neither. You're confused: perfection? Where? A past employer, which I've deliberately not named, is selling something; I've moved on. Their cloud was designed with multiple-zone regions and, importantly, realizes the benefit: it respects the boundaries. Amazon, and you, apparently have not.
Yes, everything has a weakness. Not every weakness is comparable to 'us-east-1'. Ours was billing/IAM. Guess what? They lived in several places with effective and routinely exercised redundancy. No single zone held this much influence. Service? Yes, that's why they span zones.
Said in the absolute kindest way: please fuck off. I have nothing to prove or, worse, sell. The businesses have done enough.
Yea, let's play along. Our CEO is personally choosing not to pay an entire class of partners across the planet. Are we even still in business? I'm much more worried about being paid than about this line of questioning.
A Cloud with multiple regions, or zones for that matter, that depend on one is a poorly designed Cloud; mine didn't, AWS does. So, let's revisit what brought 'whatever1', here:
> Your experiment proves nothing. Anyone can pull it off.
Fine, our overseas offices are different companies and bills are paid for by different people.
Not that "forgot to pay" is going to result in a cut off - that doesn't happen with the multi-megawatt supplies from multiple suppliers that go into a dedicated data centre. It's far more likely that the receivers will have taken over and will pay the bill by that point.
Was that competitor priced competitively with AWS? I think of the project management triangle here - good, fast, or cheap - pick two. AWS would be fast and cheap.
Yes, good point. Pricing is a bit higher. As another reply pointed out, there are ~three that work at the same scale. This was one of them; another hint, I guess: it's mostly B2B. Normal people don't typically go there.
Azure, from my experience with it, has stuff go down a lot and degrade even more often, and seems to either not admit the degradation happened or rely on 1000 pages of fine-print SLA docs to prove you don't get any credits for it. I suppose that isn't the same as "lose a region" resiliency, so it could still be them, given the poster said it is B2B-focused and Azure is subject to a lot of exercises like this from its huge enterprise customers. FWIW I worked as an IaC/devops engineer with the largest tenant in one of the non-public Azure clouds.
My $3/mo AWS instance is far cheaper than any DIY solution I could come up with, especially when I have to buy the hardware and supply the power/network/storage/physical space. Not to mention it's not worth my time to DIY something like that in the first place.
False equivalence/moving goalposts IMO... I was only refuting your claim of "AWS is not cheap", as if it's somehow impossible for it to be cheap... which I'm saying isn't the case.
Sorry to jump in y'alls convo :) AWS is cheaper than the Cloud we built... I just don't think it's significant. Ours cost more because businesses/governments would pay it, not because it was optimal.
Price is beside my original point: Amazon has enjoyed decades for arbitrage. This sounds more accusatory than intended: the 'us-east-1' problem exists because it's allowed/chosen. Created in 2006!
Now, to retract that a bit: I could see technical debt/culture making this state of affairs practical, if not inevitable. Correct? No; if I were Papa Bezos I'd be incredibly upset that my Supercomputer is so hamstrung. I think even the warehouses were impacted!
The real differentiator was policy/procedure. Nobody was allowed to create a service or integration with this kind of blast radius. Design principles, to say the least. Fault zones and availability zones exist for a reason beyond capacity, after all.
Right, like I said: crazy. Anything production with certain other clouds must be multi-AZ. Both reinforced by culture and technical constraints. Sometimes BCDR/contract audits [zones chosen by a third party at random].
The disconnect case was simple: breakage was as expected. The island was lost until we drew it on the map again. Things got really interesting when it was a full power-down and back on.
Were the docs/tooling up to date? Tough bet. Much easier to fix BGP or whatever.
Their auction systems are interesting to dig through, but to your point, everything fails. Especially these older auction machines. Great price/service, though: less than an hour for more than one ad-hoc RAID card replacement.
Yeah, I really want one of their dedicated servers, but it's a bit too expensive for what I use it for. Plus, my server is too much of a pet, so I'm spoiled on the automatic full-machine backups.
Pulp is a popular project for a 'one stop shop', I believe. Personally, I've always used project-specific solutions, like the CNCF's 'distribution/distribution' for containers. That allows for pull-through caching with relatively little setup work.
Pull-through caches are still useful even when the upstream is down... assuming the image(s) were pulled recently. The HEAD to upstream will obviously fail [when checking currency], but the software is happy to serve what it has already pulled.
Depends on the implementation, of course: I'm speaking to 'distribution/distribution', the reference. Harbor or whatever else may behave differently, I have no idea.
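For what it's worth, a rough sketch of that pull-through setup with the reference registry, assuming Docker Hub as the upstream; the port, paths, and container name are illustrative:

    # Minimal config for the reference registry acting as a pull-through cache
    cat > config.yml <<'EOF'
    version: 0.1
    storage:
      filesystem:
        rootdirectory: /var/lib/registry
    http:
      addr: :5000
    proxy:
      remoteurl: https://registry-1.docker.io
    EOF

    # registry:2 is the image built from distribution/distribution
    docker run -d --name hub-mirror -p 5000:5000 \
      -v "$PWD/config.yml:/etc/docker/registry/config.yml" \
      -v registry-cache:/var/lib/registry \
      registry:2

    # Then point clients at it, e.g. in /etc/docker/daemon.json:
    #   { "registry-mirrors": ["http://localhost:5000"] }

The proxy.remoteurl bit is what makes it a pull-through cache: anything pulled once is served from the local store afterwards, which is exactly the 'upstream is down but I already have it' case above.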
0: https://news.ycombinator.com/item?id=45645172