It’s that time of the year again when we all realize that relying on AWS and Cloudflare to this degree is pretty dangerous, but then again, it’s difficult to switch at this point.
If there is a slight positive note to all this, then it is that these outages are so large that customers usually seem to be quite understanding.
Unless you’re, say, at the airport trying to file a luggage claim … or at the pharmacy trying to get your prescription. I think as a community we have a responsibility to do better than this.
I always see such negative responses when HN brings up software bloat ("why is your static site measured in megabytes").
Now that we have an abundance of compute and most people run devices more powerful than the computers that put man on the moon, it's easier than ever to ship bloated apps, especially when using a framework like Electron or React Native.
People take it personally when you say they write poor quality software, but it's not a personal attack, it's an observation of modern software practices.
And I'm guilty of this, mainly because I work for companies that prioritize speed of development over quality of software, and I suspect most developers are in this trap.
I think we have a new normal now though. Most web devs starting now don't know a world without React/Vue/Solid/whatever. Like, sure you can roll your own HTML site with JS for interactivity, but employers now don't seem to care about that; if you don't know React then don't bother.
You aren’t Cloudflare’s customer in these examples. It's up to the companies that actually pay for and use the service to complain. Odds are that they won’t care on your behalf due to how our society is structured.
Not really sure how our community is supposed to deal with this.
“We” are the ones making the architecture and the technical specs of these services. Making sure they still work when your favourite FAANGMC is down seems like something we can help with.
> If there is a slight positive note to all this, then it is that these outages are so large that customers usually seem to be quite understanding.
Which only shows that chasing five 9s is worthless for almost all web products. The idea is that by relying on AWS or Cloudflare you can push your uptime numbers up to that standard, but these companies are having such frequent outages themselves that customers don't expect that kind of reliability from web products.
If I choose AWS/Cloudflare and we're down with half of the internet, then I don't even need to explain it to my boss' bosses, because there will be an article in the mainstream media.
If I choose something else, we're down, and our competitors aren't, then my overlords will start asking a lot of questions.
Yup. AWS went down at a previous job and everyone basically took the day off and the company collectively chuckled. Cloudflare is interesting because most execs don’t know about it, so I’d imagine they’d be less forgiving. “So what does Cloudflare do for us exactly? Don’t we already have AWS?”
Or _you_ aren't down, but a third-party you depend on is (auth0, payment gateway, what have you), and you invested a lot of time and effort into being reliable, but it was all for less than nothing, because your website loads but customers can't purchase, and they associate the problem with you, not with the AWS outage.
In reality it is not half of the internet. That is just marketing. I personally noticed only one news site down while the others were working. And I guess sites like that will get the blame.
Happy to hear anyone's suggestions about where else to go or what else to do to protect against large-scale volumetric DDoS attacks. Pretty much every CDN provider nowadays has stacked up enough capacity to tank these kinds of attacks; good luck trying to combat them yourself these days.
Somehow KiwiFarms figured it out with their own "KiwiFlare" DDoS mitigation. Unfortunately, all of the other Cloudflare-like services seem exceptionally shady, will be less reliable than Cloudflare, and probably share data with foreign intelligence services I trust even less than the ones Cloudflare possibly shares data with.
Unfortunately Anubis doesn't help when my pipe to the internet isn't fat enough to absorb all the bandwidth the attacker has available. Renting tens of terabits of capacity isn't cheap, and DDoS attacks nowadays are on that scale. BunnyCDN's DDoS protection is unfortunately too basic to filter out anything that's even slightly more sophisticated. Cloudflare's flexibility in terms of custom rulesets and their global pre-trained rulesets (based on attacks they've seen in the past) is imo just unbeatable at this time.
The Bunny Shield is quite similar to the Cloudflare setup. Maybe not 100% overlap of features but unless you’re Twitter or Facebook, it’s probably enough.
I think at the very least, one should plan for the ability to switch to an alternative when your main choice fails… which, together with AWS and GitHub, is a weekly event now.
Why do people on a technical website suggest this? It's literally the same snake oil as Cloudflare. Both have an endgame of total web DRM; they want to make sure users "aren't bots". Each time the DRM is cracked, they will increase the complexity of the "verifier". You will be running arbitrary code in your big 4 browser to ensure you're running a certified big 4 browser, with 10 trillion man hours of development, on a certified OS.
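A rough sketch of what "plan for the ability to switch" could look like in code, in Python. Everything here is hypothetical: the provider names are made up, and `is_up` stands in for whatever health probe you would really use (an HTTP check, a DNS lookup, etc.).

```python
from typing import Callable, Optional, Sequence


def pick_provider(
    providers: Sequence[str],
    is_up: Callable[[str], bool],
) -> Optional[str]:
    """Return the first provider, in priority order, whose probe passes."""
    for name in providers:
        try:
            if is_up(name):
                return name
        except Exception:
            # A probe that raises (timeout, DNS error, ...) counts as down.
            continue
    return None  # Everything is down; the caller decides what to do next.


if __name__ == "__main__":
    # Hypothetical provider names and a fake probe where the primary is down.
    order = ["primary-cdn", "backup-cdn", "origin-direct"]
    print(pick_provider(order, lambda name: name != "primary-cdn"))
```

The hard part in practice is not this loop but keeping the backup path configured and tested before you need it.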
And if you do rule based blocking they just change their approach. I am constantly blocking big corps these days, barely any work with normal bad actors.
What do they even have a spider for? I never saw any actual traffic with source Facebook. I don't understand either, but it's their official IPs, their official bot headers and it behaves exactly like someone who wants my sites down.
Does it make sense? Nah. But is it part of the weird reality we live in? Looks like it.
I have no way of contacting Facebook. All I can do is keep complaining on Hacker News whenever the topic arises.
Edit: Oh, and I see the same with Azure; however, there I have no list of IPs to verify that it's official rather than just looking like it.
Five 9s is about 5 minutes of downtime a year. They are breaking SLAs and impacting services people depend on.
Tbh though this is sort of all the other companies' fault: "everyone" uses AWS and CF, and so others follow. Now not only are all your eggs in one basket, so are everyone else's. When the basket inevitably falls into a lake....
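For reference, the downtime budget for a given number of nines is simple arithmetic. A minimal Python sketch (the function names are mine, and it assumes a 365-day year):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def downtime_budget_minutes(nines: int) -> float:
    """Allowed downtime per year, in minutes, for `nines` nines of availability."""
    unavailability = 10 ** (-nines)  # e.g. 5 nines -> 0.00001
    return MINUTES_PER_YEAR * unavailability


def availability_percent(downtime_minutes: float) -> float:
    """Yearly availability, in percent, given total downtime in minutes."""
    return 100.0 * (1.0 - downtime_minutes / MINUTES_PER_YEAR)


if __name__ == "__main__":
    for n in range(2, 6):
        print(f"{n} nines: {downtime_budget_minutes(n):.2f} min/year")
    print(f"30 min of downtime: {availability_percent(30):.4f}%")
```

Five nines comes out to roughly 5.26 minutes a year, so a single 30-minute incident already uses up several years' worth of that budget.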
Providers need to be more aware of their global impact in outages, and customers need to be more diverse in their spread.
These kinds of outages continue to happen and continue to impact 50+% of the internet. Yes, they know they have that power, but they don't treat changes as such, so no, they aren't aware. Awareness would imply more care in operations like code changes and deployments.
Outages happen, code changes occur; but you can do a lot to prevent these things on a large scale, and they simply don't.
Where is the A/B deployment, preventing a full outage? What about internally: where was the validation before the change? Was the testing run against a prod-like environment, or against something that once resembled prod but hasn't for ages?
They could absolutely mitigate impacting the entire global infra in multiple ways, and haven't, despite their many outages.
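To make the staged-rollout idea concrete, here is a rough Python sketch. It is not how Cloudflare or AWS actually deploy; the three callbacks (`apply_to_fraction`, `is_healthy`, `rollback`) are hypothetical hooks you would wire to your own deployment and monitoring tooling.

```python
from typing import Callable, Sequence


def staged_rollout(
    stages: Sequence[float],
    apply_to_fraction: Callable[[float], None],
    is_healthy: Callable[[], bool],
    rollback: Callable[[], None],
) -> bool:
    """Apply a change to growing fractions of the fleet.

    Stops and rolls back at the first stage whose health check fails.
    Returns True only if every stage passed.
    """
    for fraction in stages:
        apply_to_fraction(fraction)
        if not is_healthy():
            rollback()
            return False
    return True


if __name__ == "__main__":
    # Simulated change that only misbehaves once it reaches 10% of the fleet.
    applied = []
    ok = staged_rollout(
        stages=[0.01, 0.10, 1.00],
        apply_to_fraction=applied.append,
        is_healthy=lambda: applied[-1] < 0.10,
        rollback=lambda: applied.append(0.0),
    )
    print(ok, applied)
```

In this simulation the bad change stops at 10% of the fleet and never reaches the final 100% stage, which is the whole point: the blast radius is bounded by the stage size.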
They are aware. They just don't want to pay what prevention would cost. Education won't help - this is a very heavily argued tradeoff in every large software company.
I do think this is tenable as long as these services are reliable. Even though there have been some outages, I would argue that they’re incredibly reliable at this point. If that ever changes, though, moving to a competitor won’t be as simple as pushing a repository elsewhere, especially for AWS. I think that’s where some of the potential danger lies.
> and judging by the HN post age, we're now past minute 60 of this incident.
Huh? It's been back up during most of this time. It was up and then briefly went back down again, but it's been up for a while now. Total downtime was closer to 30 minutes.