Cloudflare outage on July 17, 2020

QuentinM · on July 18, 2020

Head of DevOps at a major financial exchange where latency & resiliency is at the heart of our business, and yes, we pay Cloudflare millions. I see two things here:

# Just be ready

Most definitely not the first time Cloudflare has had trouble, just like any other system: it will fail eventually. If you're complaining about the outage, ask yourself the question: why were not you prepared for this eventuality?

Spread your name servers, and use short-TTL weighted CNAMEs, defaulting to say, 99% Cloudflare, 1% your internal load balancer. The minute Cloudflare seems problematic, make it 0% 100% to bypass Cloudflare’s infrastructure completely. This should be tested periodically to ensure that your backends are able to scale & take the load without shedding due to the lack of CDN.

# Management practices

Cloudflare's core business is networking. It actually embarrasses me to see that Cloudflare YOLO'd a BGP change in a Juniper terminal without peer reviews and/or without a proper administration dashboard, exposing safe(guarded) operations, a simulation engine and co.? In particular, re-routing traffic / bypassing POPs must be a frequent task at scale, how can that not be automated so to avoid human mistakes?

If you look at the power rails of serious data centers out there, you will quickly notice that those systems, although built 3x for the purpose of still being redundant during maintenance periods, are heavily safeguarded and automated. While technicians often have to replace power elements, the maintenance access is highly restricted with unsafe functions tiered behind physical restrictions. An example of a common function that's safeguarded is the automatic denial of an input command that would shift electrical load onto lines beyond their designed capacity - which could happen by mistake if the technician made a bad assumption (e.g. load sharing line is up while it's down) or if the assumption became violated since last check (e.g. load sharing line was up when checked, became down at a later time - milliseconds before the input even).

user5994461 · on July 18, 2020

>>> Spread your name servers, and use short-TTL weighted CNAMEs, defaulting to say, 99% Cloudflare, 1% your internal load balancer.

Which can't be done because it invalids the point of using CloudFlare!

CloudFlare is used to protect your site from DDoS attacks and ransoms. It has to hide the IPs of the servers otherwise attackers will DDoS the servers directly, bypassing CloudFlare.

dorfsmay · on July 18, 2020

Then you use > 1 CDN and switch traffic away from the faulty one. Also, if you serve a very large amount (as in different, say tens of thousands of different images) of data, 1 % of traffic is not enough to keep the other CDN caches warm.

I know of at least one site that works this way which has allowed them to weather CDN outages.

almost_usual · on July 18, 2020

This isn’t an uncommon setup for large corporations.

godzillabrennus · on July 18, 2020

You can use cloudfront or another cloud WAF service as your alt DNS. I think Akamai has a solid one if you can afford it.

phyzome · on July 18, 2020

You can also use CloudFlare for caching, or as a WAF, which you might not care about during a relatively short CloudFlare outage.

deathanatos · on July 18, 2020

> and use short-TTL weighted CNAMEs, defaulting to say, 99% Cloudflare, 1% your internal load balancer. The minute Cloudflare seems problematic, make it 0% 100% to bypass Cloudflare’s infrastructure completely.

Except if you're using CF for DNS service, this wouldn't have worked, as both CF's website & DNS servers were impacted by the outage.

txcwpalpha · on July 18, 2020

That can't be possible, CF's website explicitly says that their DNS is "always available" with "unparalleled redundancy and 100% uptime"! ;)

In all seriousness, I wonder if they are going to have to change the marketing on the site now...

https://www.cloudflare.com/dns/

hunter2_ · on July 18, 2020

Somewhere, someone is saying the marketing is fine since 100 has just 1 significant figure and therefore two nines -- no, one and a half nines -- can safely be written as 100.

esperent · on July 18, 2020

If they are truthful it will now probably change to 99.997% uptime or similar. I expect that's still good compared to many DNS providers.

belorn · on July 18, 2020

Some dns provider only have one name server which is naturally bad.

However it is a bit bad of CF that a single configuration error can bring all the slave servers down. It mean that they have no redundancy in term of BGP misstakes. Customers of CF that want to avoid this would benefit to add an additional slave server outside the hand of CF.

Zonemaster (dns sanity checking tool) actually complain about CF hosted domain names because of the lack of AS redundancy. The outage yesterday demonstrate nicely why that is an concern and why one should care. https://zonemaster.iis.se/?resultid=7d1fab165987e195

jontro · on July 18, 2020

Same goes for route53 too unfortunately

QuentinM · on July 18, 2020

Yeap, that's specifically what I also implicitly meant by "Spread your name servers" (besides having them distributed). To use this technique, you also must have a "Business" account with Cloudflare ($200/mo), so to leverage their ability to front your websites using CNAMEs : )

deathanatos · on July 18, 2020

Ah, I see.

This got me Googling, and best as I can tell, CF doesn't support zone transfers. (They support being a client, but not a server. So, they could function as one's secondary system, but not as the primary.)

stedaniels · on July 18, 2020

I imagine/hope a lot of the heavily engineered ops teams don’t use zone transfers, instead using APIs. The idea of primary and secondary wouldn’t really a strong indicator at that scale.

deathanatos · on July 18, 2020

Sure, I could use the CF APIs… but to do what? AFAICT with a quick look over the documentation, there isn't any way to tail a log of changes being made to the zone. (You can export the entire thing in BIND format, though, the example in the docs has several errors in it that make me wonder how well that would work.) (The idea with zone xfers is that it is at least semi-standardized, whereas CF's API, while useful, is not.)

Then I'm stuck with a bunch of bad questions about how often to poll, and whether CF's rate limits would support any reasonably quick poll interval.

(The big problem is that we have other tooling that relies on being able to update DNS, the big one being ACME for certificate renewal. The changes it makes to CF would need to be rapidly replicated out to the nameserver.)

(Nothing in the post really strikes me as particular to CF, either. I think I could easily replace everything I've said here with "Route 53" and end up in the same bucket, maybe plus or minus zone xfers working.)

foobarbazetc · on July 18, 2020

Using CF for DNS is, IMHO, a bad idea in general, especially for large sites.

We use AWS + Azure + GCP (yes, all 3) as our authoritative NS and keep them all in sync with octodns.

neillyons · on July 18, 2020

Ah ha, This probably explains why my internet stopped working for a bit. I had the 1.1.1.1 app from Cloudflare installed on my phone.

RKearney · on July 18, 2020

> It actually embarrasses me to see that Cloudflare YOLO'd a BGP change in a Cisco terminal

The output is from a Juniper router, not Cisco.

QuentinM · on July 18, 2020

Thank you! Now who's embarrassed? ;-)

manquer · on July 18, 2020

99% and 1% or most Fail Over setups hardly work in practice smoothly unless you have lot of money to invest in teams and hardware and do DR drills constantly and keep standby infrastructure ready to handle full load . It may work in your industry where the infra cost is trivial compared to the risk and money being made. In typical SaSS apps infra is enormous part of the costs, keeping standby ready is not feasible at all.

It is also that typically even in large organizations companies with the money and people, fire drills and DR drills go the same way, it is known there is going to be drill and people react accordingly. Chaos Monkey style testing/drills rarely happen.

I would say building resiliency to your architecture is the key to this. Just like having a single customer > 50% revenue is enormous risk for any business , relying on any single service provider is also enormous risk . In manufacturing it is common to insist on second source for a part, IBM did that to Intel for the PC which is why AMD got into x86.

In this case a proper HA would serve better - minimum of 2 CDN networks always sharing 50% of the load and the have capacity to double if required. If they cannot scale that much then distribute to 3-4 and keep traffic no more than 25-35% per provider , such that a loss one means only additional 10%-20% traffic to the rest.

Also it is important that two service providers should be actually different, if they both depend on the single and the same ISP or backbone to service an area, it is not going to be effective.

The principle should apply across the entire infra Name Servers, CDNs, load balancers, Storage, compute, DBs, Payment g/w and registrars ( use multiple domains example.com example.io each with one registrar).

spenczar5 · on July 18, 2020

> The minute Cloudflare seems problematic, make it 0% 100% to bypass Cloudflare’s infrastructure completely. This should be tested periodically to ensure that your backends are able to scale & take the load without shedding due to the lack of CDN.

How do you justify the cost? I'm seriously asking - I have had a hard time making this pitch myself, I am curious if you have (recent!) experience with this.

nolok · on July 18, 2020

Ah that one is surprinsingly easy, you justify the cost by facing facts. Did your company lose money during the downtime, and if yes is that sum more than what it would cost to have this redundancy ?

No ? Then the costs are not justified, and while it would be better from a tech perspective it makes no business sense.

Yes ? Well then you spend X to save Y, with Y being greater than X, so it's an easy sell as long you don't start with "cloudflare is never down" (which is not true).

dannyw · on July 18, 2020

I always assume a service (eg Cloudflare, AWS availability zone) will be completely down for a minimum of 30 minutes, once a year.

It’s worked surprisingly well.

hpkuarg · on July 18, 2020

Yeah, that's roughly 99.99% availability, which sounds reasonable for most anything you want to depend on.

avh02 · on July 18, 2020

Not somebody who'd need to make this decision but: I guess it'd depend on the cost to your business of a half hour outage.

throw_m239339 · on July 18, 2020

> Spread your name servers, and use short-TTL weighted CNAMEs, defaulting to say, 99% Cloudflare, 1% your internal load balancer. The minute Cloudflare seems problematic, make it 0% 100% to bypass Cloudflare’s infrastructure completely. This should be tested periodically to ensure that your backends are able to scale & take the load without shedding due to the lack of CDN.

If your service does scale at first place, then you don't need Cloudflare most of the time.

QuentinM · on July 18, 2020

faeyanpiraat's point, but also, despite this failure, let's not dismiss the fact that Cloudflare brings unique (i.e. difficult to replicate) features (hence their success) a/ ability to identify threats at a global scale using a massive aggregation of data b/ ability to stop malicious actors close to their sources thanks to their large grid of POPs & their use of the anycast routing model.

Sure, anyone can scale my localized infrastructure for the traffic of 100,000 IP Cameras. Can anyone do it for 10,000,000 pwnd devices? Sure, but it'll likely start not being so practical without multiple POPs. Do I want to hire a dozen network & threat detection engineers to build/maintain that, complicate my processes, and pay for the infrastructure moving forward for a once-in-a-year event? Not really, no.

The way I see it, Cloudflare acts just like an insurance policy. Pay for a fraction of the actual cost, get your back covered, and profit from the expertise when it hits the fan.

dannyw · on July 18, 2020

I used to run a cryptocurrency website. It would get 50-100gbit+ DDoS attacks on a daily basis. This was a number of years ago.

DDoS mitigation providers wanted absolutely absurd amounts. Cloudflare took me on for $200 a month (I had confirmed beforehand). Mitigated all the attacks. All tickets were responded within minutes by network engineers working to mitigate the attack.

faeyanpiraat · on July 18, 2020

Making something scale and scale cost efficiently is two different things.

Rapzid · on July 18, 2020

I would have assumed CF had a simulation of their entire network(including their peers) where changes would be applied and vetted before rolling it out..

spenczar5 · on July 18, 2020

Networking in general is a far less sophisticated world than we might like to hope. You have to deal with quirks of vendor-specific firmware, creaky protocols, and so on, and the culture of networking has been a bit behind some other areas of software in embracing testing in the way you describe.

We'll get there, but it's no surprise CF isn't doing this today; it would put them waaaay ahead of the pack if they did.

lima · on July 18, 2020

Nothing stops you from replicating your backbone network using a bunch of vMX VMs and testing your changes on it.

Would not catch weird firmware quirks in the real hardware, definitely would've caught this fat-finger typo.

spenczar5 · on July 18, 2020

Well, the thing that stops you is the cost of designing, implementing, maintaining, and scaling the replica testbed. On a large network, that would be pretty hard to justify to most organizations, which would see it as very costly with a tough-to-measure upside.

Have you done this before? I'd be interested to hear how those conversations went.

Drdrdrq · on July 18, 2020

For an organization like CF? Yes, I would expect them to have testing and network simulations down to an art.

If I had to guess, I'd say it's because network engineers simply don't need / get this know-how on normal scale. Most SW developers on the other hand are not very good (good enough for CF scale) at networking. Which leads to networking guys doing their thing the way it was always done... (hope I didn't offend anyone, just guessing)

I hope they strengthen their dev department... I know I'd love a challenge like that. :)

dannyw · on July 18, 2020

This is understandable for most organisations but not networking centric businesses like Cloudflare.

kj4ips · on July 18, 2020

I have to wonder if it would have. Unless you have some kind of route visibility collecting took, or a bunch of simulated traffic sufficient to pop the CPU on the vMX that represented atl01, it would all appear to work. I wonder if you could generate traffic, and scrape snmp counters as a proxy?

Or some kind of tool that processes the resultant routing tables to generate some kind of "route usage" for every given link and device, maybe even feed it with a table of expected traffic to given destinations.

azinman2 · on July 18, 2020

I kind of would? If you’re running a private backbone with these number of PoPs, wouldn’t things be more sophisticated?

jeffrallen · on July 18, 2020

That's a nice theory, but the majority of power disruptions I've ever faced in data centers came from planned work on UPSs that went bad. If you want the quickly lower the reliability of a system, put it on a UPS.

developer2 · on July 19, 2020

>> we pay Cloudflare millions

>> just like any other system: it will fail eventually

An analogy combining these two points: one could pay a million dollars AN HOUR to the top software engineers alive on the entire planet… and at some point you will encounter failure. Technology and humans are both fallible, end of story. This is why SLAs exist with specific uptime targets to meet, and reimbursements should that SLA be broken. Anyone who believes the rare outage is unacceptable: fine; bring that layer in-house and pay engineers millions a year to do the best they can. You'll still encounter failures–and likely more of them.

The usual cry you'll hear from some business is "we lost $x million during the downtime!". Yes, and without some company like Cloudflare in front of your business, you'd probably be losing $x million multiplied by orders of magnitude you don't even want to imagine.

"You can't have your cake and eat it too."

MaxBarraclough · on July 18, 2020

> This should be tested periodically to ensure that your backends are able to scale & take the load without shedding due to the lack of CDN.

Are you thinking of a cloud-computing context here? Seems to me a lot hinges on this, but perhaps I'm misunderstanding you.

If so, this would answer the scale question, and would presumably translate into increased prices until the incident is over. (I'm assuming CloudFlare offer a cheaper solution than doing it yourself on a cloud.)

If not, and you own the physical capacity yourself, wouldn't you do away with CloudFlare entirely?

chronid · on July 18, 2020

> If not, and you own the physical capacity yourself, wouldn't you do away with CloudFlare entirely?

Cost could be an issue. We had something similar (not in the same context) in a company I worked for before. We could shift traffic, but that would cost 2-3x more, so it was not the preferred path unless we had problems.

It surprises me that many (big) companies did not learn the lesson already. We had a similar thing happening already years ago with dyn in 2016 (https://en.wikipedia.org/wiki/2016_Dyn_cyberattack), and it was surprising how many companies relied on a single DNS provider.

MaxBarraclough · on July 18, 2020

Interesting, thanks. Didn't expect a CDN to win out on price against in-house capacity.

Presumably this is a function of scale? At a certain point it's going to be worth running your own CDN.

jefftk · on July 18, 2020

Running your own CDN that is competitive with cloud flare and the other top CD ends requires dozens to hundreds of edge servers distributed around the world, close to your customers. This is very expensive, and while it does make sense for the very largest companies, almost everyone else is going to do better paying for a piece of large-scale shared infrastructure.

kj4ips · on July 18, 2020

> # Management practices

>

> Cloudflare's core business is networking. It actually embarrasses me to see that Cloudflare YOLO'd a BGP change in a Juniper terminal without peer reviews and/or without a proper administration dashboard, exposing safe(guarded) operations, a simulation engine and co.? In particular, re-routing traffic / bypassing POPs must be a frequent task at scale, how can that not be automated so to avoid human mistakes?

We don't know if this was entirely the case, based on the timeline for the initial incident that prompted the change gone awry, there very well could have been an ITIL-Style CR created and processed within this time.

Judging by the edits made, this wasn't just simply taking a POP out of service entirely, but reducing the amount of (or eliminating all of the) traffic from neighboring POPs sent to compute at the ATL location. I can't image that this exact type of change is all that common. BGP anycast actually makes things significantly more complicated when removing edges.

As far as the mechanics go, with junos's CLI, there's not a lot of difference between what the intended command would have been, and the one that actually happened.

---

What they probably wanted

| example@MX1> configure

|

| {master}[edit]

| example@MX1# edit policy-options policy-statement 6-BBONE- OUT

|

| {master}[edit policy-options policy-statement 6-BBONE-OUT]

| example@MX1# deactivate term 6-SITE-LOCAL

|

| {master}[edit policy-options policy-statement 6-BBONE-OUT]

| example@MX1# commit

---

What might have happened

| example@MX1> configure

|

| {master}[edit]

| example@MX1# edit policy-options policy-statement 6-BBONE- OUT

|

| {master}[edit policy-options policy-statement 6-BBONE-OUT]

| example@MX1# deactivate term 6-SITE-LOCAL from prefix-list 6-SITE-LOCAL

|

| {master}[edit policy-options policy-statement 6-BBONE-OUT]

| example@MX1# commit

---

Initially, this seems like quite a bit of difference, however, Junos has a hyperactive autocomplete that triggers on spaces,. that deactivate could have been as short as "dea ter 6 fr p 6"

I'm not aware of any routing simulation product that is able to simulate complex bgp interactions, and report on effective routes of simulated traffic, as well as CPU load predictions. The closest I am aware of is running GNS3 (or a bunch of VM routers) overnight and capturing SNMP.

On the other hand, automating these kinds of changes would seem trivial, but such a service would have to be as fault tolerant as any other project, but is most certainly a worthwhile endeavor especially since integration is actually relatively easy, Junos provides some nice REST and XML APIs on the management interface that can do pretty much everything the CLI can, except start a shell.

spenczar5 · on July 18, 2020

Thanks for the detail. There are a lot of people in here who are saying "why didn't they just test their changes before applying them?" and I don't think they really understand how hard that is and how rarely it's done.

manquer · on July 18, 2020

Peer review should always be possible, perhaps CF already does it and it got slipped in the review, reviews only reduce errors and not eliminate them.

It is difficult to write automation to cover all the tasks you would do, even if you cover the ones most commonly done, you will have higher risk on the rest.

A linter or higher level instruction set which well tested may be better solution perhaps. Automation if any perhaps should be after that ?

Steve-_-Jobs · on July 19, 2020

I think the mistake could be assuming that empty "from" statement would not match any routes while in reality deactivating everything inside "from" statement removes it altogether and makes the term match all routes which is indeed somewhat unexpected.

halbritt · on July 18, 2020

> Cloudflare's core business is networking. It actually embarrasses me to see that Cloudflare YOLO'd a BGP change in a Juniper terminal without peer reviews and/or without a proper administration dashboard, exposing safe(guarded) operations, a simulation engine and co.? In particular, re-routing traffic / bypassing POPs must be a frequent task at scale, how can that not be automated so to avoid human mistakes?

Nailed it.

redm · on July 18, 2020

CloudFlare is a good company and everyone has outages. IMHO the post-mortems they post are not only some of the best I've read from a big company, but they are produced quickly.

I only wish they could update cloudflarestatus.com more quickly. Shouldn't there be some mechanism to update that immediately when there is an incident? When the entire internet knows your down and your status page says All Systems GO! it looks very poorly on them.

aeyes · on July 18, 2020

Their comment regarding the late update here: https://news.ycombinator.com/item?id=23878496

> Because of the way we securely connect to StatusPage.io from most locations where our team is based. The traffic got blackholed in ATL, keeping us from updating it.

nostrebored · on July 18, 2020

Circular dependencies like this are not a good look when the core of your business is networking...

aeyes · on July 18, 2020

Yep, I was joking with our rep that them using Cloudflare Access for their internal services sounds like a problem waiting to happen.

Guess I wasn't wrong, they might even have lost access to internal monitoring systems which is pretty unfortunate in such a situation. If you ask them about Cloudflare Access they will happily tell you that it was built for internal tool access and that they use it for everything, later they went on to sell it as a product.

kochthesecond · on July 18, 2020

When Google Cloud went down a few years ago, they were unable to access internal monitoring because the bad bgp change overloaded their networks if i remember correctly.

dylan604 · on July 18, 2020

An example of dogfooding your own product is actually a bad idea?

microcolonel · on July 18, 2020

While they could dogfood the product, status monitoring systems should be separated from your bread-and-butter product's failures. If you are in the business of messing with BGP, then the BGP that controls the routes that let you report outages should not be the same one you are messing with regularly, or at least, there should be redundancy.

crest · on July 18, 2020

They could add a deadman switch to their status page.

ithkuil · on July 18, 2020

A meta status page

marksomnian · on July 18, 2020

Soon you'll need status pages for the status pages.

fernandotakai · on July 18, 2020

already exists ;)

https://metastatuspage.com/

(it's statuspage.io's status page)

earthboundkid · on July 18, 2020

Isn’t it status page 101 that you host it in the competition, so it stays up when you’re down? :-/

eastdakota · on July 18, 2020

Here's our blog post detailing what happened today and the mitigations we've put in place to ensure it doesn't happen again in the future: https://blog.cloudflare.com/cloudflare-outage-on-july-17-202...

txcwpalpha · on July 18, 2020

I mean this in only a slightly judgemental way: What kind of change management/testing is going on over there at CF? This is not the first time that someone at CF has made a hasty config change that brought down a significant part of the network.

Now I'll admit that proper change management won't catch every issue, but the issue described in this post seems like something that should have been caught. It's a little worrisome that a company that so much of the internet relies on is apparently playing fast and loose with major config changes. The changes you describe in the post-mortem sound like they will fix the immediate problem and possibly prevent future occurrences of this exact same problem, but what about making broader changes about how you deploy these things?

Not only that, but this single config change not only brought down the CDN, but brought down both 1.1.1.1 and the secondary 1.0.0.1. Was this type of failure never tested, or...? What's the point in having both if they both go down at the same time?

harikb · on July 18, 2020

I would assume CF operations folks would be constantly making changes that affect “production” many times a day. Some of it will be to be counteract an issue that originally started because of an external issue like they mention here.

In an ideal world, yes there would be higher level tools to do everything and checks and balances for everything, but I assume half the tools are like our standard Unix cli tools - all too powerful and the operator better know what they are doing

By the time they have all the right tools for everything, the company would have become a dinosaur disrupted by some other technology or company

txcwpalpha · on July 18, 2020

>I would assume CF operations folks would be constantly making changes that affect “production” many times a day. Some of it will be to be counteract an issue that originally started because of an external issue like they mention here.

Even in an "emergency" situation like described in the post-mortem (which I'm actually confused about, because Prince's tweet said it was actually "routine maintenance", which makes it even less justifiable), there should be a standard playbook with pre-tested and pre-approved failover options to enact. There definitely should not be someone just manually fingering in a config file on the fly.

>In an ideal world, yes there would be higher level tools to do everything and checks and balances for everything, but I assume half the tools are like our standard Unix cli tools - all too powerful and the operator better know what they are doing

This is fine for some startup still hacking their way through an MVP, but for a global, multi-billion dollar company that prides themselves on supporting a significant part of the global internet, "the operator better know what they are doing" is unacceptable and not good enough. There should be multiple layers of protection against this stuff, so that even if an operator is having a nutty, they don't break half the internet.

>By the time they have all the right tools for everything, the company would have become a dinosaur disrupted by some other technology or company

I can't agree at all. We're talking about some basic change management and testing stuff here. I've been at plenty of organizations (even ones much smaller and with less resources than CF) that have gotten this right, and they certainly are not dinosaurs nor have they been "disrupted". Testing and change management is as fundamental as encryption or password hashing. If you don't have it, it's not because it's hard, it's just because you haven't bothered to implement it.

harikb · on July 18, 2020

For the record, I don’t work for cloudflare. I was merely pointing out that failsafe and bulletproof tools for every situation is a myth in my humble opinion. If you have seen and work for organizations that know how to do this better, good for you.

txcwpalpha · on July 18, 2020

>If you have seen and work for organizations that know how to do this better, good for you.

Let's be clear: I'm not talking about some small subset of fantasy companies. It is a strict compliance requirement for many of the world's largest industries that you must have things like change management processes, incident response playbooks, and testing in place. A "myth" it certainly is not. Again, it's a fundamental thing no different than the most basic security requirements that we also expect (which is no coincidence, as one of the components of many security frameworks is proper change management controls). I'm not saying it has to be bulletproof and perfect (in fact, it never will be), but you still have to have something.

Put another, possible more easily understood way: "don't test in production" is a requirement (some times even a legal requirement) for most companies once they reach a certain size. It's not just a funny saying that devs joke about.

closeparen · on July 18, 2020

Compliance can make you do something that looks like code review, something that looks like testing, something that looks like runbooks, etc.

That can't will into existence engineers who are actually thoughtful, creative, and skilled at those things. They can't make the test suite actually anticipate the bugs that will be written. They can't make the runbook have correct steps with no unwanted side effects for every failure mode. They can't make the reviewer see what the author doesn't.

Two, maybe three people in my 100-person org are legitimately good at this stuff. No amount of grandstanding about what a serious business we are changes that. In fact the people most enthusiastic about compliance and process usually end up undermining the real, substantive technical quality that the compliance process was aiming at.

txcwpalpha · on July 18, 2020

Absolutely. I'll be the first to say that compliance doesn't mean you're actually good at those things (and I actually have said as much here on HN).

However, the point remains that things like change management aren't some nebulous, pie-in-the-sky concept. That they are in compliance frameworks speaks to the fact that they are very fundamental and basic (compliance frameworks are typically the bare minimum of things you should be doing).

A half-assed CM process won't catch all errors by any means. But it will catch the most basic ones. And according to CF's post-mortem, the initial issue that kicked off the chain of events was "backbone congestion", which has got to be one of the most basic things that would be included in an IR playbook (Prince even referred to it as "routine maintenance" in an earlier message). And then the config change that was put in place also seems like a fairly basic change that probably would have been caught with with some basic testing.

That's what catches my eye the most. This incident wasn't some wild, niche, couldn't-have-been-predicted event. It was "backbone congestion", followed by an attempted change of routes on the backbone to alleviate the congestion. For an internet services company to not have a standard, pre-approved and pre-tested solution for resolving something as predictable as backbone congestion is shocking to me.

manquer · on July 18, 2020

You are right if the facts are exactly as presented and there is nothing more to know.

I would say we should ready between the lines here. Most postmortems are marketing artefacts to some sense designed to reassure customers so taking them at face value is not really good idea, I have written and read enough to know that it is rarely ever the full story for variety of reasons.

It is quite possible like you say they are playing fast and loose with config and they have systemic process risks as you point out.

It is also possible that the issue is not just routine congestion i.e. it is symptom and not the root cause, and they are not talking about it in detail either to make the communication simple enough for non networking experts to understand, but has enough depth for technical folks to kind of follow along and reassure their managements that cloudflare has handle on it . It could also be revealing more may give their competitors some crucial information about their IP/architecture and they are calling it routine congestion to not reveal that.

Lastly, all of the above are not mutually exclusive, it could be all of them in parts too.

nostrebored · on July 18, 2020

Attitudes like this are the reason that we don't have better practice industry wide. Seeing quality control as the enemy of success doesn't create good software, it creates technical debt and discourages prospective ops engineers.

The teams I've seen be the most comfortable with these knowably bad processes have been the strongest advocates against guardrails. The cost to your reputation and bottom line when something does go wrong is opaque to many engineers, but these kinds of ridiculous errors do lose customers.

Edit: A common saying where I work is that there are no stupid mistakes only bad processes. The first step to rooting them out is admitting that they're process level issues.

closeparen · on July 20, 2020

My problem with process/policy is that it tends to attract a) bureaucrats, and b) the kind of engineer who is not particularly interested in or good at building products but still wants to ladder-climb.

These are not the people I want dictating my workflow.

That doesn't mean it's the "enemy of success." If we could raise its status to attract people with genuine technical competence and empathy for the realities of building, we could do a lot better.

>Edit: A common saying where I work is that there are no stupid mistakes only bad processes. The first step to rooting them out is admitting that they're process level issues.

Human stupidity is unbounded; such a process asymptotically approaches "do not change production ever." If you're a monopoly with a money printer, then keeping it running probably is more important than any improvement you could make. If you're under competitive pressure, you need to ship, and shipping necessarily involves risk.

ATsch · on July 18, 2020

Indeed, mitugating DDOS effectively requires frequently making drastic changes to your network on short order.

jonny_eh · on July 18, 2020

And have both the primary and secondary DNS go down?

dylan604 · on July 18, 2020

Can you name one large service/platform provider where this hasn't happened? We know the likes of Amazon, Google, Microsoft, et al have had similarly idiotic causes to go down. I'd either give them all of my business or start betting against them as their day is due.

rohansingh · on July 18, 2020

What makes you think that whoever would be responsible for approving the change would somehow know better and catch the issue?

teraflop · on July 18, 2020

The fact that such a critical problem can only be detected by a human noticing it is, itself, part of the problem. Having multiple people look at the change instead of just one is a little bit better, but it's still just a band-aid. Ideally, there would be automated processes in place that could prevent and/or mitigate this kind of thing. (If we were talking about software instead of configuration, how would you view an organization that required every commit to be reviewed, but had no automated tests?)

One possibility would be to parse the config file and do a rudimentary simulation, taking into account traffic volumes, and warn the user if the result would be expected to overload any routes.

Another possibility would be to do something a bit smarter than just instantly, blindly propagating a change to every peer at the same time. If the bad routes had been incrementally rolled out to other routers, there might have been time for someone to notice that the metrics looked abnormal before Atlanta became overloaded. (I don't know whether that's feasible with BGP, or if it would require a smarter protocol, but it seems like it would be worth looking into.)

Finally, it seems like if a config change is made to a router and it immediately crashes, there should be systems that help to correlate those two events, so that it doesn't take half an hour to identify and revert the change.

RKearney · on July 18, 2020

It’s probably easy to say this as an outside party, but anyone with the most basic understanding of Junos would have known that by deactivating all of the entries in the from section of a policy or policy term, the accompanying from section would apply to all routes.

txcwpalpha · on July 18, 2020

Ensuring that the appropriately-knowledgeable people for various types of changes, as well as appropriate testing processes, are part of the change management workflow is a part of good change management.

dang · on July 18, 2020

Ok, we've changed to that from https://www.cloudflarestatus.com/incidents/b888fyhbygb8.

tomklein · on July 18, 2020

I love your transparent communication and how quickly you guys respond with an explanation of what actually went wrong and how to mitigate these issues in the future. Such things happen.

tomklein · on July 18, 2020

Addition: CF just sent an email out to customers about the incident.

heliodor · on July 18, 2020

I beg to differ!

They wrote a two-page essay and buried in the middle is the one-sentence root cause of "someone changed something and made a mistake". No explanation or details given.

This is exactly how to pretend you have incident reports without actually having them.

Luckily, the HN crowd is able to piece it together and explain in detail.

paco3346 · on July 18, 2020

But... they literally showed the config change and said what was wrong with it.

dancemethis · on July 18, 2020

[flagged]

andrewnicolalde · on July 18, 2020

Many of us would appreciate some proof of Discord being malware

wenbin · on July 18, 2020

A similar outage happened in July last year: https://blog.cloudflare.com/details-of-the-cloudflare-outage...

almost_usual · on July 18, 2020

How is a WAF regex similar to a route configuration error?

lima · on July 18, 2020

Similar lack of peer review and change management process.

michalhosna · on July 18, 2020

Nope, not at all. Read the damn thing. They tested it, the thing was correct, it overloded cpu, it wasn’t configuration mistake.

lima · on July 18, 2020

I read the damn thing:

- No performance benchmarks for rules despite using a backtracking regexp engine.

- No canary deployments, allowing the mistake to propagate to their entire network at once.

A mature engineering organization is very unlikely to make such basic mistakes, because it has in-depth peer review for both design and implementation.

Of course, their openness about the incidents is commendable and is the only reason why my company hasn't dropped them a long time ago, but it doesn't excuse such an easily avoidable downtime.

bawolff · on July 18, 2020

I dont think these are all that similar - different components, different types of overloads

katzgrau · on July 18, 2020

I was on a call with an investor and an employee mouthed to me, silently, "everything is down!"

Immediate hot flash.

After I got off the call (thank god he had an appointment to run to), I checked it out. Our internal dashboards were all green so we realized it was a DNS issue pretty quickly.

Since we couldn't get into Cloudflare we searched Twitter and realized it was their issue and I stopped worrying.

One of the benefits of CF and other major internet vendors is that when they're down, you can kind of shrug it off and tell your customers to wait a bit. Not so if you're using a smaller/unknown CDN company.

dweekly · on July 18, 2020

As a famous haiku once said:

It's not DNS

There's no way it's DNS

It was DNS

DangerousPie · on July 18, 2020

Honestly I think this is an underappreciated aspect of the dominant role of Cloudflare. If we were using a smaller CDN then our users would blame us every time it went down. But because we're using Cloudflare our users will quickly notice that sites much bigger than us are down as well and just think "eh, the internet is being weird right now".

paul_f · on July 18, 2020

Twitter has become the go-to dashboard to discover if there is a broad based outage when our servers are inaccessible. In the last month, IBM data centers and now Cloudflare

tokumei · on July 18, 2020

Anytime Cloudflare is experiencing issues, I immediately stop what I’m doing and direct customers to the Cloudflare status page and take a nice break.

jzawodn · on July 18, 2020

Why did you stop worrying?

ksec · on July 18, 2020

You have a bigger name to blame. It is much easier to tell customer Amazon is down, or Cloudflare is down, rather than explaining something that customer dont want to hear.

And if everyone else on the AWS / CF is down as well, then it is no one's fault. We all keep calm and just wait it out.

LaserToy · on July 18, 2020

Your customer doesn’t care, as they don’t have business with cloudflare. Your fault is in not having plan b.

zaroth · on July 18, 2020

Wish it were so simple, but alas, fair or not, this is a real and prevalent bias in the world.

I’ve had colo equipment that ran with 5-nine uptime for years eventually get unlucky and be down for an hour and it was “all my fault”. Switch the service to Amazon which achieves much worse uptime, but now it’s “well if Amazon is down, what can you do?”

Frankly when something seen as core internet infrastructure goes down, the measly SaaS companies pretty reasonably don’t take any blame.

To be clear, often these are services where there isn’t the engineering budget, nor honestly the need, for multi-cloud and geographically distributed multi-master services.

LaserToy · on July 18, 2020

Agree, but it comes across as we don’t care. I worked for a major gaming network, and believe me, gamers don’t care why they can’t play. They want to play. Now. And I can’t imagine anyone on the team relaxing when something like that happened.

katzgrau · on July 18, 2020

Yeah, luckily our customers are a little easier to deal with than that, despite the very real impacts to their businesses. You have to know/understand your customers in order to know when you can relax.

But the plan B in this case is switching nameservers, which we could certainly have done (and briefly considered), but it could be error prone and would take longer for those changes to propagate than it would for CF to fix the issue, most likely.

The best option was simply inaction, if that makes sense. There are times when not doing something is a better idea than doing something.

mrkurt · on July 18, 2020

They're right, though. Some companies are large enough to give you credibility cover during an outage.

It's the same as the old phrase about no one ever getting fired for buying IBM.

ksec · on July 18, 2020

Yes and these sort of excuses applies to ALL industries. Any Sales people or Support Staff should already know this. So I am often surprised people are arguing about it.

At the end, your customer, assuming it is a SaaS, just need an excuse for "their" customers or the End User. And there is nothing better than a big household name's fault so no one get the blame.

So the whole thing is written off, everybody is in the clear, and everyone can get on with their other business. :)

syshum · on July 18, 2020

Well many people have the ability to see through marketing BS, and these types of excuses are exactly that, marketing BS

I often feel sad for people that have not developed the critical thinking skills needed to look past marketing to where the realm of logic and reason resides

untog · on July 18, 2020

Patronising much?

The simple reality is that if your site is down because AWS is down then it means a lot of other sites are down too. Which means you don’t look anywhere near as bad as you would if you were the only site not working.

Logic and reason is all well and good but human perception is a very real thing that businesses need to keep in mind. It isn’t always entirely logical.

syshum · on July 18, 2020

I would like to know of a time where all of AWS was down, every region, every data-center, everything down

At most one region was impacted, and it is easy to be multi-regional in AWS, in fact that is kinda of the point

Further this comment is in service of the moronic axiom of "No one ever got fired for buying <<insert large company>>" my response to that has always been and will always be "sure they have and they should"

It is simply not true that buying AWS, IBM, Cisco, or any other large vendor is complete insulator from all responsibility to maintain reliable systems nor should it be

Any administrator or developer that is making buying choices based on that is not an person I would like to ever work or do business with

untog · on July 19, 2020

“AWS is down” is phrasing shorthand, it doesn’t mean every single service they offer is nonfunctional. It means that for the vast majority of people it is not doing what it is supposed to do. Again, it’s not a technically correct statement but it is how humans talk.

To extrapolate out to the original point: if multiple top ten traffic web sites are having issues (and this has happened once or twice in the last few years due to AWS or other cloud issues) then your site being down is less notable in customers minds.

syshum · on July 19, 2020

again that depends on who your "customers" are

The grandparent was talking about a SaaS service so then I would expect the customers to be technically minded people.

If you are selling to masses then sure, but if you are selling me a Line of Business SaaS service then no I do not care if facebook and reddit was down, that is not relevant to how the SaaS product should be running

ksec · on July 19, 2020

I think that is a narrow definition of SaaS. But I understand your point.

For instance Basecamp could be down due to AWS. And Basecamp ( and in this case Hey as well ), but are SaaS. And their customer may not always be technically minded. Given the usage of these tools, while inside tech circle, are not only used by technical people.

And At the end of the day it is all about trade offs.

I also dont think anyone ever make a purchase decision purely on brand or not fired for X. For example, despite AMD offering lower price and offer more core and performance. Server Vendors hasn't all switched to AMD at once. In fact, Intel Server still has months of backlog order to fill. This isn't simply because Intel is better connected with Vendors, it is the fact most of those End user / customers are still demanding Intel CPU. Because it is well tested, with more specific libraries, tools, guarantees and support. Many of these factors cant be quantitatively measured, and therefore would only be judged when the final price difference are shown. In this case No one gets fire for using X is another phase for if it aren't broke, dont fix it.

syshum · on July 19, 2020

I believe Xeon vs Epyc is more to do with the fact that migration from Xeon to Epyc is a disruptive process both at the hypervisor and guest level. Server hard ware changes today is

Install new host -> Migrate VM;s live to new host -> Shutdown old host

Zero Downtime

This is simply not possible with a Xeon to Epyc Migration which requires downtime, as well as testing of the guest to ensure nothing weird happens

it is a prime example of Vendor Lockin.

IF there was away to live migrate with zero downtime Epyc would own the datacenter today

peteretep · on July 18, 2020

I was CTO at a SaaS company with a whole host of companies you've heard of relying on us as a core part of their business. They absolutely would and did accept "AWS was down, soz" as an excuse, where they were much spikier about fuckups where we were to blame.

Tangential: nobody got fired for buying IBM

LaserToy · on July 18, 2020

I was in eng leadership at largest gaming network. Our customers (gamers) absolutely didn’t care about reasons they couldn’t play. It was always our fault as we were responsible for tech choices, not them. And I believe it is a way. If I’m selling you a service it is my responsibility to make sure it is reliable and not to blame it on something that is under my control (like cloud providers). Of course I can’t fix local internet going down, but I can make sure I’m not married to 1 vendor

katzgrau · on July 18, 2020

I'd argue that you're probably holding yourself to a standard that is more/less unachievable in such an interdependent world. It's idealistic and idealism is a square peg in the funny shaped hole of reality.

Taking accountability and having backup plans are extremely important, but you simply can't remove every last shred of dependence. You eventually have to accept that there are things that are out of your control and may take you by surprise despite best efforts.

zzzcpan · on July 18, 2020

In web and online infrastructure pretty much nothing is out of your control except for two things: ISPs people use and domain name registrar you use for your domain name. And even domain name registrar centralization can be mitigated against by having multiple domains from multiple registrars and promoting different domains to different users and having backup communication channels to inform users about new domains in case something happens.

Other than that it's your choice whether to make your infrastructure dependent on a bunch of unreliable centralized SPOFs from big corporations or build highly available infrastructure relying on servers from many different providers running your own DNS servers with DNS routing, failover, etc. You will definitely beat Cloudflare's availability this way many times over.

katzgrau · on July 18, 2020

And you will still be exposed to being blindsided by something out of your control. It's really only in your control of you can think of and plan for it ahead of time. And there will certainly be things that we don't consider. You can call that a failure but it happens all the time and it's reality.

What if a political event impacts you, for instance? A pandemic? A storm taking out a major data center? A weird Linux kernel edge case that only happens beyond a certain point in time? That only sounds ridiculous because it hasn't happened, but weird things like that happen all the time. There are so many unseen possibilities.

I understand that might sound unreasonable or facetious or like I'm expanding the scope.

The point is, the more confident that you've built something that has no SPOF the more exposed your are to the risk of it, because one probably does exist.

zzzcpan · on July 18, 2020

Honestly, you are not making any sense. This is not how engineering works. If you design for resilience, you get more resilience and you build confidence as you see the evidence how the system works in real world. Furthermore, with resilience you have to always cover all risks, it's just that you don't immediately reach fine granularity of decisions that don't trigger failover to servers in different countries, you improve granularity as you learn from actual operations and modify your designs accordingly.

I remember when I first deployed DNS routed system it was too reactive, constantly jumping between servers, monitoring was too sensitive, it didn't wait for servers to stabilize to return them into the mix and exponential backoff was taking servers out for far too long. But even given all that it was still able to avoid outages caused by data center failures and connectivity problems.

katzgrau · on July 18, 2020

It does make sense, and it's paradoxical, I know.

> If you design for resilience, you get more resilience and you build confidence as you see the evidence how the system works in real world.

You simply can't foresee or eliminate all risk. This is referred to as "the turkey problem." It's not my idea, but one I certainly subscribe to.

https://www.convexresearch.com.br/en/insights/the-turkey-pro...

zzzcpan · on July 18, 2020

The whole idea behind resilience is to cover unforeseeable risks, the turkey problem just doesn't apply here. I would even say if the system doesn't solve the turkey problem it cannot be called resilient. And high availability without resilience is not practically possible.

katzgrau · on July 18, 2020

> The whole idea behind resilience is to cover unforeseeable risks

Speaking of things that don't make sense... if it's unforeseeable, one will have a difficult time adequately preparing for it

zzzcpan · on July 18, 2020

It's not difficult, it's just different. It's the difference between predicting that a truck might crash into a data center and building concrete wall around it, and designing a system in a such way that users only ever resolve to servers that are currently available regardless of what happened to some of them in a data center that had a truck crashed into it.

katzgrau · on July 18, 2020

... and after you've solved for the truck problem, you have a potentially infinite list of other things to plan for, some of which you will not foresee. And of course, there's probably an upper bound on the time you can spend preparing for such things.

Famous to the point of being a cliche, the titanic was thought to be unsinkable, and I would have a similarly hard time convincing the engineers behind the ship's design to believe otherwise.

The level of confidence you're displaying in predicting the unforeseeable is something you may want to take a deeper look at.

zzzcpan · on July 18, 2020

You are missing the point. Solving the truck problem is exactly what you shouldn't do, well, at least until your system is resilient. Because it could be something entirely different, it could be law enforcement raiding a data center and your wall around it won't protect it from them. So instead you approach the system in terms of what it has to rely on and all possible states of the thing it has to rely on. Which maps to a very small number of decisions. Like whether a server is available or not. If it's not available it really doesn't matter which of the infinite things that could happen to it or to a data center it is in actually did, you simply don't return it to users if it's not available and have enough independent servers to return to users in enough independent data centers to achieve specific availability. It's really not difficult.

I understand that most of those leetcode corporations don't care much about resilience, likely even incapable of producing highly reliable systems, and may give you a false impression that reliability is something of an unachievable fantasy. But it's not, it's something we have enough research done on and can do really well today if needed, we are not in titanic era anymore.

I have high confidence in these things (not in "predicting the unforeseeable"), because I've done them myself. My edge infrastructure had like half an hour of downtime total in many years, almost a decade already.

syshum · on July 18, 2020

I am a customer of a SaaS product that use AWS, and I never accept "AWS was down" and should the vendor use that I start looking for new vendors

dalyons · on July 18, 2020

haha good luck with that, the vast vast majority of saas products are not multicloud, and nor should they be (cost / benefit tradeoff)

syshum · on July 18, 2020

they should be multi-regional. If they are only in US-EAST and US-EAST is down well yes I blame them for that.

show me a time where all of AWS was down in every region all at the same time

creatonez · on July 21, 2020

The point is that the customer is likely using a dozen other sites that have also been hit by cloudflare outage, so they care less that your one specific site is knocked down. The customer immediately knows the blame is not on you when they notice how widespread the outage is.

mirimir · on July 18, 2020

Not OP but likely because there was nothing they could do.

LaserToy · on July 18, 2020

Because they don’t care about their users. Last time dns quit on us (and literally half of the internet went down), we ended up developing a backup plan.

cflewis · on July 18, 2020

Compare this to Facebooks SDK “postmortem” and you can tell which company cares more about its customers.

notwhereyouare · on July 18, 2020

They actually did a postmortem?

john-shaffer · on July 18, 2020

I think this is the "postmortem" he's referring to: https://developers.facebook.com/blog/post/2020/07/13/bug-now...

> Last week we made a server-side code change that triggered crashes for some iOS apps using the Facebook SDK. We quickly resolved the issue without requiring action from developers.

tpmx · on July 18, 2020

I'm pretty sure at least half a billion people were affected, and this is the best they can come up with...

LaserToy · on July 18, 2020

Isn’t it a cheap way to win hearts? As a customer I’m not really interested in what happened, I don’t want it to happen.

pikelet · on July 18, 2020

Bad things happen, and I'd rather we foster a culture of honesty rather than pretending.

LaserToy · on July 18, 2020

They can do whatever they want internally, I don’t even know what they said is true.

noahtallen · on July 18, 2020

Mistakes happen at all companies. I’m not sure it’s possible at all to have 100% uptime forever. Since it’s not possible (or at the very least, extremely unlikely) to never make a mistake, isn’t it much better to try to understand each other’s mistakes?

Certainly, cloudflare, could be lying to us. First, this seems super unlikely to me, but in the event that they are, it’s still a situation which would literally cause the described issue if it did happen. Therefore it can still be learned from in the same way. Again, it feels unlikely for it to be a lie considering the specificity and lack of other viable explanations.

LaserToy · on July 18, 2020

There are infinite amount of ways to break something.

Look, most of mistakes are silly or combination of silly. We think it is good to understand them, but in reality someone on the team probably pointed out that this can happen and was ignored, as it wasn’t a priority. And the biggest motivator to make sure companies prioritize uptime is to tell them, that we don’t care about their excuses, we care about uptime.

I wonder whether github still on MySql waiting for another outage.

Read the reason - human error. As old as humans, all I learned is that Cloudflare doesn’t have sufficient automations and checks in place.

heartbreak · on July 18, 2020

> I don’t even know what they said is true.

We can reasonably assume that the postmortem is truthful. Since they’re a publicly traded company, lying about this incident would be a quick way to turn an embarrassment into a felony.

combatentropy · on July 18, 2020

It's seems these major infrastructure outages always are from a configuration change. I remember Google had a catastrophic outage a few years ago, and the postmortem said it all began as a benign configuration update that snowballed worldwide. In fact I tried googling for it and found the postmortem of a more recent outage, also from a configuration change.

Some seasoned sysadmin will say to me, "Of course it's always from a configuration change. What else could it be?" I don't know, it seems like there are other possible causes. But in today's superautomated infrastructures, maybe config files are the last soft spot.

jeffbee · on July 18, 2020

There's always a lot of initiative to get rid of unsafe manual network config changes right after a major outage, but the difficulty of automating such changes is surprisingly high, and the rate at which the initiative decays is also surprisingly high.

clipjokingly · on July 18, 2020

Is it possible to test config changes in a simulated version of the network?

Rapzid · on July 18, 2020

> We saw traffic drop by about 50% across our network. Because of the architecture of our backbone this outage didn’t affect the entire Cloudflare network and was localized to certain geographies.

I'm not even sure.. Is that second sentence supposed to signal some sort of success? Dropping 50% of your traffic isn't isolated. If your gonna try to spin it, at least bury the damn lede. Further:

> The affected locations were San Jose, Dallas, Seattle, Los Angeles, Chicago, Washington, DC, Richmond, Newark, Atlanta, London, Amsterdam, Frankfurt, Paris, Stockholm, Moscow, St. Petersburg, São Paulo, Curitiba, and Porto Alegre. Other locations continued to operate normally.

Locations with THEIR equipment, but certainly not all "affected" locations. I live 4 hours from Dallas and can assure you that I was impacted. That coverage is like.. Most of the United States, Europe, Brazil and who knows how much of South America? Oh right, 50% of their traffic!

innocenat · on July 18, 2020

I read that as "It does not affect Asia". Which, while it's totally a wee hour, match what my monitor is telling me.

dannyw · on July 18, 2020

I appreciate the transparency of the 50% figure other than some generic spin “some connectivity was degraded”.

spenczar5 · on July 18, 2020

Wow, BGP brings down globally-used DNS. It’s like a perfect lesson in weak points of the modern web’s accidental design.

dilyevsky · on July 18, 2020

If you own a prefix and announce some bad routes that cause all your traffic to be blackholed due to misconfiguration on your end i dont see how it’s bgp’s fault.

rexarex · on July 18, 2020

Network Engineering: Invisible when you’re killing it at your job, instantly the enemy when you make a mistake.

dahfizz · on July 18, 2020

We need to do something about BGP.

Just in the past year Verizon, IBM, Apple, now Cloudflare have seen outages from BGP misconfiguration. The Verizon issue took down a significant part of the internet.

BGP is a liability to society. We need something which doesn't constantly cause widespread outages.

tyingq · on July 18, 2020

Any replacement would also need the ability to route traffic, and subject to similar risks. A "pre-push" testing simulator might be easier than throwing out BGP.

Cyph0n · on July 18, 2020

I recall watching a Microsoft talk where they explain how they do exactly this.

elithrar · on July 18, 2020

This was an iBGP issue, not eBGP.

It is entirely possible to cause a similar problem with OSPF or (lol) IS-IS, with the “right” misconfiguration and route metrics.

kortilla · on July 18, 2020

It’s not BGP, it’s the immature tooling around it as far as simulating changes, etc.

Any other tool that allows you to announce connectivity will have the same problem. In this scenario it was still even legitimate routes, it just was too much for their specific link to handle.

leoc · on July 18, 2020

I don't know how much better Juniper's stuff is, let alone Cloudflare's specific setup, but classic Cisco IOS gear's approach to batching, pretesting and merging configuration changes on mission-critical network infrastructure is basically "f*ck it, we'll do it live on the command line". Real '80s stuff, not far removed from "classic" Unix man-with-a-beard-and-a-Telnet-session configuration management (though the command line largely keeps outright syntax errors out of the config, and the (Nearly) One Big Config File makes basic restore-from-backup relatively straightforward.

notajoke · on July 18, 2020

Used to work on a bbone at an ATT competitor, used Juniper in the core. There was a lot of talk and sales too about SDN and the future of networking being all automated, but in the end there was always problems that needed a brain and an ssh session to fix

fach · on July 18, 2020

"BGP is a liability to society" seems a bit polarizing. Any system when told do to stupid stuff by a human via configuration will usually do stupid stuff. The right answer isn't to replace the underlying system.

simonswords82 · on July 18, 2020

In the last three years we have hosted our enterprise software on Azure, and the only outages we've had have been caused by mistakes or issues at Cloudflare. Azure has been rock solid but our customers don't understand that and assume that we're "just down", which impacts our SLAs.

During the most recent outage a few weeks ago, Azure were available to discuss the issue by phone. I wish I could say the same for Cloudflare.

I would be interested to hear from anybody who knows of a good alternative to Cloudflare. I'm completely fed up with them.

coder543 · on July 18, 2020

If you're so happy with Azure, why aren't you using Azure CDN? https://azure.microsoft.com/en-us/services/cdn/

Obviously, AWS and GCP offer their own CDN systems as well. (CloudFront and Cloud CDN, respectively)

There are tons of third party CDNs as well.

Unless by "good alternative" you mean that you're on Cloudflare's free plan, and hoping to find someone else who will willingly soak up huge amounts of bandwidth for free?

Cloudflare is one of the only services I know of that offers this, but it's hard to complain about a short outage every once in awhile when you're paying nothing or very little. The Cloudflare customers who are paying quite a bit are surely upset.

simonswords82 · on July 18, 2020

My head of product is looking at the Azure offerings and how they compare with what we have from Cloudflare as we speak.

No we are not using Cloudflare's free plan.

I was simply interested to know if anybody else had recommendations for a Cloudflare alternative.

bogomipz · on July 18, 2020

>"We are making the following changes:

Introduce a maximum-prefix limit on our backbone BGP sessions - this would have shut down the backbone in Atlanta, but our network is built to function properly without a backbone. This change will be deployed on Monday, July 20.

Change the BGP local-preference for local server routes. This change will prevent a single location from attracting other locations’ traffic in a similar manner. This change has been deployed following the incident."

It should be noted that configuring prefix limits for your BGP peers is kind of BGP 101. It's mentioned in every "BGP Best Practices" type document.[1] It's there for exactly this purpose to prevent router meltdown and resource exhaustion. For a company who blows their horn as much as these folks seem to about their network this is embarrassing.

I think it's worth mentioning that it was this time last year when Verizon bungled their own BGP configuration and brought down parts of the internet. When that incident occurred Cloudflare's CEO was front and center excoriating them for accepting routes without basic filtering [2]. This is exact same class of misconfiguration that befell them yesterday.

[1] https://team-cymru.com/community-services/templates/secure-b...

[2] https://twitter.com/eastdakota/status/1143182575680143361?la...

rob-olmos · on July 18, 2020

Commented on the original outage: Hopefully with this outage Cloudflare will provide non-Enterprise plans a CNAME record, allowing us to not use Cloudflare DNS and more quickly bypass Cloudflare if the need arises.

rwiggins · on July 18, 2020

I'm pretty sure CNAME setups are allowed on the Business plan ($200/month). I had to set one up not too long ago.

zamadatix · on July 18, 2020

If Cloudflare DNS is down how would that CNAME resolve? In the case that record is not hosted by Cloudflare DNS then why does it matter if it's a CNAME or not? Sorry, probably not familiar with the offering you're referring to.

toast0 · on July 18, 2020

The idea is you host in your DNS

foo.example.org. CNAME blah.customer.cloudflare.whatever. with a ttl of loke 5 minutes

Then when cloudflare goes down, you switch that to your origin server, or your static system is down page or something. Most of your traffic moves within 5 minutes and when you're satisfied Cloudflare is working again, you move traffic back.

If you've delegated your domain to Cloudflare, you can switch your delegation at your registrar, and a lot of TLDs update their servers pretty quick, but the TTL is usually at least a day, so you'll be waiting a while for traffic to move.

usr1106 · on July 18, 2020

But now you have an additional point of failure. You would need a very reliable DNS provider to host that CNAME. It increases your flexibility when CF has an issue, bit it does not necessarily increase the reliability of your site.

toast0 · on July 18, 2020

Yes, but if all your fancy names are simply cnames, you can use normal zone transfers to copy between servers, and use servers from multiple providers. Most recursive resolvers will retry requests against multiple delegated name servers until they get one that responds (or they all fail to respond). It adds some delay, so you wouldn't want the servers to be down often, but it's tolerable.

stepstop · on July 18, 2020

That makes sense. What is the cost difference between those plans? (I’m not a Cloudflare user, just curious)

jasongill · on July 18, 2020

The self service plans are free, $20, and $200 a month. Enterprise is $2000 a month and up (having recently been quoted on it)

iso947 · on July 18, 2020

2 years ago, about 1 in 3 people in the UK were watching England in the World Cup.

Towards the end of the game, a CDN the bbc used crashed, taking a million people’s live streams offline.

Traditional TV with its 20 million plus viewers worked fine.

A 15 minute global outage during the World Cup or Super Bowl is not acceptable in the world of boring old TV

Meanwhile github has been down how many times this year?

IT is still a terrible industry of cowboys. It’s just hidden under the veneer of abstaction, microservices and outsourcing. Other industries like the national grid or water of radio have outages that affect a local geographic area or a limited number of people, but they are far more distributed than the modern internet. It’s ironic a network designed to survive nuclear war can’t survive a typo.

https://m.huffingtonpost.co.uk/entry/bbc-iplayer-crashes-in-...

maxdo · on July 18, 2020

what is funny you go to cloudflare customers page, check all these companies status page, all down, non of them admits it's due to third party cloud provider e.g. cloudflare. In most cases it was "performance issue". Its so silly... you're in the interconnected world. It's ok your major cloud providers went down...

itsjloh · on July 18, 2020

Why did it take so long for a status page to be published I wonder?

From the timeline in the blog post the issue with Atlanta was fixed between 21:39 to 21:47 but a status page wasn't published until 21:37. Everything had been broken for over 20 minutes at that stage with lots of people already posting about it or other status pages reflecting issues. See https://twitter.com/OhNoItsFusl/status/1284239769548005376 or https://twitter.com/npmstatus/status/1284235702540984321

Without an accurate status page it leaves businesses pointing the finger everywhere wondering whether its their hosting provider having issues, their CDN, DNS provider etc etc.

eastdakota · on July 18, 2020

Because of the way we securely connect to StatusPage.io from most locations where our team is based. The traffic got blackholed in ATL, keeping us from updating it. An employee in our Austin office was finally able to use his Google Fi phone and connect through a PoP that wasn’t connected to our backbone so didn’t have traffic blackholed. Something we’ll address going forward.

itsjloh · on July 18, 2020

Damn that sucks, sounds like a stressful Friday evening for all involved. Thanks for taking the time to answer.

superkuh · on July 18, 2020

You mean the service that everyone is centralizing in caused problems because everyone centralized in it? Pikachu shock face. If you're web or network dev act responsibly. Don't just pick cloudflare because everyone else does. Don't pick cloudflare because everyone else does.

eric_khun · on July 18, 2020

I got the same issue today when I was on-call. took me 1 hour to figure out it was Cloudflare.

I'm currently working on a project to monitor all the 3rd party stack you use for your applications. Hit me up if you want, access I'll give free access for a year+ to some folks to get feedbacks.

dubcanada · on July 18, 2020

Not to take away from your project, but check out https://statusgator.com/

iJohnDoe · on July 18, 2020

Cloudflare is experiencing a bit of karma and a bit of Murphy’s Law.

They slung some mud not long ago and now it came to bite them. They were a bit righteous on their reliability. However, anyone in this game long enough knows it’s only a matter of time before shit goes down. If they didn’t have any graybeards over there to tell them this, then hopefully they earned some gray.

Stuff was down long enough for Googles and OpenDNS caches to expire, and to take down DigitalOcean in some respects.

Thankfully CF can afford to learn and make improvements for the future. Not all organizations are that lucky.

bnkamalesh · on July 18, 2020

> They slung some mud not long ago...

Oh, at whom?

bogomipz · on July 18, 2020

https://twitter.com/eastdakota/status/1143182575680143361?la...

pilif · on July 18, 2020

https://isbgpsafeyet.com/

Half of the internet

iJohnDoe · on July 18, 2020

Never mind, it wasn’t recently. Still kind of ironic.

https://news.ycombinator.com/item?id=20267790

wolco · on July 18, 2020

Digital Ocean uses flaire?

devy · on July 18, 2020

This is not the first time human error on BGP routing configuration which then caused a significant portion of the Internet down. Is there any kind of configuration validator that can be implemented to prevent and catch this type of errors? I am fairly sure this won't be the last time we will hear about human error on BGP routing config causing Internet down.

Or is BGP intrinsically a unsafe protocol without builtin protections on this sort of human mistakes?

dcow · on July 18, 2020

BGP is a fitting name for such a distinct plane of computation. Indeed the border where any remaining physical concerns are cut loose and reality melts, receding behind a shroud of gateways, giving way to the vast expanse of cyberspace. Traffic whirls past. Raw elemental ether flows with abundance in this region. Any wizard who happens to experience even brief exposure to it normally considers themselves lucky. But to be enlisted to serve as a warden of the border, whether punishment or honor, is a a responsibility most high. BGP is either the final, or the primal, abstraction depending on which side of the gates you most intimately inhabit. And the task of maintaining it a meticulous and manual art.

solotronics · on July 18, 2020

We live and die by the finite state machine, hallowed be the Protocol.

tomklein · on July 18, 2020

That was beautifully written.

spenczar5 · on July 18, 2020

It's really hard because it's often not really that your configuration is invalid. It's not a syntax error. It's that BGP provides a way for networks to tell each other about how they see the world, and sometimes what you say can melt down everyone else's systems.

It's a dynamic, constantly-changing system, and the effects of your actions may not always be seen - it's not always obvious how other networks behave and will react to your route announcements. And so even trying to snapshot the current state and say "hypothetically, would this change be a bad idea?" can be hard to get right.

Now, this particular case was of internal BGP use at Cloudflare, so everything I said doesn't necessarily apply... but it still does a little. Even internal networks can be so complicated, they may as well be un-analyzable.

I think the problems here are pretty deep, unfortunately, and have to do with our "network of networks" design.

apaprocki · on July 18, 2020

It would be interesting to classify network outages and determine the number that involved practices that would be obviated by a standard VCS / release process like found in software. Routers/firewalls seem to be a particular pain point everywhere.

jlgaddis · on July 19, 2020

I guess it's easy to become complacent when you're a networking expert at Cloudflare and likely making several of these ad-hoc, on-the-fly config changes every single day, but it's always good to remember why Juniper introduced the automatic rollback feature.

Of course, this particular outage would not have been prevented even if they had used

  # commit confirmed

as it can't stop you from screwing up but it almost certainly would have limited the duration of the outage to ~10 minutes (plus a minute or two, perhaps, for the network to recover after the rollback) -- and it could have been shorter than that had they used "commit confirmed 3", for example.

Even as a lowly network engineer working on networks much, much smaller than Cloudflare's, for pretty much my entire professional career, it's been my standard practice to start off pretty much ANY change -- no matter how trivial -- with

  # commit confirmed 3

or

  # reload in 3

or similar, depending on what type of gear I was working on (and, of course, assuming the vendor supported such a feature).

This applies even when making a changes that are so "simple" that they just "can't" go wrong or have any unexpected or unintended effects.

In fact, it applies ESPECIALLY in those case! It's when you let your guard down that you'll get hit.

---

Fortunately, all that was necessary (I assume) to recover in this case was to

  # rollback

to the previous configuration. Then, the correct configuration could be made. That still had to be done manually, however, and resulted in a 27 minute outage instead of what could have been a 5 or 10 minute outage.

I would hope that Cloudflare has full out-of-band access to all of their gear and are able to easily recover from mistakes like this. If they had lost access to the Atlanta router and weren't able to log in and revert the configuration manually, this outage could have lasted much, much longer.

mathattack · on July 18, 2020

Did this sink any eCommerce websites?

synunlimited · on July 18, 2020

Shopify went down so all of those storefronts were down.

mathattack · on July 18, 2020

Wow - so real dollars!

bryan_w · on July 18, 2020

Doordash went down

jitbit · on July 18, 2020

Everyone has outages, CloudFlare is a decent company making a good product.

BUT

What's interesting here is that so many non-CloudFlare services went down (including even AWS - partially) caused by DNS outage - because every sysadmin and his mom are using 1.1.1.1 as their DNS.

KozmoNau7 · on July 18, 2020

And that is precisely why you shouldn't try to centralize like that and send everything to and through the same vendor.

Decentralization is what makes the internet function in a robust fashion.

sm2i · on July 18, 2020

looks like Tesuto was a good buy

https://investors.fastly.com/news/news-details/2020/Fastly-A...:

> By emulating networks at scale, Tesuto’s technology can be used to create sandbox environments that simulate the entire Fastly network, providing a view into the potential impact of a deployment cycle before it is put into production.

kissgyorgy · on July 18, 2020

It's fine that they made mistakes and there were an outage, shit happens to everybody.

What is really scary though that half of the internet stopped working. That's not ok!

bamboozled · on July 18, 2020

It's amazing to me that in this situation, humans still had to intervene and update a configuration file.

I'm surprised stuff like this wouldn't happen more often and there would at least be a well tested, automated remediation step in place which also validates the change prior to going live.

I get they may be busy solving other issues, but it's interesting this isn't a more fool proof procedure given the huge impact a mistake can have.

microcolonel · on July 18, 2020

I wish they would go into why the rather complete outage was not visible to cloudflarestatus.com. I fully understand that mistakes can be made, but I'm really not pleased with how hard it was to tell if I was experiencing a localized issue. During the entire outage, cloudflarestatus.com displayed "all systems operational" for me, once accessed it with a functioning DNS resolver.