I mean this in only a slightly judgemental way: What kind of change management/testing is going on over there at CF? This is not the first time that someone at CF has made a hasty config change that brought down a significant part of the network.
Now I'll admit that proper change management won't catch every issue, but the issue described in this post seems like something that should have been caught. It's a little worrisome that a company that so much of the internet relies on is apparently playing fast and loose with major config changes. The changes you describe in the post-mortem sound like they will fix the immediate problem and possibly prevent future occurrences of this exact problem, but what about broader changes to how you deploy these things?
Not only that, but this single config change brought down not just the CDN but both 1.1.1.1 and the secondary 1.0.0.1. Was this type of failure never tested, or...? What's the point in having both if they both go down at the same time?
I would assume CF operations folks would be constantly making changes that affect “production” many times a day. Some of it will be to counteract an issue that originally started because of an external issue, like they mention here.
In an ideal world, yes, there would be higher-level tools to do everything and checks and balances for everything, but I assume half the tools are like our standard Unix CLI tools: all too powerful, and the operator better know what they are doing.
By the time they have all the right tools for everything, the company would have become a dinosaur disrupted by some other technology or company
>I would assume CF operations folks would be constantly making changes that affect “production” many times a day. Some of it will be to counteract an issue that originally started because of an external issue, like they mention here.
Even in an "emergency" situation like the one described in the post-mortem (which I'm actually confused about, because Prince's tweet said it was "routine maintenance", which makes it even less justifiable), there should be a standard playbook with pre-tested and pre-approved failover options to enact. There definitely should not be someone manually keying in a config change on the fly.
>In an ideal world, yes, there would be higher-level tools to do everything and checks and balances for everything, but I assume half the tools are like our standard Unix CLI tools: all too powerful, and the operator better know what they are doing.
This is fine for some startup still hacking their way through an MVP, but for a global, multi-billion-dollar company that prides itself on supporting a significant part of the global internet, "the operator better know what they are doing" is simply not good enough. There should be multiple layers of protection against this stuff, so that even if an operator is having a nutty, they don't break half the internet.
>By the time they have all the right tools for everything, the company would have become a dinosaur disrupted by some other technology or company
I can't agree at all. We're talking about some basic change management and testing stuff here. I've been at plenty of organizations (even ones much smaller and with fewer resources than CF) that have gotten this right, and they certainly are not dinosaurs, nor have they been "disrupted". Testing and change management are as fundamental as encryption or password hashing. If you don't have them, it's not because it's hard; it's because you haven't bothered to implement them.
For the record, I don’t work for cloudflare. I was merely pointing out that failsafe, bulletproof tools for every situation are a myth, in my humble opinion. If you have seen and worked for organizations that know how to do this better, good for you.
>If you have seen and worked for organizations that know how to do this better, good for you.
Let's be clear: I'm not talking about some small subset of fantasy companies. It is a strict compliance requirement for many of the world's largest industries that you must have things like change management processes, incident response playbooks, and testing in place. A "myth" it certainly is not. Again, it's a fundamental thing no different than the most basic security requirements that we also expect (which is no coincidence, as one of the components of many security frameworks is proper change management controls). I'm not saying it has to be bulletproof and perfect (in fact, it never will be), but you still have to have something.
Put another, possibly more easily understood way: "don't test in production" is a requirement (sometimes even a legal requirement) for most companies once they reach a certain size. It's not just a funny saying that devs joke about.
Compliance can make you do something that looks like code review, something that looks like testing, something that looks like runbooks, etc.
That can't will into existence engineers who are actually thoughtful, creative, and skilled at those things. It can't make the test suite actually anticipate the bugs that will be written. It can't make the runbook have correct steps, with no unwanted side effects, for every failure mode. It can't make the reviewer see what the author doesn't.
Two, maybe three people in my 100-person org are legitimately good at this stuff. No amount of grandstanding about what a serious business we are changes that. In fact the people most enthusiastic about compliance and process usually end up undermining the real, substantive technical quality that the compliance process was aiming at.
Absolutely. I'll be the first to say that compliance doesn't mean you're actually good at those things (and I actually have said as much here on HN).
However, the point remains that things like change management aren't some nebulous, pie-in-the-sky concept. That they are in compliance frameworks speaks to the fact that they are very fundamental and basic (compliance frameworks are typically the bare minimum of things you should be doing).
A half-assed CM process won't catch all errors by any means. But it will catch the most basic ones. And according to CF's post-mortem, the initial issue that kicked off the chain of events was "backbone congestion", which has got to be one of the most basic things that would be included in an IR playbook (Prince even referred to it as "routine maintenance" in an earlier message). And then the config change that was put in place also seems like a fairly basic change that probably would have been caught with some basic testing.
That's what catches my eye the most. This incident wasn't some wild, niche, couldn't-have-been-predicted event. It was "backbone congestion", followed by an attempted change of routes on the backbone to alleviate the congestion. For an internet services company to not have a standard, pre-approved and pre-tested solution for resolving something as predictable as backbone congestion is shocking to me.
You are right if the facts are exactly as presented and there is nothing more to know.
I would say we should read between the lines here. Most postmortems are, in some sense, marketing artefacts designed to reassure customers, so taking them at face value is not really a good idea. I have written and read enough of them to know that they are rarely ever the full story, for a variety of reasons.
It is quite possible that, as you say, they are playing fast and loose with config and have systemic process risks.
It is also possible that the issue is not just routine congestion, i.e. it is a symptom and not the root cause, and that they are not talking about it in detail in order to keep the communication simple enough for non-networking experts to understand, while still having enough depth for technical folks to roughly follow along and reassure their managements that cloudflare has a handle on it. It could also be that revealing more would give their competitors crucial information about their IP/architecture, and they are calling it routine congestion so as not to reveal that.
Lastly, all of the above are not mutually exclusive; it could be all of them in part.
Attitudes like this are the reason we don't have better practice industry-wide. Seeing quality control as the enemy of success doesn't create good software; it creates technical debt and discourages prospective ops engineers.
The teams I've seen be the most comfortable with these knowably bad processes have been the strongest advocates against guardrails. The cost to your reputation and bottom line when something does go wrong is opaque to many engineers, but these kinds of ridiculous errors do lose customers.
Edit: A common saying where I work is that there are no stupid mistakes, only bad processes. The first step to rooting them out is admitting that they're process-level issues.
My problem with process/policy is that it tends to attract a) bureaucrats, and b) the kind of engineer who is not particularly interested in or good at building products but still wants to ladder-climb.
These are not the people I want dictating my workflow.
That doesn't mean it's the "enemy of success." If we could raise its status to attract people with genuine technical competence and empathy for the realities of building, we could do a lot better.
>Edit: A common saying where I work is that there are no stupid mistakes, only bad processes. The first step to rooting them out is admitting that they're process-level issues.
Human stupidity is unbounded; such a process asymptotically approaches "do not change production ever." If you're a monopoly with a money printer, then keeping it running probably is more important than any improvement you could make. If you're under competitive pressure, you need to ship, and shipping necessarily involves risk.
Can you name one large service/platform provider where this hasn't happened? We know the likes of Amazon, Google, Microsoft, et al. have gone down for similarly idiotic reasons. If one existed, I'd either give them all of my business or start betting against them, since their day is due.
The fact that such a critical problem can only be detected by a human noticing it is, itself, part of the problem. Having multiple people look at the change instead of just one is a little bit better, but it's still just a band-aid. Ideally, there would be automated processes in place that could prevent and/or mitigate this kind of thing. (If we were talking about software instead of configuration, how would you view an organization that required every commit to be reviewed, but had no automated tests?)
One possibility would be to parse the config file and do a rudimentary simulation, taking into account traffic volumes, and warn the user if the result would be expected to overload any routes.
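To make that concrete, here's a minimal sketch of what such a pre-flight check could look like. Everything here is hypothetical: the names, the numbers, and the assumption that you can get a parsed route-to-link mapping and per-prefix traffic estimates; real modelling would be far more involved.

    # Hypothetical pre-flight check: estimate per-link load under a proposed
    # routing config and flag anything that would be pushed near capacity.

    def simulate_link_load(routes, traffic_gbps):
        """routes: prefix -> backbone link; traffic_gbps: prefix -> expected volume."""
        load = {}
        for prefix, link in routes.items():
            load[link] = load.get(link, 0.0) + traffic_gbps.get(prefix, 0.0)
        return load

    def check_config(proposed_routes, traffic_gbps, link_capacity_gbps):
        warnings = []
        for link, gbps in simulate_link_load(proposed_routes, traffic_gbps).items():
            capacity = link_capacity_gbps.get(link, 0.0)
            if gbps > 0.8 * capacity:  # warn well before the link would saturate
                warnings.append(f"{link}: projected {gbps:.0f} Gbps of {capacity:.0f} Gbps")
        return warnings

    # A change that funnels everything through one site should trip the check.
    for w in check_config(
        proposed_routes={"198.51.100.0/24": "atl-bb-1", "203.0.113.0/24": "atl-bb-1"},
        traffic_gbps={"198.51.100.0/24": 400, "203.0.113.0/24": 500},
        link_capacity_gbps={"atl-bb-1": 600},
    ):
        print("WARNING:", w)

Even a crude estimate like that would flag a change that routes most of the backbone through a single site.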
Another possibility would be to do something a bit smarter than just instantly, blindly propagating a change to every peer at the same time. If the bad routes had been incrementally rolled out to other routers, there might have been time for someone to notice that the metrics looked abnormal before Atlanta became overloaded. (I don't know whether that's feasible with BGP, or if it would require a smarter protocol, but it seems like it would be worth looking into.)
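I don't know whether BGP gives you clean hooks for this either, but at the orchestration layer a staged rollout is easy to sketch; apply_change and metrics_healthy below are hypothetical placeholders for whatever pushes the config and reads the dashboards.

    # Hypothetical staged rollout: push a routing change to a few routers at a
    # time, let it soak, and halt if the relevant metrics drift out of bounds.
    import time

    def rollout(change, routers, apply_change, metrics_healthy,
                batch_size=2, soak_seconds=300):
        for i in range(0, len(routers), batch_size):
            batch = routers[i:i + batch_size]
            for router in batch:
                apply_change(router, change)
            time.sleep(soak_seconds)        # let traffic shift and metrics settle
            if not metrics_healthy(batch):  # e.g. link utilization, packet loss, CPU
                raise RuntimeError(f"metrics degraded after {batch}; halting rollout")

    # Example (stubbed): rollout("drain-atl-backbone", ["edge-1", "edge-2", "edge-3"],
    #                            apply_change=push_config, metrics_healthy=check_dashboards)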
Finally, it seems like if a config change is made to a router and it immediately crashes, there should be systems that help to correlate those two events, so that it doesn't take half an hour to identify and revert the change.
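Even a naive version of that, just "what changed on this device in the N minutes before it went unhealthy", would point at the likely culprit much faster. A toy sketch, with an invented change-log format:

    # Hypothetical correlation check: when a device goes unhealthy, surface any
    # config change applied to it shortly beforehand as a likely culprit.
    from datetime import datetime, timedelta

    def recent_changes(change_log, device, incident_time, window_minutes=15):
        """change_log: list of {'device', 'time', 'change_id'} dicts."""
        window = timedelta(minutes=window_minutes)
        return [
            entry for entry in change_log
            if entry["device"] == device
            and timedelta(0) <= incident_time - entry["time"] <= window
        ]

    log = [{"device": "atl-edge-1",
            "time": datetime(2020, 7, 17, 21, 12),
            "change_id": "CHG-1234"}]
    suspects = recent_changes(log, "atl-edge-1", datetime(2020, 7, 17, 21, 13))
    print("Suspect changes:", [s["change_id"] for s in suspects])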
It’s probably easy to say this as an outside party, but anyone with the most basic understanding of Junos would have known that by deactivating all of the entries in the from section of a policy or policy term, the term would match all routes, and the accompanying then section would apply to all of them.
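If that's right, it's also exactly the kind of thing a pre-commit lint could catch mechanically. A rough sketch, assuming the config has already been parsed into a simple structure (the representation below is invented, not Junos' actual data model):

    # Hypothetical lint: flag any policy term whose match conditions have all
    # been deactivated, since a term with no active 'from' conditions matches
    # every route and its 'then' actions apply to all of them.

    def lint_policy(policy):
        findings = []
        for term in policy["terms"]:
            conditions = term.get("from", [])
            if conditions and not any(c["active"] for c in conditions):
                findings.append(
                    f"term '{term['name']}': all match conditions deactivated; "
                    f"actions {term['then']} would apply to ALL routes"
                )
        return findings

    policy = {"terms": [{
        "name": "6-TERM-EXAMPLE",
        "from": [{"value": "198.51.100.0/24", "active": False}],
        "then": ["accept"],
    }]}
    print(lint_policy(policy))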
Ensuring that people with the appropriate knowledge for each type of change, as well as appropriate testing processes, are part of the workflow is itself part of good change management.