The fact that such a critical problem can only be detected by a human noticing it is, itself, part of the problem. Having multiple people look at the change instead of just one is a little bit better, but it's still just a band-aid. Ideally, there would be automated processes in place that could prevent and/or mitigate this kind of thing. (If we were talking about software instead of configuration, how would you view an organization that required every commit to be reviewed, but had no automated tests?)
One possibility would be to parse the config file and do a rudimentary simulation, taking into account traffic volumes, and warn the user if the result would be expected to overload any routes.
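To make that concrete, here's a rough sketch of the kind of check I mean. Everything in it is made up for illustration: the link names, the traffic numbers, the 80% headroom threshold, and the idea that routes are already available as a simple flow-to-path mapping. A real tool would have to derive those paths from the config, which is the hard part.

```python
from collections import defaultdict


def overloaded_links(routes, demand_gbps, capacity_gbps, headroom=0.8):
    """Return {link: (simulated_load, allowed_load)} for every link whose
    simulated load exceeds headroom * capacity."""
    load = defaultdict(float)
    for flow, path in routes.items():
        for link in path:
            # Each flow adds its full demand to every link on its path.
            load[link] += demand_gbps.get(flow, 0.0)
    return {
        link: (total, headroom * capacity_gbps[link])
        for link, total in load.items()
        if total > headroom * capacity_gbps[link]
    }


# Hypothetical topology and traffic, not taken from the incident.
capacity_gbps = {("chi", "atl"): 100.0, ("dal", "atl"): 100.0}
demand_gbps = {"flow_a": 60.0, "flow_b": 50.0}

# Routes before the change, and routes the proposed config would produce.
current = {"flow_a": [("chi", "atl")], "flow_b": [("dal", "atl")]}
proposed = {"flow_a": [("chi", "atl")], "flow_b": [("chi", "atl")]}

warnings_current = overloaded_links(current, demand_gbps, capacity_gbps)
warnings_proposed = overloaded_links(proposed, demand_gbps, capacity_gbps)

for link, (total, allowed) in warnings_proposed.items():
    print(f"WARNING: {link} would carry {total} Gbps (allowed {allowed})")
```

Even something this crude would flag a change that piles two large flows onto one link, as long as the traffic volumes fed into it are roughly right.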
Another possibility would be to do something a bit smarter than just instantly, blindly propagating a change to every peer at the same time. If the bad routes had been incrementally rolled out to other routers, there might have been time for someone to notice that the metrics looked abnormal before Atlanta became overloaded. (I don't know whether that's feasible with BGP, or if it would require a smarter protocol, but it seems like it would be worth looking into.)
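As a sketch of what "incrementally" could look like, independent of whether BGP itself can do it: push the change to a small batch, check health, then widen. The `apply`, `revert`, and `healthy` callables and the batch sizes here are hypothetical placeholders for whatever the real tooling provides.

```python
def staged_rollout(routers, apply, revert, healthy, batch_sizes=(1, 2, 4)):
    """Apply a change in growing batches. If a health check fails, revert
    every router touched so far and stop.

    Returns (succeeded, routers_still_carrying_the_change)."""
    applied = []
    remaining = list(routers)
    sizes = list(batch_sizes)
    while remaining:
        # Once the configured sizes run out, take everything that is left.
        size = sizes.pop(0) if sizes else len(remaining)
        batch, remaining = remaining[:size], remaining[size:]
        for router in batch:
            apply(router)
            applied.append(router)
        # A real system would wait for a soak period here before judging.
        if not healthy(applied):
            for router in reversed(applied):
                revert(router)
            return False, []
    return True, applied


# Hypothetical demo: the change becomes harmful once more than two
# routers carry it.
state = set()
touched = []


def apply_change(router):
    touched.append(router)
    state.add(router)


ok, carrying = staged_rollout(
    ["r1", "r2", "r3", "r4", "r5", "r6"],
    apply=apply_change,
    revert=state.discard,
    healthy=lambda applied: len(applied) <= 2,
)
```

In this demo the rollout stops after the second batch, so the last three routers never see the bad change at all.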
Finally, it seems like if a router crashes immediately after a config change is applied to it, there should be systems that help to correlate those two events, so that it doesn't take half an hour to identify and revert the change.
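The correlation itself doesn't need to be clever. Here's a minimal sketch, assuming change and crash events are already collected somewhere with timestamps; the event tuples, router names, change IDs, and the five-minute window are all invented for the example.

```python
from datetime import datetime, timedelta


def suspect_changes(changes, crashes, window=timedelta(minutes=5)):
    """For each crash, list the IDs of changes applied to the same router
    no more than `window` before the crash."""
    suspects = {}
    for crash_router, crash_time in crashes:
        suspects[(crash_router, crash_time)] = [
            change_id
            for change_id, router, applied_at in changes
            if router == crash_router
            and timedelta(0) <= crash_time - applied_at <= window
        ]
    return suspects


# Hypothetical event logs: (change_id, router, applied_at) and (router, crashed_at).
changes = [
    ("chg-101", "atl-1", datetime(2024, 1, 1, 9, 0, 0)),
    ("chg-102", "atl-1", datetime(2024, 1, 1, 12, 0, 0)),
    ("chg-103", "dal-1", datetime(2024, 1, 1, 12, 0, 30)),
]
crashes = [("atl-1", datetime(2024, 1, 1, 12, 1, 0))]

report = suspect_changes(changes, crashes)
for (router, crashed_at), change_ids in report.items():
    print(f"{router} crashed at {crashed_at}; recent changes: {change_ids}")
```

Surfacing that list next to the crash alert would point whoever is on call at the most likely culprit right away.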