I love 5+ why's. I find it to be a fantastic tool in many situations. Unfortunately, when leadership does not reward a culture of learning, Five Why's can become conflated with root cause analysis and turn into a directed inquiry aimed at reaching a politically expedient cause. The bigger the fuck up, the more it needs an impartial, NTSB-like focus on learning and sharing so it doesn't happen again.
Fwiw, if I were your manager performing a root cause analysis, I'd mostly expect my team to be identifying contributing factors within their domain, and then we'd collate and analyze the factors with other respective teams to drill down to the root causes. I'd also have kicked back a doc that was mostly about blaming the other team.
The excellent thing I learned about 5 whys is that not only is it not really just 5, as you allude to with “5+”, but it’s also a *tree* instead of a linear list. Often a why will lead to more than one answer, and you really have to follow all branches to completion. The leaf nodes are where the changes are necessary. Rather than identifying one single thing that could have prevented the incident, you often identify many things that make the system more robust, any one of which would have prevented the incident in question.
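Just to make the tree idea concrete, here's a minimal sketch (the class and questions are mine, not any standard formulation): each why-node holds every answer it produced, and walking to the leaves gives you the full set of fixes.

```python
from dataclasses import dataclass, field

@dataclass
class Why:
    """One node in a 5-whys tree: a question plus every answer it produced."""
    question: str
    answers: list["Why"] = field(default_factory=list)

    def leaves(self) -> list["Why"]:
        """Leaf nodes are where the changes are necessary: follow every branch to its end."""
        if not self.answers:
            return [self]
        return [leaf for answer in self.answers for leaf in answer.leaves()]

# A toy incident, branching at the second why:
root = Why("Why did the release break?", [
    Why("Why did the bad flag reach prod?", [
        Why("Why did review miss the typo?"),
        Why("Why didn't devtest catch it?"),
    ]),
    Why("Why was a flag changed the day before handoff?"),
])

# Each leaf is a separate improvement, any one of which would have
# prevented this particular incident.
for fix in root.leaves():
    print(fix.question)
```

The point of the structure is exactly what's described above: a linear list forces you to pick one branch and discard the rest, while the tree keeps every contributing factor visible.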
> I'd also have kicked back a doc that was mostly about blaming the other team.
Agreed. If the test team messed up, then you need to answer the "why" your team didn't verify that the testing had actually been done. (And also why the team hadn't verified that the tool they'd sent to testing was even minimally functional to begin with.)
Five whys are necessarily scoped to a set of people responsible. For things that happen outside that scope, the whys become about selection and verification.
Validating that the build you produced works at all should be done by you, but there's also a whole team whose job it was to validate it; would you advocate for another team to test the testing team's tests?
And more to the point, how do you write a 5 why's that explains how you'd typo'd a flag to turn a feature on, and another team validated that the feature worked?
> how do you write a 5 why's that explains how you'd typo'd a flag
Seriously? Even without knowing any context, there’s a handful of universal best practices that had to Swiss cheese fail for this to even get handed off to devtest…
- Why are you changing feature flags the day before handoff? Is there a process for a development freeze before handoff, e.g. only showstopper changes after the freeze? Yes, but sales asked for it so they could demo at a conference. Why don't we have a special build/deployment pipeline for the experimental features our sales/marketing engineers keep asking for?
- Was it tested by the developer before pushing? Yes. Why did it succeed at that point and fail in prod? The environment was different. Why do we not have a dev environment that matches prod? Money? Time? Politics?
- Was it code reviewed? Did it get an actual review, or was it rubber-stamped? Reviewed, but only the important parts were skimmed. Why was it not reviewed more carefully? Not enough time. Why is there not enough time to do code reviews?
- Oh, the feature flag name used an underscore instead of a hyphen. Why did this not get flagged by a style checker? Oh, so there are no clear style conventions for feature flags and each team does its own thing...? Interesting...
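And once a convention exists, that last gap is cheap to close. A toy lint check, assuming a made-up rule that flag names are lowercase kebab-case (the names here are illustrative, not from the incident):

```python
import re

# Hypothetical convention: feature flag names are lowercase kebab-case,
# e.g. "new-checkout-flow". Underscores (the typo in question) are rejected.
FLAG_PATTERN = re.compile(r"[a-z]+(-[a-z]+)*")

def check_flag(name: str) -> bool:
    """Return True if the flag name follows the kebab-case convention."""
    return bool(FLAG_PATTERN.fullmatch(name))

print(check_flag("new-checkout-flow"))   # conforms
print(check_flag("new_checkout_flow"))   # underscore: rejected
```

Wiring something like this into CI is the kind of leaf-node fix the tree above surfaces: small, mechanical, and it removes a whole class of typo.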
Curious, do your 5 why's actually look like this, kind of stream-of-consciousness? Because I love this! Our org's 5 why's are always a linear 5 steps back that end at THE ROOT CAUSE. And those are the good ones. Others are just a list of five things that happened before or during the incident.
I've always pushed to get rid of this section of the postmortem template, or rename it, or something, because framing everything into five linear steps is never accurate, and having it be part of the template robs us of any deeper analysis or discussion because that would be redundant. But, it's hard to win against tradition and "best practices".
Just saying, once you find out the testing team is unreliable, you make sure there's some form of evidence that testing actually happened, which someone on your team reviews before signing off.
It doesn't take a whole team. There are lots of mechanisms to produce that evidence. This is just how it works. If two checks aren't sufficient, it becomes three. Or four. Until problems stop making it through.
I think that's the point. If you have an incompetent team or team member the number of checks around them can grow astronomically and still you will have problems. At a certain point the systemic problem can become "the system is unwilling to replace this person/team with a competent one".
(That said, this is only in the case of persistent problems. Everyone can be inattentive some of the time, and a sensible quality system can be very helpful here. It's when the system tries to be a replacement for actually knowing what you're doing that things go off the rails)