I love 5+ why's. I find it to be a fantastic tool in many situations. Unfortunately, when leadership does not reward a culture of learning, Five Why's can become conflated with root cause analysis and turn into a directed inquiry aimed at reaching a politically expedient cause. The bigger the fuck up, the more it needs an impartial, NTSB-like focus on learning and sharing so it doesn't happen again.
Fwiw, if I were your manager performing a root cause analysis, I'd mostly expect my team to be identifying contributing factors within their domain, and then we'd collate and analyze the factors with other respective teams to drill down to the root causes. I'd also have kicked back a doc that was mostly about blaming the other team.
The excellent thing I learned about 5 whys is that not only is it not really just 5, as you allude to with “5+”, but it’s also a *tree* instead of a linear list. Often a why will lead to more than one answer, and you really have to follow all branches to completion. The leaf nodes are where the changes are necessary. Rather than identifying one single thing that could have prevented the incident, you often identify many things that make the system more robust, any one of which would have prevented the incident in question.
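Just to make the tree idea concrete, here's a minimal sketch (the class and questions are mine, not any standard formulation): each why-node holds every answer it produced, and walking to the leaves gives you the full set of fixes.

```python
from dataclasses import dataclass, field

@dataclass
class Why:
    """One node in a 5-whys tree: a question plus every answer it produced."""
    question: str
    answers: list["Why"] = field(default_factory=list)

    def leaves(self) -> list["Why"]:
        """Leaf nodes are where the changes are necessary: follow every branch to its end."""
        if not self.answers:
            return [self]
        return [leaf for answer in self.answers for leaf in answer.leaves()]

# A toy incident, branching at the second why:
root = Why("Why did the release break?", [
    Why("Why did the bad flag reach prod?", [
        Why("Why did review miss the typo?"),
        Why("Why didn't devtest catch it?"),
    ]),
    Why("Why was a flag changed the day before handoff?"),
])

# Each leaf is a separate improvement, any one of which would have
# prevented this particular incident.
for fix in root.leaves():
    print(fix.question)
```

The point of the structure is exactly what's described above: a linear list forces you to pick one branch and discard the rest, while the tree keeps every contributing factor visible.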
> I'd also have kicked back a doc that was mostly about blaming the other team.
Agreed. If the test team messed up, then you need to answer the "why" your team didn't verify that the testing had actually been done. (And also why the team hadn't verified that the tool they'd sent to testing was even minimally functional to begin with.)
Five whys are necessarily scoped to a set of people responsible. For things that happen outside that scope, the whys become about selection and verification.
Validating that the build you produced works at all should be done by you, but there's also a whole team whose job it was to validate it; would you advocate for another team to test the testing team's tests?
And more to the point, how do you write a 5 why's that explains how you'd typo'd a flag to turn a feature on, and another team validated that the feature worked?
> how do you write a 5 why's that explains how you'd typo'd a flag
Seriously? Even without knowing any context, there’s a handful of universal best practices that had to Swiss cheese fail for this to even get handed off to devtest…
- Why are you changing feature flags the day before handoff? Is there a process for a development freeze before handoff, e.g. only showstopper changes after the freeze? Yes, but sales asked for it so they could demo at a conference. Why don't we have a special build/deployment pipeline for the experimental features our sales/marketing engineers keep asking for?
- Was it tested by the developer before pushing? Yes. Why did it succeed at that point and fail in prod? The environment was different. Why do we not have a dev environment that matches prod? Money? Time? Politics?
- Was it code reviewed? Did it get an actual review, or was it rubber-stamped? Reviewed, but only the important parts were skimmed. Why was it not reviewed more carefully? Not enough time. Why is there not enough time to do code reviews?
- Oh, the feature flag name used an underscore instead of a hyphen. Why did this not get flagged by a style checker? Oh, so there are no clear style conventions for feature flags and each team does its own thing...? Interesting...
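And once a convention exists, that last gap is cheap to close. A toy lint check, assuming a made-up rule that flag names are lowercase kebab-case (the names here are illustrative, not from the incident):

```python
import re

# Hypothetical convention: feature flag names are lowercase kebab-case,
# e.g. "new-checkout-flow". Underscores (the typo in question) are rejected.
FLAG_PATTERN = re.compile(r"[a-z]+(-[a-z]+)*")

def check_flag(name: str) -> bool:
    """Return True if the flag name follows the kebab-case convention."""
    return bool(FLAG_PATTERN.fullmatch(name))

print(check_flag("new-checkout-flow"))   # conforms
print(check_flag("new_checkout_flow"))   # underscore: rejected
```

Wiring something like this into CI is the kind of leaf-node fix the tree above surfaces: small, mechanical, and it removes a whole class of typo.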
Curious, do your 5 why's actually look like this, kind of stream-of-consciousness? Because I love this! Our org's 5 why's are always a linear 5 steps back that end at THE ROOT CAUSE. And those are the good ones. Others are just a list of five things that happened before or during the incident.
I've always pushed to get rid of this section of the postmortem template, or rename it, or something, because framing everything into five linear steps is never accurate, and having it be part of the template robs us of any deeper analysis or discussion because that would be redundant. But, it's hard to win against tradition and "best practices".
Just saying, once you find out the testing team is unreliable, you make sure there's some form of evidence that testing actually happened, which someone on your team reviews before signing off.
It doesn't take a whole team. There are lots of mechanisms to produce that evidence. This is just how it works. If two checks aren't sufficient, it becomes three. Or four. Until problems stop making it through.
I think that's the point. If you have an incompetent team or team member the number of checks around them can grow astronomically and still you will have problems. At a certain point the systemic problem can become "the system is unwilling to replace this person/team with a competent one".
(That said, this is only in the case of persistent problems. Everyone can be inattentive some of the time, and a sensible quality system can be very helpful here. It's when the system tries to be a replacement for actually knowing what you're doing that things go off the rails)