Perhaps a stupid question, but why don’t they test the firmware updates internal...

agilob · 2025-03-14T19:56:58 1741982218

Today a tech lead with the admin role on GH opened a PR, approved it himself and merged it, because he could override GH rules. The PR had failing unit tests. It went straight to prod and caused 20 minutes of downtime for one feature. We do test things, sometimes you're just not prepared for all the permutations of the idiocy out there...

This is more common than you think. Only a few days ago an HP update bricked their printers https://arstechnica.com/gadgets/2025/03/firmware-update-bric...

A similar thing happened to Hisense https://old.reddit.com/r/Hisense/comments/18xnmz9/the_latest...

Samsung phones: https://www.androidcentral.com/phones/galaxy-s10-phones-smar...

MattGaiser · 2025-03-14T20:00:10 1741982410

The answer seems to be that things get tested, but the results often get ignored.

agilob · 2025-03-14T20:13:07 1741983187

Human error, don't worry, we will be getting rid of these pesky humans soon

mikepurvis · 2025-03-14T19:51:21 1741981881

They almost certainly do, but there are always ways that the test jig differs from the units in the field, for example:

- The test jig is probably pristine, so no hundreds of hours of telemetry data clogging up the internal storage.

- The test jig might be on ethernet whereas a lot of users would be using wifi.

- The test jig probably targets specific A -> B upgrades rather than testing progressive upgrade across every version that's ever existed.

- The test jig can't cover every permutation of config options.

- The test jig probably only does a bare minimal smoke test after the install, so if the problem takes a bit to kick in, it might not show up.

Not to say that it's certainly any of these, but all are possible contributors. In the coming days it'll become clearer what particular pattern the affected devices follow, and/or clever people with JTAG dongles will reverse engineer the problem and spill the beans.

Y_Y · 2025-03-14T20:24:06 1741983846

The test jig should be in expected conditions. We have simulated tests, and we have tests that run on the devices on my desk, but we also have a real world setup for consumer devices in a separate building that could be mistaken for the real deployment environment. That's not feasible for every company, but it's certainly feasible for Samsung. It doesn't mean you'll catch everything, but it does address some of your points.

mikepurvis · 2025-03-14T20:40:06 1741984806

There's no question about what it should be, but without technical leadership up the chain that understands and insists on this, it's easy to see how it could atrophy over time with cuts and staff turnover.

Like once upon a time, someone established a lab with twenty different units in different states, and put in place a process for validating the releases on it, but that person is long gone, and parts of the lab haven't worked quite right in years, but the parts that do still give a green checkmark, and who wants to stick their neck out and block a release over some baroque process no one even understands, right? It's not like the lab ever seems to really catch a major issue, does it? Just send a :ship: emoji to the Slack channel and wait to be assigned your next ticket in the sprint meeting.

sumedh · 2025-03-14T21:26:29 1741987589

You don't need a testing team when the users can do all the testing for you.

kkarpkkarp · 2025-03-14T19:52:04 1741981924

so what are the users for? /s