
Well, I think it's less of a disconnect than a difference in priority. SRE's first priority is "stop the bleeding" -- take whatever immediate action you can to stop users from being hurt. That might mean rolling back the binary, reverting a data push, draining away from a broken cell, whatever. When you're serving thousands or millions of QPS, time is of the essence.

That being said, SRE does want to ultimately fix the problem (otherwise it's just going to page again, right?). But if that means tracking down a wrong config flag, cherry-picking a fix into a new release, etc. -- those are all things that can be done AFTER the bleeding is stopped.

Source: I'm an SRE



In one case I was involved in, the issue still had not been found after 30 minutes, even after SRE had rolled back most of the systems.

Reproducing the issue resulted in an immediate fix by the SWE.

Again, I understand why it is the way it is; it is just really interesting to see how specialized each engineer is in the grand scheme of things.


An immediate fix by SWE can only be released after it is tested and canaried, so it's not really "immediate" most of the time.


We run factories, and if a bad deploy brings a factory down, it's not going to get "more broken", so we can push a fix-attempt change live as soon as it's ready.

In the abstract, we got pushback from QA about this policy. After we had gathered a couple of concrete examples, it was clear that QA-as-gatekeeper wasn't valuable when the factory is already in the worst possible state. We do mandate the normal reviews but allow them after deployment. (You can imagine the conversations with the auditors about this as well, so we had to carefully document that this was our process and had the auditors audit our conformance to the process, not to their own preconception of what it should be.)


That is not true. Production fires skip canaries in a sizable number of cases.

It's especially not true when a large amount of money is at stake (like ads).

Edit: last sentence.



