Google SREs are pretty good, but there is a disconnect between product and SRE (so...

packetslave · on Jan 28, 2017

Well, I think it's less of a disconnect than a difference in priority. SRE's first priority is "stop the bleeding" -- take whatever immediate action you can to stop users from being hurt. That might mean rolling back the binary, reverting a data push, draining away from a broken cell, whatever. When you're serving thousands or millions of QPS, time is of the essence.

That being said, SRE does want to ultimately fix the problem (otherwise it's just going to page again, right?). But if that means tracking down a wrong config flag, cherry-picking a fix into a new release, etc. -- those are all things that can be done AFTER the bleeding is stopped.

Source: I'm an SRE
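
As a rough illustration of the ordering described above (mitigate first, fix the root cause afterwards), here is a minimal Python sketch. It is not any real SRE tooling: the `Incident` class, the `respond` function, and the `try_mitigation` callback are hypothetical names made up for this example, and the action strings are simply the ones mentioned in the comment.

```python
from dataclasses import dataclass, field

# Hypothetical action names lifted from the comment; not a real tool's API.
MITIGATIONS = ("roll back binary", "revert data push", "drain broken cell")
FOLLOW_UPS = ("track down wrong config flag", "cherry-pick fix into new release")


@dataclass
class Incident:
    user_impact: bool = True               # True while users are still being hurt
    log: list = field(default_factory=list)


def respond(incident, try_mitigation):
    """Try mitigations until user impact stops, then queue root-cause work.

    `try_mitigation` is an assumed callback: it takes an action name and
    returns True if that action stopped the user-facing impact.
    """
    for action in MITIGATIONS:
        if not incident.user_impact:
            break
        incident.log.append(("mitigate", action))
        if try_mitigation(action):
            incident.user_impact = False

    # Root-cause work is scheduled only AFTER the bleeding has stopped.
    if not incident.user_impact:
        for task in FOLLOW_UPS:
            incident.log.append(("follow-up", task))
    return incident
```

For example, `respond(Incident(), lambda a: a == "revert data push")` would log two mitigation attempts, skip the drain, and only then queue the follow-up tasks.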

tehlike · on Jan 28, 2017

One of the cases I was involved in was one where the issue had still not been found after 30 minutes, after SRE had rolled back most of the systems.

Reproducing the issue resulted in an immediate fix by the SWE.

Again, I understand why it is the way it is; it is just really interesting to see how specialized each engineer is in the grand scheme of things.

general_ai · on Jan 28, 2017

Immediate fix by SWE can only be released after it is tested and canaried, so it's not really "immediate" most of the time.

sokoloff · on Jan 28, 2017

We run factories and if we had a bad deploy bring a factory down, it's not going to get "more broken", so we can push a fix-attempt change live as soon as it's ready.

Abstractly, we got pushback from QA about this policy. After we had gathered a couple of concrete examples, it was clear that QA-as-gatekeeper when the factory is already in the worst possible state wasn't valuable. We do mandate the normal reviews but allow them after deployment. (You can imagine the conversations with the auditors about this as well, so we had to carefully document that this was our process and made the auditors audit our conformance to the process, not to their own preconception of what it should be.)

tehlike · on Jan 28, 2017

That is not true. Production fires tend to skip canaries in a sizable number of cases.

It's especially not true when a large amount of money is at stake (like ads).

Edit: last sentence.

geofft · on Jan 28, 2017

What's the right role for someone who wants to deeply know some products enough to fix the code the right way, but doesn't want to be a dev? (Be a dev somewhere where maintenance is as valued as creation?)

rumcajz · on Jan 28, 2017

I think SRE is what you are looking for. They typically know the product pretty well. The reason why they roll back rather than fix the bug immediately is that they want the outage to be fixed in minutes. Even if you spotted the bug immediately, you would not have enough time to do the build, let alone run the tests.

willemmali · on Jan 28, 2017

I think that's about right. I think I just started doing work that sounds very much like SRE work to me: I'm building a CI pipeline and E2E tests, and "Dockerizing" an existing Java-based project management product (currently only deployed as SaaS, but on-site deployment is in the backlog).

I'm trying to fully automate the testing side of the product, while making the process transparent enough to be amenable to manual intervention/quick tweaking.

After that, I'm hoping to move to automating the deployments, putting the server behind a load balancer, rollbacks, backup testing, all that good stuff that makes sure things only break where it can't hurt. Luckily the product is already pretty stable with the current dev/dogfooding-as-staging/prod model.

It's the most enjoyable work I've had so far. I think it mostly boils down to:

* I have clearly defined tasks, which I mostly plan in/negotiate with the product owner myself, so I have a large share of "ownership" of the dev/QA infrastructure improvements

* I work fully remotely and part-time, which gives me plenty of free time to socialize and decompress (we mainly communicate via Slack). I also have the option of working more hours, but I already doubled my

* I'm not currently on the critical path, so work feels low-stress

* I don't have to deal with under-defined business logic and product owners that do not want to commit to specifying (the product owner has transitioned from building the Java software to managing and subcontracting it, so is very knowledgeable about the product, and besides he's a great guy)

* I'm learning the tooling around the product through automating its development, testing and deployment (vs. learning it through adding crufty new features to it in a completely un-repeatable way, I'm looking at you never-again crashing Visual Studio Community and randomly-failing-builds Xamarin Forms).

_0nac · on Jan 28, 2017

Technical solutions engineering (TSE) may fit the bill, especially for more mature products. Think support on steroids, where you're empowered to fix the customer's problem.

Obligatory disclaimer: I'm one, and we're hiring ;)

tehlike · on Jan 29, 2017

Why do you not want to be a dev?