Google SREs are pretty good, but there is disconnect between product and sre (somewhat understandably). SREs will always take steps to mitigate issue. rolling back binaries, mlts, gdps, you name it. Sometimes, it's actually easier to fix the problem, but SRE doesn't know it. It's understandable, they manage many many projects, and they can't know all the products. It's just something i found very interesting.
Well, I think it's less of a disconnect than a difference in priority. SRE's first priority is "stop the bleeding" -- take whatever immediate action you can to stop users from being hurt. That might mean rolling back the binary, reverting a data push, draining away from a broken cell, whatever. When you're serving thousands or millions of QPS, time is of the essence.
That being said, SRE does want to ultimately fix the problem (otherwise it's just going to page again, right?). But if that means tracking down a wrong config flag, cherry-picking a fix into a new release, etc. -- those are all things that can be done AFTER the bleeding is stopped.
We run factories and if we had a bad deploy bring a factory down, it's not going to get "more broken", so we can push a fix-attempt change live as soon as it's ready.
Abstractly, we got pushback from QA about this policy. After we had gathered a couple of concrete examples, it was clear that QA-as-gatekeeper when the factory is already in the worst possible state wasn't valuable. We do mandate the normal reviews but allow them after deployment. (You can imagine the conversations with the auditors about this as well, so we had to carefully document that this was our process and made the auditors audit our conformance to the process not to their own preconception of what it should be.)
What's the right role for someone who wants to deeply know some products enough to fix the code the right way, but doesn't want to be a dev? (Be a dev somewhere where maintenance is as valued as creation?)
I think SRE is what you are looking for. They typically know the product pretty well. The reason why they rollback rather than fix the bug immediately is that they want the outage to by fixed in minutes. Even if you spotted the bug immediately you would not have enough time to do the build, let alone run the tests.
I think that's about right. I think I just started doing work that sounds very much like SRE work to me: I'm building a CI pipeline, E2E tests and "Dockerizing" an existing Java-based project management product (currently only deployed as SAAS, but on-site deployment is in the backlog).
I'm trying to fully automate the testing side of the product, while making the process transparent enough to be amenable to manual intervention/quick tweaking.
After that, I'm hoping to move to automating the deployments, putting the server behind a load balancer, rollbacks, backup testing, all that good stuff that makes sure things only break where it can't hurt. Luckily the product is already pretty stable with the current dev/dogfooding-as-staging/prod model.
It's the most enjoyable work I've had so far. I think it mostly boils down to:
* I have clearly defined tasks, which I mostly plan in/negotiate with the product owner myself, so I have a large share of "ownership" of the dev/QA infrastructure improvements
* I work fully remotely and part-time, which gives me plenty of free time to socialize and decompress (we mainly communicate via Slack). I also have the option of working more hours, but I already doubled my
* I'm not currently on the critical path, so work feels low-stress
* I don't have to deal with under-defined business logic and product owners that do not want to commit to specifying (the product owner has transitioned from building the Java software to managing and subcontracting it, so is very knowledgeable about the product, and besides he's a great guy)
* I'm learning the tooling around the product through automating its development, testing and deployment (vs. learning it through adding crufty new features to it in a completely un-repeatable way, I'm looking at you never-again crashing Visual Studio Community and randomly-failing-builds Xamarin Forms).
Technical solutions engineering (TSE) may fit the bill, especially for more mature products. Think support on steroids, where you're empowered to fix the customer's problem.
Obligatory disclaimer: I'm one, and we're hiring ;)
Obvious PS: Google employee.