When you give responsibility to the teams themselves, the result is O(1)-sized problems becoming O(teams)-sized problems.
> How can I check the health of the service?
In the service definition, you define a field for a health check script.
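A minimal sketch of what that could look like; the field names and the /healthz endpoint here are made up for illustration, not from the article:

```python
import subprocess

# Hypothetical service definition; "health_check" holds a command that
# exits 0 when the service is healthy.
SERVICE = {
    "name": "CoolAppServerName",
    "health_check": "curl -fsS http://localhost:8080/healthz",
}

def is_healthy(service) -> bool:
    """Run the service's declared health check and report the result."""
    result = subprocess.run(service["health_check"], shell=True)
    return result.returncode == 0
```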
> How can I safely and gracefully restart the service?
This will exist within the script used to push new code.
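As a sketch, the restart step inside that push script might look something like this (the load-balancer commands, systemd unit, and field names are all assumptions, not from the post):

```python
import subprocess
import time

def graceful_restart(service):
    # Take the instance out of rotation first (command is an assumption).
    subprocess.run(service["remove_from_lb_cmd"], shell=True, check=True)

    # Give in-flight requests time to finish.
    time.sleep(service.get("drain_seconds", 30))

    # Restart under the init system (a systemd unit is assumed here).
    subprocess.run(["systemctl", "restart", service["systemd_unit"]], check=True)

    # Wait until the declared health check passes before re-adding the instance.
    while subprocess.run(service["health_check"], shell=True).returncode != 0:
        time.sleep(1)
    subprocess.run(service["add_to_lb_cmd"], shell=True, check=True)
```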
> Does it have any external dependencies?
This could be defined in the service configuration and
used for setting up integration tests and automatically
generating a dependency dashboard.
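A sketch of how one declared dependency list could feed both uses (all names invented):

```python
# Hypothetical config: dependencies are declared alongside the service.
SERVICE = {
    "name": "CoolAppServerName",
    "dependencies": ["user-db", "auth-service", "payments-api"],
}

def dependency_edges(services):
    """Yield (service, dependency) edges for a generated dependency dashboard."""
    for svc in services:
        for dep in svc.get("dependencies", []):
            yield svc["name"], dep

def integration_test_targets(service):
    """The same list tells the test harness what to spin up (or fake)."""
    return list(service.get("dependencies", []))
```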
> Do you have a playbook, or sequence of steps, to bring
the service back up?
You could add a field to the service definition, automatically generate a dashboard from it, and include the playbook link at the top of the page.
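For instance (purely illustrative field name and URL):

```python
# Hypothetical: a "playbook" field used when the dashboard page is generated.
SERVICE = {
    "name": "CoolAppServerName",
    "playbook": "https://wiki.example.com/CoolAppServerName/playbook",
}

def dashboard_header(service) -> str:
    # The playbook link is rendered at the top of the generated dashboard page.
    return (f"<h1>{service['name']}</h1>\n"
            f"<a href=\"{service['playbook']}\">Incident playbook</a>")
```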
> Do you use appropriate logging levels depending on the
environments?
Production could be extremely opinionated about what acceptable logging looks like, enforced via code review. The log level could be defined in the service config.
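A sketch of the kind of check that enforcement could hang on (the allowed set is an opinionated example, not a standard):

```python
# Hypothetical presubmit/config-validation check.
ALLOWED_PROD_LEVELS = {"WARNING", "ERROR"}

def check_log_level(service, environment):
    level = service["log_level"][environment]
    if environment == "prod" and level not in ALLOWED_PROD_LEVELS:
        raise ValueError(
            f"{service['name']}: log level {level!r} is not allowed in prod")
```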
> Are you logging to stdout?
Why would any production service get to choose?
Service owners shouldn't be able to log into machines.
> Are you measuring the RED signals?
Required fields in service config that could be used to
generate a service dashboard.
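For example, the config could require one metric name per RED signal (Rate, Errors, Duration), and the dashboard panels fall out of it; the metric names below are invented, reusing the naming style from later in this thread:

```python
# Hypothetical required fields for the RED signals (Rate, Errors, Duration).
SERVICE = {
    "name": "CoolAppServerName.prod",
    "metrics": {
        "rate": "CoolAppServerName.prod.requests_per_second",
        "errors": "CoolAppServerName.prod.5xx",
        "duration": "CoolAppServerName.prod.latency_percentiles",
    },
}

def dashboard_panels(service):
    """One graph per RED signal, generated straight from the config."""
    return [{"title": signal, "query": metric}
            for signal, metric in service["metrics"].items()]
```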
> Is there any documentation/design specification for the
service?
Required config field.
> Are you using gRPC or REST?
Trivial grep.
> How does the data flow through the service?
This is complicated, but it can probably be replaced by asking what state your service keeps and how it's stored. This is the only question I think the author actually needs to ask.
> Do you have any PII/Sensitive data flowing through the
service?
While this question is important, this is one of the
problems that has to be a particular person's
responsibility. Any dev that answers anything but
"probably not, but I don't know" shouldn't be trusted.
> What is the testing coverage for this service?
Some form of this would exist in a service config.
I don't think the question of responsibility is as simple as "it's the team's problem."
And then you could generate a webpage with a dropdown where "CoolAppServerName.prod" is an option, and a dashboard including graphs for the time series metrics "CoolAppServerName.prod.5xx" and "CoolAppServerName.prod.latency_percentiles" automatically shows up. Maybe instead of service names in the dropdown you have owner names.
You could potentially write some code that validates there are no significant changes in those metrics and use it to automatically verify that newly pushed code didn't take down the website.
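A rough sketch of that check, assuming some fetch_series() helper for whatever time-series store you use (the helper, windows, and threshold are all invented):

```python
def looks_healthy_after_push(metric, fetch_series,
                             window_minutes=15, max_ratio=2.0):
    """Compare an error metric just after a push to the window just before it."""
    before = fetch_series(metric, offset_minutes=window_minutes,
                          duration_minutes=window_minutes)
    after = fetch_series(metric, offset_minutes=0,
                         duration_minutes=window_minutes)
    baseline = max(sum(before) / max(len(before), 1), 1e-9)
    current = sum(after) / max(len(after), 1)
    # Block or roll back the push if the error rate roughly doubled.
    return current / baseline <= max_ratio

# e.g. looks_healthy_after_push("CoolAppServerName.prod.5xx", my_fetcher)
```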
Service config means creating an authoritative service identifier (authoritative because it's the only identifier used in tooling) and then attaching a configuration to it.
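Pulling the pieces above together, a sketch of what such a config could look like; every field name is illustrative, not any particular company's schema, and all the other tooling would key off the one service_id:

```python
from dataclasses import dataclass, field

@dataclass
class ServiceConfig:
    service_id: str                          # the single authoritative identifier
    owner: str
    health_check: str                        # command that exits 0 when healthy
    dependencies: list[str] = field(default_factory=list)
    playbook_url: str = ""
    docs_url: str = ""
    log_level: dict[str, str] = field(default_factory=dict)   # per environment
    metrics: dict[str, str] = field(default_factory=dict)     # RED signal names
    min_test_coverage: float = 0.0

REGISTRY = {
    "CoolAppServerName.prod": ServiceConfig(
        service_id="CoolAppServerName.prod",
        owner="cool-team",
        health_check="curl -fsS http://localhost:8080/healthz",
        dependencies=["user-db"],
        playbook_url="https://wiki.example.com/CoolAppServerName/playbook",
        docs_url="https://wiki.example.com/CoolAppServerName/design",
        log_level={"prod": "WARNING", "dev": "DEBUG"},
        metrics={"errors": "CoolAppServerName.prod.5xx"},
        min_test_coverage=0.8,
    ),
}
```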
Facebook and Google have (or at least at some point had) Tupperware and Borg respectively, which are basically custom versions of the above, extended for their infrastructures.
When Kubernetes was released, it was thought it would be a successor to Borg, if not the key components of Borg itself, IIRC. https://en.wikipedia.org/wiki/Kubernetes:
> The design and development of Kubernetes was influenced by Google's Borg cluster manager. Many of its top contributors had previously worked on Borg;[15][16] they codenamed Kubernetes "Project 7" after the Star Trek ex-Borg character Seven of Nine[17] and gave its logo a seven-spoked wheel.
There was a lot of early skepticism about it because it was not Borg. My understanding is that Borg is so integrated into Google tooling that it would have been impossible to generalize.
I haven't used it myself yet because a few of the senior engineers (from Google/FB) I respect said "absolutely not in our infra" and went with a completely bespoke solution instead. I was too busy, and too inexperienced with Kubernetes, to dig in and have that conversation.
IIRC the main criticisms were that it wouldn't scale to our needs and that some of our use cases wouldn't be handled easily by Kubernetes. The end result would be two different solutions for the same problem: a slow migration to Kubernetes that may or may not stall out, and then a half-finished/perpetual migration that would double support costs.
"> Do you have any PII/Sensitive data flowing through the
service?
While this question is important, this is one of the
problems that has to be a particular person's
responsibility. Any dev that answers anything but
"probably not, but I don't know" shouldn't be trusted."
GDPR makes it the responsibility of the organisation to know. You can't safely say "I don't know" about PII.
And if an organization wants to know, then they must make a single individual responsible. "Organizational responsibility" means that no one is responsible.
It is important to have one person know the answer, rather than making your devs "guess" the answer. "The devs we asked said there wasn't misuse of PII" is not at all a good guarantee that PII is not abused or lost.
The organization cannot know unless there is an individual who knows.