I think you can read about some of these changes in Google's SRE and SWE books (even if they don't mention this video in particular), at least the ones most likely to be interesting to someone outside Google.
But dropping Borgmon readability was the most immediate and obvious. It was basically true that no one had Borgmon readability. The policy was a catch-22: you couldn't get readability for the simple/formulaic Borgmon macro invocations that were encouraged and often sufficient. You could only get it for doing something "clever". I got it by writing fancy Borgmon rules to paper over a problem that (in hindsight) I should have solved elsewhere.
Another was easing quota management. IMHO the most unbelievable thing in the video was that after Broccoli Man told Panda Woman to get quota in two cells, she just said "done". Besides the hassle of transcribing what you needed into the request system [1], various types of quota were chronically unavailable where you needed them, even in tiny amounts. In 2010, I kept a critical infrastructure service running by regularly IMing major clients' on-calls asking them to donate 0.1 CPU(!) of their quota in some cell or another when I didn't have quite enough to grow. There was a "gray market" mailing list where people would trade resources they couldn't get through the primary system. But eventually, they built a system that would make the quota just happen for small services.
Overall, it was a kick in the pants for the most basic infrastructure teams that made them see how unnecessarily hard this was for their internal customers, prompting them to make small things just happen while keeping large things possible. In any large organization, it's healthy to get this kind of feedback regularly. The actual changes and technologies are pretty specific to Google in 2010...
[1] Many people managed this very tediously with spreadsheets. I eventually wrote a tool to generate the requests by comparing your intended production config with your current quota.
Every single buzzword uttered in the video has a cloud equivalent except for PCR. These days people just give up on regional (even zonal) outages, take the downtime, and blame their cloud provider in the RCA, so it doesn't matter. The big thing I think is missing in this thread is just how enormous Google infra was even in 2010, so these problems were, and still are, sort of unique.
I think I may have used your tool at some point. There were times in Geo where we'd run out of Bigtable quota and I needed to hit the "emergency loan" button to keep Maps from going down globally.
That's why you had these problems: you should have just let Maps go down globally and then pointed the finger at the appropriate target to blame for it.
I'm only half joking about that too. The other half is that sometimes it is better to let the system collapse under its own weight when that's what is required to convince the right people the system is broken.
Production priority quota horse trading in the days before it was easy was a real skill. But non-production quota was free and virtually infinite, even in those days.
Or converting between priority levels or users that were under your product. If you ran a large enough service, there was also begging your users for spare quota, or warning them that their recent donation couldn't be fully deployed because a necessary change in some secondary borgcfg template had eaten into the service's resources, so it would help everybody if they knew anyone with a few cores lying around...
By the way, do you happen to be the jwb that filed a... creative BT quota request?
Hah, I was the one that played along and made you keep going ("May you be gifted with a long, healthy life and an exquisite taste in ties"?). And I think SAD was one of the teams that occasionally helped us find resources in this or that cluster, along with Analytics and, later, Social.
I seem to remember hearing prod quota disputes described as “monkey knife fights” by SRE. SWE would joke about using corp credit cards on EC2 instances rather than waiting on the outcome.
Thing is, though, something equally ridiculous happens at every large company and at more than a few small companies. Worse in government. The important thing is how they react when it's pointed out.