> when they should be able to kick the app over the wall and let infra ensure that the app is deployed in two separate PCR zones with the failover plan etc, which should itself be mostly automated
Not entirely - the developers should actively participate in designing the actual failover scenario and making sure the application can handle that (anything from being okay with some downtime due to the failover happening to designing an actual multi-region multi-master application). Making assumptions like 'infra will handle it' is a great way to not only get unexpected outages (because the developers assumed there would be no downtime because failover is magic, or that writes will never be lost) but to also introduce tensions between teams (because you now have an outside team having to wrangle an application into reliability when the original authors don't give a crap about it).
I get and agree with your point, the tooling and processes should definitely be simplified/automated when possible, and developers deserve a platform that just works. The whole point of a platform team is to abstract away the mundane to let people do their job. But reliability is everyone's job, not just the infra team's, and developers must understand the tradeoffs and technology involved in order to not design broken systems.
A) It's doing a horrible job conveying it. A dev does need to be concerned with how to handle failover, but only at a certain abstraction level. They should be required to specify something in the form "given server A fails and has to pass to B, what do you do?" That does not require knowing the terminology around PCRs, how to make decisions about which cells (or whatever) to pick on deployment, or avoiding the "gotcha" of making sure the two servers are in different PCR zones.
At that point, it's just following a checklist that needs no knowledge of the specifics of the app, and, to the extent that it's accurately representing how Google was, is indicative of bad processes.
B) Many things should be infra's job, as they're cleanly orthogonal to what devs are doing. For example, how to apply a security patch to a DB. That's unrelated to the operation of the app.
I do get your point though, and I wouldn't say something like this about e.g. testing (which was the short, "reasonable" part of the video!) -- the devs have intimate knowledge of what counts as passing and failing and should be writing tests, not hand 100% of it over to QA. But that's precisely because such concerns are deeply tied to the thing they are responsible for. "SQL 3.4.1 vs 3.4.2" is not.
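To make the abstraction-level point concrete, here's a minimal sketch (all names hypothetical, not any real platform's API) of the split I mean: the dev declares failover *behavior*, and the platform turns that into placement decisions like "two distinct failure domains" without the dev ever learning PCR terminology:

```python
from dataclasses import dataclass

@dataclass
class FailoverSpec:
    """What the developer declares: behavior, not placement."""
    max_downtime_seconds: int    # 0 means no downtime tolerated at all
    tolerate_lost_writes: bool   # may in-flight writes be dropped on failover?

def plan_deployment(spec: FailoverSpec) -> dict:
    """Hypothetical platform side: maps declared behavior to a topology.
    Zone/cell selection stays the platform's job, not a dev 'gotcha'."""
    if spec.max_downtime_seconds == 0:
        topology = "multi-region multi-master"
    else:
        topology = "active-passive with automated failover"
    return {
        "topology": topology,
        # Placing replicas in distinct failure domains is enforced here,
        # so no checklist item about "different PCR zones" is needed.
        "distinct_failure_domains": 2,
    }

print(plan_deployment(FailoverSpec(max_downtime_seconds=0,
                                   tolerate_lost_writes=False)))
```

The dev still had to answer the hard question (can we lose writes? how much downtime is okay?), which is exactly the part that can't be outsourced.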