A 10x more complex redundant (or "redundant") system often breaks faster (and definitely stays down longer) than a simple direct system.
Many people just don't consider the failure scenarios. Offsite live database backups, for example, are a great idea. But say ... how does your site perform, as a percentage of normal QPS, when the database is suddenly 150 ms away instead of well under a millisecond? 1%? That's not redundancy, despite the site nominally being up; let's just call it a failure.
And people forget one thing about hosting on AWS. Say AWS is slow, has problems, is blocked by some firewall, or is down ... when your competitor is down, would you like your site to be up? How about vice versa?
The database had a fallback that, following good practice, was hosted with a different provider in a different city (a different country, actually, but this being Europe it wasn't that far in kilometres; it was, however, >100 ms away).
Because they had a really fast local database essentially all the time, every pageview came to require more and more database queries as the developers added features, some 50 for the front page alone.
Then the database needed to fail over. And the complexity hadn't actually killed it (yet): the failover actually worked ... but of course 50 × 2 × 150 ms = 15,000 ms, or 15 seconds per page.
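Spelled out (the ~0.5 ms local figure is an assumed baseline for comparison, not from the story):

```python
# The latency arithmetic from the anecdote above. The 150 ms remote round
# trip and ~100 round trips per page come from the story; the 0.5 ms
# "local" figure is an assumed same-datacenter baseline.
ROUND_TRIPS_PER_PAGE = 50 * 2   # ~50 queries, 2 round trips each
REMOTE_RTT_MS = 150             # fallback database >100 ms away
LOCAL_RTT_MS = 0.5              # assumed local latency

remote_page_ms = ROUND_TRIPS_PER_PAGE * REMOTE_RTT_MS   # 15,000 ms = 15 s
local_page_ms = ROUND_TRIPS_PER_PAGE * LOCAL_RTT_MS     # 50 ms

print(f"local page:  {local_page_ms:,.0f} ms")
print(f"remote page: {remote_page_ms:,.0f} ms")
print(f"throughput vs. normal: {local_page_ms / remote_page_ms:.2%}")
```

Database access that was effectively free locally ends up dominating everything once each round trip costs 150 ms.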
I'm saying simple things can be better even when they don't provide redundancy, because there's a whole class of problems where the added complexity is so great that you can literally fix the simple system by hand faster than the redundant one can recover on its own.
Your comment reads like a sales pitch for RDS. We have failover replicas in different geographically distributed datacenters. Failovers happen more or less instantly and the added latency (~0.5ms) is fine.
So for us this doesn't increase complexity, it greatly reduces it while increasing availability and general confidence, even though the underlying system (RDS/Aurora) is clearly very complex.
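To be concrete about why the complexity stays on Amazon's side: for a plain RDS instance the standby is a single provisioning-time flag, and the application keeps talking to the same endpoint. A rough sketch with boto3 (Aurora clusters use a slightly different call; all identifiers, sizes, and credentials below are made-up placeholders):

```python
# Rough sketch: a Multi-AZ RDS instance via boto3. MultiAZ=True is what buys
# the synchronous standby in another datacenter and automatic failover.
# All identifiers, sizes, and credentials here are placeholders.
import boto3

rds = boto3.client("rds", region_name="eu-west-1")

rds.create_db_instance(
    DBInstanceIdentifier="app-db",
    Engine="postgres",
    DBInstanceClass="db.m6g.large",
    AllocatedStorage=100,
    MasterUsername="app",
    MasterUserPassword="change-me",   # use a secrets manager in real life
    MultiAZ=True,                     # standby replica + automatic failover
)
```

On failover the instance's DNS endpoint flips to the standby, which is why the client side barely notices.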
If you're running a tinpot site with a single developer on a couple of pet servers, then fair enough. But it's definitely not correct to say that a simple, direct system is the epitome of reliability. It's not.
I don't understand this. Actually, I use SQLite linked into the site code itself. Very tough to beat on a whole host of metrics. Shared data gets distributed the same way configuration does.
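For anyone who hasn't tried the in-process approach, it looks roughly like this sketch (Python's bundled sqlite3 module; the file name and schema are made up): a "query" is a library call against a local file, so there's no connection to drop and nothing to fail over.

```python
# Minimal sketch of an in-process database: SQLite is linked into the
# application, so a "query" is a library call, not a network round trip.
# The file name and schema are made up for illustration.
import sqlite3

conn = sqlite3.connect("site.db")   # a local file; no server, no failover
conn.execute(
    "CREATE TABLE IF NOT EXISTS pages (slug TEXT PRIMARY KEY, body TEXT)"
)
conn.execute(
    "INSERT OR REPLACE INTO pages (slug, body) VALUES (?, ?)",
    ("home", "Hello, world"),
)
conn.commit()

(body,) = conn.execute(
    "SELECT body FROM pages WHERE slug = ?", ("home",)
).fetchone()
print(body)
conn.close()
```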
I don't run a tinpot site, but I am the sole developer managing a couple of web servers and services for my company. Our db failover works much like you described: maybe 0.5 ms of added latency, if that, regardless of physical location; our metrics put it closer to 0.1 ms.
Every quarter we test it, and without fail, it has worked. So I spent maybe a couple of hours of research and two minutes of additional configuration during setup in exchange for reliably fast failover. Seems like a no-brainer to me.
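A drill like that can be as small as timing a trivial query, forcing a failover, and timing it again. Here's a sketch assuming an RDS-style Multi-AZ instance and psycopg2 (the DSN and instance identifier are placeholders; other providers expose an equivalent force-failover call):

```python
# Sketch of a minimal failover drill, assuming an RDS-style Multi-AZ
# instance and psycopg2. The DSN and instance identifier are placeholders.
import time
import boto3
import psycopg2

DSN = "host=app-db.example.internal dbname=app user=app"  # placeholder

def timed_select(dsn: str) -> float:
    """Time one trivial round trip to whatever node the endpoint points at."""
    start = time.perf_counter()
    with psycopg2.connect(dsn, connect_timeout=5) as conn:
        with conn.cursor() as cur:
            cur.execute("SELECT 1")
            cur.fetchone()
    return (time.perf_counter() - start) * 1000.0  # ms

print(f"before failover: {timed_select(DSN):.1f} ms")

# Force a failover to the standby; the endpoint's DNS flips to the new primary.
boto3.client("rds").reboot_db_instance(
    DBInstanceIdentifier="app-db",   # placeholder
    ForceFailover=True,
)
time.sleep(120)  # crude wait for the standby promotion and DNS to settle

print(f"after failover:  {timed_select(DSN):.1f} ms")
```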
When we moved to a managed instance with our cloud provider it took even less time to set up failover, maybe 30 seconds.
With the numerous cloud offerings these days, I see no reason not to have failover set up, whether you're a massive corporation or a small business.