A 10x more complex redundant (or "redundant") system often breaks faster (and definitely stays down longer) than a simple direct system.
Many people just don't consider the failure scenarios. Offsite live database backups, for example, are a great idea. But say ... how does your site perform, as a percentage of normal QPS, when the database is suddenly 150 ms away instead of well under a millisecond? 1%? That's not redundancy, despite the site nominally being up; let's just call it a failure.
And people forget one thing about hosting on AWS. Say AWS is slow, has problems, is blocked by some firewall, or is down ... when your competitor is down, would you like your site to be up? How about vice versa?
The database had a fallback that, following good practice, was hosted with a different provider in a different city (a different country, actually, but this being Europe it wasn't that far in kilometres; it was, however, >100 ms away).
Because they had a really fast local database essentially all the time, every pageview came to require more and more database queries as the developers added features, some 50 for the front page alone.
Then the database needed to fail over. And the complexity hadn't actually killed it (yet): the failover actually worked ... but of course 50 × 2 × 150 ms = 15,000 ms, or 15 seconds per page.
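Spelled out (the ~0.5 ms local figure is an assumed baseline for comparison, not from the story):

```python
# The latency arithmetic from the anecdote above. The 150 ms remote round
# trip and ~100 round trips per page come from the story; the 0.5 ms
# "local" figure is an assumed same-datacenter baseline.
ROUND_TRIPS_PER_PAGE = 50 * 2   # ~50 queries, 2 round trips each
REMOTE_RTT_MS = 150             # fallback database >100 ms away
LOCAL_RTT_MS = 0.5              # assumed local latency

remote_page_ms = ROUND_TRIPS_PER_PAGE * REMOTE_RTT_MS   # 15,000 ms = 15 s
local_page_ms = ROUND_TRIPS_PER_PAGE * LOCAL_RTT_MS     # 50 ms

print(f"local page:  {local_page_ms:,.0f} ms")
print(f"remote page: {remote_page_ms:,.0f} ms")
print(f"throughput vs. normal: {local_page_ms / remote_page_ms:.2%}")
```

Database access that was effectively free locally ends up dominating everything once each round trip costs 150 ms.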
I'm saying simple things can be better even when they don't provide redundancy, because there's a whole class of problems where the added complexity is so great that you can literally fix the simple system by hand faster than the redundant one can recover on its own.
Your comment reads like a sales pitch for RDS. We have failover replicas in different geographically distributed datacenters. Failovers happen more or less instantly and the added latency (~0.5ms) is fine.
So for us this doesn't increase complexity, it greatly reduces it while increasing availability and general confidence, even though the underlying system (RDS/Aurora) is clearly very complex.
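To be concrete about why the complexity stays on Amazon's side: for a plain RDS instance the standby is a single provisioning-time flag, and the application keeps talking to the same endpoint. A rough sketch with boto3 (Aurora clusters use a slightly different call; all identifiers, sizes, and credentials below are made-up placeholders):

```python
# Rough sketch: a Multi-AZ RDS instance via boto3. MultiAZ=True is what buys
# the synchronous standby in another datacenter and automatic failover.
# All identifiers, sizes, and credentials here are placeholders.
import boto3

rds = boto3.client("rds", region_name="eu-west-1")

rds.create_db_instance(
    DBInstanceIdentifier="app-db",
    Engine="postgres",
    DBInstanceClass="db.m6g.large",
    AllocatedStorage=100,
    MasterUsername="app",
    MasterUserPassword="change-me",   # use a secrets manager in real life
    MultiAZ=True,                     # standby replica + automatic failover
)
```

On failover the instance's DNS endpoint flips to the standby, which is why the client side barely notices.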
If you're running a tinpot site with a single developer on a couple of pet servers, then fair enough. But it's definitely not correct to say that a simple, direct system is the epitome of reliability. It's not.
I don't understand this. Actually, I use SQLite linked into the site code itself. Very tough to beat on a whole host of metrics. Shared data gets distributed the same way configuration does.
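For anyone who hasn't tried the in-process approach, it looks roughly like this sketch (Python's bundled sqlite3 module; the file name and schema are made up): a "query" is a library call against a local file, so there's no connection to drop and nothing to fail over.

```python
# Minimal sketch of an in-process database: SQLite is linked into the
# application, so a "query" is a library call, not a network round trip.
# The file name and schema are made up for illustration.
import sqlite3

conn = sqlite3.connect("site.db")   # a local file; no server, no failover
conn.execute(
    "CREATE TABLE IF NOT EXISTS pages (slug TEXT PRIMARY KEY, body TEXT)"
)
conn.execute(
    "INSERT OR REPLACE INTO pages (slug, body) VALUES (?, ?)",
    ("home", "Hello, world"),
)
conn.commit()

(body,) = conn.execute(
    "SELECT body FROM pages WHERE slug = ?", ("home",)
).fetchone()
print(body)
conn.close()
```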
I don't run a tinpot site, but I am the sole developer managing a couple of web servers and services for my company. Our db failover works much like you described: maybe 0.5 ms of added latency, if that, regardless of physical location; our metrics put it closer to 0.1 ms.
Every quarter we test it, and without fail, it has worked. So I spent maybe a couple of hours of research and two minutes of additional configuration during setup in exchange for reliably fast failover. Seems like a no-brainer to me.
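A drill like that can be as small as timing a trivial query, forcing a failover, and timing it again. Here's a sketch assuming an RDS-style Multi-AZ instance and psycopg2 (the DSN and instance identifier are placeholders; other providers expose an equivalent force-failover call):

```python
# Sketch of a minimal failover drill, assuming an RDS-style Multi-AZ
# instance and psycopg2. The DSN and instance identifier are placeholders.
import time
import boto3
import psycopg2

DSN = "host=app-db.example.internal dbname=app user=app"  # placeholder

def timed_select(dsn: str) -> float:
    """Time one trivial round trip to whatever node the endpoint points at."""
    start = time.perf_counter()
    with psycopg2.connect(dsn, connect_timeout=5) as conn:
        with conn.cursor() as cur:
            cur.execute("SELECT 1")
            cur.fetchone()
    return (time.perf_counter() - start) * 1000.0  # ms

print(f"before failover: {timed_select(DSN):.1f} ms")

# Force a failover to the standby; the endpoint's DNS flips to the new primary.
boto3.client("rds").reboot_db_instance(
    DBInstanceIdentifier="app-db",   # placeholder
    ForceFailover=True,
)
time.sleep(120)  # crude wait for the standby promotion and DNS to settle

print(f"after failover:  {timed_select(DSN):.1f} ms")
```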
When we moved to a managed instance with our cloud provider it took even less time to set up failover, maybe 30 seconds.
With the numerous cloud offerings these days, I see no reason not to have failover set up, whether you're a massive corporation or a small business.