
On the flip side, a missing index can bring down production.

We experienced that just a couple of weeks ago, when a missing index and an unexpected 1000x increase in volume at one customer brought the DB to its knees. Sure, it was still serving queries, but at such a low rate that it was effectively useless.

For queries that will be run more than once, I try to make sure there's an index they can use on something fairly unique.
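A minimal sketch of what I mean, with made-up table and column names: index the selective columns the hot query filters and sorts on, then check that the planner actually uses it.

    -- Hypothetical hot query: a customer's most recent orders.
    SELECT id, status, created_at
      FROM orders
     WHERE customer_id = 42
     ORDER BY created_at DESC
     LIMIT 20;

    -- Composite index matching both the filter and the sort,
    -- so the planner can avoid a table scan entirely.
    CREATE INDEX idx_orders_customer_created
        ON orders (customer_id, created_at DESC);

    -- Confirm with EXPLAIN ANALYZE that the index is used.
    EXPLAIN ANALYZE
    SELECT id, status, created_at
      FROM orders
     WHERE customer_id = 42
     ORDER BY created_at DESC
     LIMIT 20;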



I used to be one of the "keep the database running" people at a high frequency trading firm. So, super high pressure; any outage meant we were potentially hemorrhaging money at a brain-melting pace.

My lessons from that experience were twofold. First, you can't plan for sudden performance regressions ahead of time. The number of stars that have to align for that anticipatory index to actually help you is just too great. Even if you correctly guessed that a given table might be affected by one in the future, you'll never guess what fields the index needs to include, and in which order, to support whatever new query or plan change caused the problem. Second, Murphy's Law dictates that any attempt at preventing future read performance regressions by speculatively creating an index will end up causing a write performance problem instead.

Better instead to just get in the habit of periodically reviewing query plans for every query in the database. If you know the scaling characteristics of the various operations and the kinds of statistics changes that are likely to cause the optimizer to choose different ones (and you should), then it's easy enough to select for better scaling characteristics. This is, incidentally, an excellent excuse for making the engineers use sprocs or a clean data access layer or something instead of a heavyweight ORM, so that you have some sensible way of getting a warning before they change queries on you. You can't effectively aim for a target that's bouncing around erratically. Even better still - and only realistically feasible if you somehow managed to win that bigger war I just proposed - set up a tool to monitor query plans and send you an alert when they change unexpectedly, so you can hear about the problem while it's still small.
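For what it's worth, in Postgres a crude version of that monitoring is the built-in auto_explain module, which logs the plan of anything slower than a threshold, so plan changes surface in the logs before they become outages (the threshold here is arbitrary):

    -- Load for this session; for all sessions, add auto_explain
    -- to shared_preload_libraries in postgresql.conf.
    LOAD 'auto_explain';

    -- Log the full plan of any query slower than 250 ms.
    -- (log_analyze gathers real row counts but adds overhead.)
    SET auto_explain.log_min_duration = '250ms';
    SET auto_explain.log_analyze = true;

A log watcher diffing those plans against known-good ones gets you the "alert when they change unexpectedly" part.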


Are you just trying to create employment for yourself? This is amazingly bad advice. I literally get hired to fix these setups by people who take this approach.

Yes, it's good to periodically review query plans.

But designing and planning for data retrieval is part of the design process. And it's easy. Even Django has let you define indexes inside your models since 2003. Got a 100m-record table and you're adding a column you want to query by iteratively? Roll out an index. This is normal day-to-day software development...
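On Postgres, for instance, that rollout is one statement even against a live table of that size (names invented here); CONCURRENTLY builds the index without blocking writes, at the cost of a slower build:

    -- No write lock on the table; safe against live traffic.
    CREATE INDEX CONCURRENTLY idx_events_account_id
        ON events (account_id);

    -- A failed concurrent build leaves an INVALID index behind;
    -- worth checking before relying on it.
    SELECT indexrelid::regclass FROM pg_index WHERE NOT indisvalid;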

The write overhead is wild speculation. 99% of businesses are read-heavy.


It sounds like he's thinking of it from the DBA perspective, where they have to react to sudden changes in behavior from devs with little or no warning - since the devs don't talk to them.

DBAs doing proactive index creation when they don't know what the devs are doing is indeed futile.

The devs, however, should definitely be doing proactive database design/planning whenever they do anything, since they can (and will!) cause emergencies and downtime by not considering the impact of how they are interacting with/using the database.

If the devs are directly writing SQL, this is also relatively easy to get them into doing. If they're using a heavyweight ORM, it's sometimes nearly impossible to figure out what SQL it's going to run (and difficult to even trigger in a test), and so the devs often won't even try.


Exactly this.

The company had actually banned new uses of ORM and was in the process of eliminating existing usage of it while I was there. We had discovered that teams that used ORM had much higher production incident rates than teams that didn't, and it was fairly directly attributable to lack of understanding and predictability of what was happening in the database.

Maybe not a huge deal if you're in a low-load situation. But HFT requires the A game at all times, because you never know when some exciting news might cause trading volume to increase 50-fold almost instantaneously.

For the record, I was a dev and not a DBA. But I did work closely with the DBAs. And I was pretty irritated when I found out ORM was banned, because it definitely made the "writing new code" part of the job more laborious. But, hey, learning experience - it turns out that it was the right move. In the long run we were able to move faster once we stopped having to deal with the blowback from breaking things quite so often.

It's a little bit like when I play Mario Kart with my kids. Why do I go so much faster than them? Mostly because they are constantly pushing hard on the accelerator button, while I ease off the gas and even hit the brakes sometimes. They think speed management is annoying. I think that I spend a lot less time bouncing off of walls and giving precious coins to Lakitu.


Fortunately we're small enough that we devs are effectively the DBAs as well. Meaning, we write the queries and we maintain the DB schema.

I can indeed imagine life being a lot different if you're on the receiving end of an unknown barrage of queries.


Was this because of a missing index, or because the optimizer vomited on the queries without an index to provide it statistics? My gripe with RDBs is that the query optimizer is generally non-deterministic with respect to the query, i.e., it can make a great decision with certain statistics and flip to a terrible one with slightly different statistics, even though the original query would have performed basically the same.

I’d rather have a database whose query performance degrades gracefully and predictably with scale than some black-magic mumbo jumbo that wakes me up in the middle of the night because its puny little statistical pea brain couldn't compute a bounded expectation any more.


I really want the optimizer to make an estimate of the CPU/IO to complete a query. Then, during the query, if that much effort has been expended and we haven't yet completed the query, then update the estimate. If the updated estimate now shows that the query plan is no longer quickest, then abort the query and restart with a different plan.

Years ago I forked postgres and tried to implement the above. Initial results were promising - a good chunk of queries ended up taking a different plan, sometimes returning 100x quicker.

Alas, the postgres codebase is hideously complex, and implementing the above to production grade would be many months' work - and, due to the way postgres streams results to the client, it might have actually required a change to the wire format.
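Stock postgres has nothing like this, but you can fake a crude client-side approximation of the same idea: cap the runtime at a multiple of what the cost estimate implies, and on timeout retry with a planner knob flipped. A sketch (the query, thresholds, and knob choice are all invented for illustration):

    -- Attempt 1: let the planner pick, but give up early.
    SET statement_timeout = '500ms';
    SELECT count(*) FROM orders WHERE status = 'pending';

    -- On timeout: steer the planner away from the plan it chose
    -- (here, sequential scans) and retry with a bigger budget.
    SET enable_seqscan = off;
    SET statement_timeout = '5s';
    SELECT count(*) FROM orders WHERE status = 'pending';

It's not the same as mid-query replanning - you throw away the work already done - but it catches the same class of pathological plans.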


In this case it was a missing index, on a query run for every order line when saving - aka a lot.

It had gone under the radar because the volume with our other customers had been low enough that the table scan of the queried table wasn't noticed.

As mentioned, a customer suddenly got 1000x the volume and it quickly became an issue.

But yeah, we have a job running on weekends to recalculate statistics on key tables, as we've had stale statistics grind production to a halt before.
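In Postgres terms (assuming Postgres; table name invented), that kind of job can be as small as an ANALYZE per key table, or a storage-parameter tweak so autovacuum re-analyzes hot tables more eagerly:

    -- Recompute planner statistics for one table.
    ANALYZE order_lines;

    -- Or have autovacuum re-analyze after ~1% of rows change
    -- instead of the default 10%.
    ALTER TABLE order_lines
        SET (autovacuum_analyze_scale_factor = 0.01);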

And recently I sped up a query 100x by removing a filter from the where clause that, for some weird reason, caused the DB to pick a really poor plan. It was just a simple check intended to filter out some duplicates in a few edge cases, but I couldn't find a way to make the DB run it as a post-predicate. Moved it to my code for the 100x performance win...
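Roughly the shape of it, with names invented: the extra predicate wrecks the planner's row estimates, so dropping it from the SQL and deduplicating in application code keeps the fast plan.

    -- Before: the dedup check drags the planner off the good index.
    SELECT id, sku, qty
      FROM order_lines
     WHERE order_id = 42
       AND id NOT IN (SELECT dup_id FROM duplicate_lines);

    -- After: run the cheap indexed query; filter duplicates in code.
    SELECT id, sku, qty
      FROM order_lines
     WHERE order_id = 42;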


> ...an unexpected 1000x increase in volume at a customer brought the DB to its knees.

From an outside perspective it seems like the huge increase in volume was more the issue! It sounds like an index helped a lot, but it would also have added cost for all those customers who didn't see that jump in volume.


Well, of course the volume had something to do with it, but adding the missing index meant the system could easily handle that volume.

The other customers pay for that index as well, of course, but either their volume is low enough that the cost is trivial, or it's large enough that they too see an increase in speed.



