Hacker News

Moral of the post: don't do premature optimization. It's a common adage, but this is a good reminder and a concrete example of it.

Aside from one case, where OP argued that the queries were so rare compared to the data updates that maintaining the index was more expensive than the scans it saved. Which is also a pretty classic point when you're taught about indexes.

What I recently learned the hard way about indexes is that they're slow when you have to read "large" chunks of a table to find an answer, in my case computing a median for an area. I'm indexing geospatial data, and Europe+NA are overrepresented. When you view an area the size of ~Germany, the query would take 20 minutes and compute the median over ~3% of all records, whereas a full table scan (whole world) took something like 40 seconds, with the median being computed over the same 3%; it would just evaluate the WHERE clause against every row instead of finding rows to include via the index. That's the power of a sequential read compared to walking an R-tree.

I haven't had such bad experiences with B-tree indexes, though I'm not sure whether they would behave just as badly here. On the other hand, if you ask for an area the size of Amsterdam, the index is normally fast, and for <50 rows it's basically instant, whereas the full table scan would still take the same 40 seconds.
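To make the shape of that query concrete, here's a rough sketch of the two plans in MariaDB-style SQL. The table, column, and index names are invented for illustration (not the actual schema), and the bounding box is only an approximation of Germany:

```sql
-- Hypothetical schema: one row per data point, with an index
-- on (lat, lon). All identifiers here are made up.

-- Index path: for a Germany-sized box (~3% of all rows) the
-- planner walks the index, then fetches each matching row
-- individually: lots of random I/O on unclustered data.
SELECT DISTINCT MEDIAN(value) OVER ()
FROM points
WHERE lat BETWEEN 47.3 AND 55.1
  AND lon BETWEEN  5.9 AND 15.0;

-- Full-scan path: read the whole table sequentially and let the
-- WHERE clause filter every row. In MariaDB you can force this
-- with an index hint on the (hypothetical) index:
SELECT DISTINCT MEDIAN(value) OVER ()
FROM points IGNORE INDEX (points_latlon_idx)
WHERE lat BETWEEN 47.3 AND 55.1
  AND lon BETWEEN  5.9 AND 15.0;
```

The sequential plan wins once the matching fraction is large enough that per-row random fetches cost more than streaming the whole table.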



The cause of this is likely the N+1-like behaviour of indexes on unclustered data. Two options for speeding this up a ton (they don't make sense to use together):

1. Cluster your data using the GiST/R-tree index. The PostGIS docs [1] have a great explanation.

2. Use the R-tree as a covering index, i.e. add your associated data as extra dimensions in the index. The GiST index becomes (lat, long, weather_c, avg_altitude), etc. This avoids having to load a separate heap page for each row matched in the spatial index [2].

[1]: https://postgis.net/workshops/postgis-intro/clusterindex.htm...

[2]: https://www.postgresql.org/docs/current/indexes-index-only-s...
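For reference, the two options might look roughly like this in PostgreSQL/PostGIS syntax. The table and column names are illustrative (borrowed from the example above), and INCLUDE columns on GiST indexes require PostgreSQL 12+:

```sql
-- Option 1: physically reorder the table to match the spatial
-- index, so rows that are close in space are close on disk.
-- CLUSTER is a one-time rewrite; it is not maintained as new
-- rows arrive, so it suits mostly-static data.
CREATE INDEX measurements_geom_idx
  ON measurements USING GIST (geom);
CLUSTER measurements USING measurements_geom_idx;

-- Option 2: make the index covering, so index-only scans can
-- answer the query without touching the heap at all.
CREATE INDEX measurements_covering_idx
  ON measurements USING GIST (geom)
  INCLUDE (weather_c, avg_altitude);
```

Option 1 keeps the index small but pays a periodic rewrite; option 2 fattens the index but keeps it useful for freshly inserted rows.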


Thanks! I don't think MariaDB offers the former, but it's good to know this exists, and I may want to look elsewhere. This isn't the only thing I ran into with MariaDB, but I'm also hesitant to commit to maintaining another service into eternity when the existing database has done fine for everything (more than a decade of projects) up until this project. If I had to redo it, I'd definitely try this out to see if it's worth it! And at work it's a different equation anyway. So it's useful info :)

The latter I considered, but I didn't think it would cut the time by more than half, whereas it would need an order of magnitude or more of a speedup to be useful. Doing a full table scan and caching the result for certain zoom levels was the solution for me.


Adding an index is an incredibly expensive task on a dataset so big that it needs one to be added.

It's something that is likely to require massive engineering effort on a live system.

Removing an index, on the other hand, is never an issue.

It's not at all premature optimization, it's simply basic software design.


Read OP's post please. They're talking about systems where it was acceptable to always do full table scans, in one case on a table of 350 rows.

Not "an incredibly expensive task on a dataset so big that it needs one". That's why I read OP's post as recommending against premature optimization (without using those words literally).



