Especially when targeting realistic machine models, the classical sorting algorithms like quicksort or mergesort turn out to be suboptimal in a number of ways. For example, a quicksort with a perfect choice of pivot (the exact median) incurs a branch miss with probability ~50% for every element, because each comparison against the pivot is an unpredictable coin flip. That's not something classical complexity analysis measures, but on actual CPUs, branch misses have quite an impact (something like 10–15 cycles each). Shameless plug for something I once built for a lecture to demonstrate this: https://github.com/lorenzhs/quicksort-pivot-imbalance
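To make the point concrete, here's a sketch of where those misses come from: a textbook Lomuto partition (names are mine, not from any particular library). With a median pivot and random data, the highlighted comparison is true for exactly half the elements in unpredictable order, so the branch predictor is reduced to guessing.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Textbook Lomuto partition: partitions a[lo..hi] around the pivot stored
// at a[hi], returns the pivot's final index. With a perfect (median) pivot
// and random input, `a[i] < pivot` is taken ~50% of the time in an
// unpredictable pattern -- roughly one branch miss every two elements.
std::size_t lomuto_partition(std::vector<int>& a, std::size_t lo, std::size_t hi) {
    int pivot = a[hi];            // assume the chosen pivot sits at a[hi]
    std::size_t store = lo;
    for (std::size_t i = lo; i < hi; ++i) {
        if (a[i] < pivot) {       // <-- the ~50/50, hard-to-predict branch
            std::swap(a[i], a[store]);
            ++store;
        }
    }
    std::swap(a[store], a[hi]);   // move pivot into its final position
    return store;
}
```

A skewed pivot (say, the 10th percentile) makes this branch heavily biased and thus cheap to predict, which is exactly the tradeoff the repo above measures.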
This is really cool — how did you develop the background knowledge to solve this? I'm trying to learn more about low-level stuff, and I would have no idea how to approach a problem like this.
For sorting algorithms that take the behaviour of modern CPUs into account, check out ips4o (https://arxiv.org/abs/1705.02257, code: https://github.com/SaschaWitt/ips4o), or, for a simpler algorithm that's still much faster than quicksort in most cases, BlockQuicksort (https://arxiv.org/abs/1604.06697, code: https://github.com/weissan/BlockQuicksort). Note that both papers were published within the last two years :)
Of course these algorithms are much more complex and error-prone to implement, and they use some additional memory, which may explain why they're not used in the standard library implementations of popular languages.
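For the curious, the core trick in BlockQuicksort can be sketched in a few lines (a simplified illustration of the paper's idea, not the library's actual API; the names are mine). Rather than branching on every comparison, it records the offsets of misplaced elements within a block using branch-free writes, then performs the swaps in a separate tight loop:

```cpp
#include <cassert>
#include <cstddef>

// Simplified sketch of BlockQuicksort's branch-free offset collection.
// Every iteration stores the current index unconditionally; the comparison
// result only bumps the write position. The data-dependent `if` becomes an
// add, so there is no branch for the predictor to miss.
std::size_t collect_offsets(const int* block, std::size_t n, int pivot,
                            unsigned char* offsets) {
    std::size_t num = 0;
    for (std::size_t i = 0; i < n; ++i) {
        offsets[num] = static_cast<unsigned char>(i);  // unconditional store
        num += (block[i] >= pivot);  // branch-free: comparison as arithmetic
    }
    return num;  // count of elements in this block that belong on the right
}
```

The full algorithm runs this over a block from each end of the array and then swaps the recorded misplaced pairs; only the block-boundary bookkeeping uses ordinary (and highly predictable) branches.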