The analysis in the "Why does sorting win?" section of this article gave us a way to estimate that threshold.
Here's my attempt at it, based on that analysis:
Suppose each item key requires s bytes.
For the hash table, assuming s <= 64 bytes (our cache line size), each insert needs to read one cache line and write one cache line, i.e. 2 * 64 = 128 bytes of bandwidth per item.
The bandwidth to sort one key is p(N) * 2 * s, where p(N) is the number of passes of 1024-bucket radix sort required to sort N elements, and the 2 comes from 1 read + 1 write per pass.
p(N) = ceil(log2(N) / log2(buckets))
Suppose N is the max number of items we can distinguish with an s = 8 byte item key, so N = 2^64.
Then p(N) = ceil(64 / 10) = 7 passes.
7 radix passes * (1 read + 1 write) * 8 bytes per item key = 112 bytes of bandwidth consumed, which is still less than the hash table's 2 * 64 = 128 bytes.
We haven't found the threshold yet.
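Here's a minimal sketch of that back-of-the-envelope model in Python (not from the article; the names and constants just encode the assumptions above):

```python
import math

CACHE_LINE = 64   # assumed cache line size in bytes
BUCKETS = 1024    # 1024-bucket radix sort, i.e. 10 bits per pass

def radix_passes(n):
    """p(N) = ceil(log2(N) / log2(buckets))"""
    return math.ceil(math.log2(n) / math.log2(BUCKETS))

def sort_bandwidth(n, s):
    """1 read + 1 write of the s-byte key per radix pass."""
    return radix_passes(n) * 2 * s

def hash_bandwidth(s):
    """1 cache line read + 1 cache line write per insert (assumes s <= 64)."""
    assert s <= CACHE_LINE
    return 2 * CACHE_LINE

print(sort_bandwidth(2**64, 8))   # 7 passes * 2 * 8 = 112 bytes per item
print(hash_bandwidth(8))          # 2 * 64           = 128 bytes per item
```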
We need to either increase the item count N or increase the item key size s beyond 8 bytes to find a threshold where this analysis estimates that the hash map uses less bandwidth. But we can't increase the item count N without first increasing the key size s. Suppose we just increase the key size and leave N as-is.
Assuming N = 2^64 items and a key size of s = 10 bytes gives an estimate of 140 bytes of bandwidth for sorting vs 128 bytes for the hash table. We expect sorting to be slower for these values, and increasing s or N further should make it even worse.
(The bandwidth required for the hash map hasn't increased, as our 10-byte key is still smaller than a cache line.)
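A quick standalone check of those numbers, under the same assumptions as the sketch above:

```python
import math

passes = math.ceil(64 / math.log2(1024))   # still 7 passes for N = 2^64
print(passes * 2 * 10)                     # 140 bytes to sort a 10-byte key
print(2 * 64)                              # 128 bytes for the hash table
```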
edit: the above is wrong because I was mixing up elements and items. Elements are not required to be unique items; if every element were a unique item, the problem would be trivial. So N is not bounded by the key size, and we can increase N beyond 2^64 without increasing the key size s beyond 8 bytes.
Keeping the key size s fixed at 8 bytes per item suggests the threshold is at N = 2^80 items,
given by solving for N where the bandwidth estimate for sorting equals the bandwidth estimate for the hash table: p(N) * 2 * 8 = 2 * 64 gives p(N) = 8, and the largest N that still sorts in 8 ten-bit passes is 2^80.
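A quick sanity check of that threshold, again under the same model assumptions (a sketch, not the article's code):

```python
import math

def sort_bandwidth(n, s=8, buckets=1024):
    # p(N) * (1 read + 1 write) * s bytes per key
    return math.ceil(math.log2(n) / math.log2(buckets)) * 2 * s

HASH_BANDWIDTH = 2 * 64   # 128 bytes per item

# p(N) * 2 * 8 == 128 when p(N) == 8, and 8 passes of 10 bits
# cover keys up to 2^80, so the model breaks even at N = 2^80:
print(sort_bandwidth(2**80), HASH_BANDWIDTH)   # 128 vs 128 -- break-even
print(sort_bandwidth(2**81), HASH_BANDWIDTH)   # 144 vs 128 -- sorting now loses
```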
>This analysis would suggest a 2.7× speedup vs. hash tables: 128 bytes vs. 48 bytes of memory traffic per uint64
It's OK to waste bandwidth; it doesn't directly impact performance. However, the limitation you are hitting (which does directly impact performance) is the number of memory accesses you can do (per cycle, for instance) and the latency of each memory access. With linear access, after some initial read, data is ready immediately for the CPU to consume. For scattered data (hash tables), you have to pay a penalty on every read.
So the bandwidth ratio is not the right factor to look at to estimate performance.
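For what it's worth, here's a rough sketch of how to see that effect (not from the article): sum the same array once with streaming reads and once gathered through a random permutation of indices. The gather also materializes a copy, so it isn't a pure latency measurement, but the scattered-read penalty dominates.

```python
import time
import numpy as np

n = 1 << 25                          # ~32M uint64s, ~256 MB, well beyond cache
data = np.arange(n, dtype=np.uint64)
perm = np.random.permutation(n)      # random (scattered) access pattern

t0 = time.perf_counter()
total_seq = data.sum()               # linear, prefetcher-friendly reads
t1 = time.perf_counter()
total_scat = data[perm].sum()        # gather: mostly cache-missing reads
t2 = time.perf_counter()

assert total_seq == total_scat
print(f"sequential: {t1 - t0:.3f}s  scattered: {t2 - t1:.3f}s")
```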
The big O complexity makes assumptions that break down in this case. E.g. it "ignores" memory access cost, which seems to be a key factor here.
[edit] I should have said "basic big O complexity" makes assumptions that break down. You can of course decide to model memory access as part of "big O", which is just a mathematical model.
I could imagine the hash table wins again beyond an even greater threshold. Like, what about 120 GB and beyond?