The analysis in the "Why does sorting win?" section of this article gave us a way to estimate that threshold.
Here's my attempt at it, based on that analysis:
Suppose each item key requires s bytes.
For the hash table, assuming s <= 64 bytes (our cache line size), each insert needs to read one cache line and write one cache line, i.e. 2 * 64 = 128 bytes of bandwidth per item.
The bandwidth to sort one key is p(N) * 2 * s, where p(N) is the number of passes of 1024-bucket radix sort required to sort N elements, and the 2 comes from 1 read + 1 write per pass.
p(N) = ceil(log2(N) / log2(buckets))
Suppose N is the max number of items we can distinguish with an s = 8 byte item key, so N = 2^64.
Then p(N) = ceil(64 / 10) = 7 passes.
7 radix passes * (1 read + 1 write) * 8 bytes per item key = 112 bytes of bandwidth consumed, which is still less than the hash table's 2 * 64 = 128 bytes.
We haven't found the threshold yet.
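Here's a minimal sketch of that back-of-the-envelope model in Python (not from the article; the names and constants just encode the assumptions above):

```python
import math

CACHE_LINE = 64   # assumed cache line size in bytes
BUCKETS = 1024    # 1024-bucket radix sort, i.e. 10 bits per pass

def radix_passes(n):
    """p(N) = ceil(log2(N) / log2(buckets))"""
    return math.ceil(math.log2(n) / math.log2(BUCKETS))

def sort_bandwidth(n, s):
    """1 read + 1 write of the s-byte key per radix pass."""
    return radix_passes(n) * 2 * s

def hash_bandwidth(s):
    """1 cache line read + 1 cache line write per insert (assumes s <= 64)."""
    assert s <= CACHE_LINE
    return 2 * CACHE_LINE

print(sort_bandwidth(2**64, 8))   # 7 passes * 2 * 8 = 112 bytes per item
print(hash_bandwidth(8))          # 2 * 64           = 128 bytes per item
```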
We need to either increase the item count N or increase the item key size s beyond 8 bytes to find a threshold where this analysis estimates that the hash map uses less bandwidth. But we can't increase the item count N without first increasing the key size s. Suppose we just increase the key size and leave N as-is.
Assuming N = 2^64 items and a key size of s = 10 bytes gives an estimate of 140 bytes of bandwidth for sorting vs 128 bytes for the hash table. We expect sorting to be slower for these values, and increasing s or N further should make it even worse.
(The bandwidth required for the hash map hasn't increased, as our 10-byte key is still smaller than a cache line.)
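A quick standalone check of those numbers, under the same assumptions as the sketch above:

```python
import math

passes = math.ceil(64 / math.log2(1024))   # still 7 passes for N = 2^64
print(passes * 2 * 10)                     # 140 bytes to sort a 10-byte key
print(2 * 64)                              # 128 bytes for the hash table
```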
edit: the above is wrong because I was mixing up elements and items. Elements are not required to be unique items; if every element were a unique item, the problem would be trivial. So N is not bounded by the key size, and we can increase N beyond 2^64 without increasing the key size s beyond 8 bytes.
Keeping the key size s fixed at 8 bytes per item suggests the threshold is at N = 2^80 items,
given by solving for N where the bandwidth estimate for sorting equals the bandwidth estimate for the hash table: p(N) * 2 * 8 = 2 * 64 gives p(N) = 8, and the largest N that still sorts in 8 ten-bit passes is 2^80.
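A quick sanity check of that threshold, again under the same model assumptions (a sketch, not the article's code):

```python
import math

def sort_bandwidth(n, s=8, buckets=1024):
    # p(N) * (1 read + 1 write) * s bytes per key
    return math.ceil(math.log2(n) / math.log2(buckets)) * 2 * s

HASH_BANDWIDTH = 2 * 64   # 128 bytes per item

# p(N) * 2 * 8 == 128 when p(N) == 8, and 8 passes of 10 bits
# cover keys up to 2^80, so the model breaks even at N = 2^80:
print(sort_bandwidth(2**80), HASH_BANDWIDTH)   # 128 vs 128 -- break-even
print(sort_bandwidth(2**81), HASH_BANDWIDTH)   # 144 vs 128 -- sorting now loses
```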
>This analysis would suggest a 2.7× speedup vs. hash tables: 128 bytes vs. 48 bytes of memory traffic per uint64
It's OK to waste bandwidth; it doesn't directly impact performance. However, the limitation you are hitting (which does directly impact performance) is the number of memory accesses you can do (per cycle, for instance) and the latency of each memory access. With linear access, after some initial read, data is ready immediately for the CPU to consume. For scattered data (hash tables), you have to pay a penalty on every read.
So the bandwidth ratio is not the right factor to look at to estimate performance.
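For what it's worth, here's a rough sketch of how to see that effect (not from the article): sum the same array once with streaming reads and once gathered through a random permutation of indices. The gather also materializes a copy, so it isn't a pure latency measurement, but the scattered-read penalty dominates.

```python
import time
import numpy as np

n = 1 << 25                          # ~32M uint64s, ~256 MB, well beyond cache
data = np.arange(n, dtype=np.uint64)
perm = np.random.permutation(n)      # random (scattered) access pattern

t0 = time.perf_counter()
total_seq = data.sum()               # linear, prefetcher-friendly reads
t1 = time.perf_counter()
total_scat = data[perm].sum()        # gather: mostly cache-missing reads
t2 = time.perf_counter()

assert total_seq == total_scat
print(f"sequential: {t1 - t0:.3f}s  scattered: {t2 - t1:.3f}s")
```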
The big O complexity makes assumptions that break down in this case. E.g. it "ignores" memory access cost, which seems to be a key factor here.
[edit] I should have said "basic big O complexity" makes assumptions that break down. You can of course decide to model memory access as part of "big O", which is just a mathematical model.
I could imagine the hash table wins again beyond an even greater threshold. Like, what about 120 GB and beyond?