
The article writer (Nathan) used uniformly distributed cryptographically secure random numbers as his test data. Any decent hash for integers that doesn't throw bits away is going to produce a nice even distribution for that kind of sequence no matter what the divisor.

The problem with hash tables is bad hash functions, or hash functions which perform badly on certain data, where "perform badly" means recurrence of certain patterns, usually "latent periods", i.e. too many function results that are divisible by some number or one of some set of numbers. If those numbers have factors in common with the bucket count, then there's going to be trouble.

For example, a poor hash function might turn up even numbers twice as often as odd numbers; if the bucket count is also even, the even outputs land only in even-numbered buckets, which then carry twice the load of the odd ones, on average. A prime number of buckets guarantees that there's only one factor that can clash, rather than a bunch of them.
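To make the arithmetic concrete, here's a small sketch (the bias and the bucket counts are made up for illustration, not taken from the article's test code):

```python
from collections import Counter

# Hypothetical biased hash: two even outputs for every odd one.
outputs = [2 * i for i in range(20000)] + [2 * i + 1 for i in range(10000)]

def load_spread(m):
    """Ratio of the fullest bucket to the emptiest for bucket count m."""
    loads = Counter(x % m for x in outputs)
    return max(loads.values()) / min(loads.values())

# With an even bucket count, even outputs land only in even buckets,
# so those buckets carry twice the load; a prime count spreads evenly.
print(load_spread(16))  # 2.0: even buckets twice as full as odd ones
print(load_spread(17))  # close to 1.0: nearly uniform
```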



Hello, this is the author. (btw Nathan is a commenter, not me)

It was exactly my conclusion that the hash function and the data are more important than using mod prime.

Try to think of it this way. The ideal hash function takes your not so random data and outputs random data.

I = low-medium entropy data

H(I) = high entropy output equivalent to random numbers

bucket = (H(I) or random number) % m <--- experiment is here.

To simulate this, I used the model above, since that is exactly what defines a good hash function. If you have repetitive data in, and repetitive data out, no modulus is going to help you.
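A quick sketch of that model, with random.randrange standing in for a good hash function's output (the bucket counts are arbitrary, not the article's):

```python
import random
from collections import Counter

random.seed(0)  # reproducible stand-in for H(I): uniform 32-bit outputs
n = 100_000
hashed = [random.randrange(2**32) for _ in range(n)]

# Prime vs power-of-two bucket count: with uniform inputs, neither
# choice produces noticeably heavier buckets than the other.
for m in (1021, 1024):
    loads = Counter(h % m for h in hashed).values()
    print(m, min(loads), max(loads))
```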

So yes, the conclusion of your comment and mine are the same: Worry about your data and hash function. But many people came out of the woodwork trying to poke holes in the experiment.


Except for the last bit - a prime modulus ameliorates weaknesses in the hash function.

Most hash tables are written to support user-supplied hash functions, and users have to write their own hash functions to get value-based equality when storing custom types. So it makes a lot of sense for the hash table implementer to take weak hash functions into account.

Finally, there's a trade-off: sometimes a fast hash function with some weaknesses, plus a prime modulus, can be a better choice than a slower, more thorough hash function with, say, a power-of-two modulus (a bitwise AND).
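A sketch of how that trade-off can bite: the cheapest possible hash (identity) applied to 64-byte-aligned values, a common real-world key pattern. The numbers here are illustrative, not from the article.

```python
from collections import Counter

# Keys shaped like 64-byte-aligned addresses, "hashed" by identity.
keys = [64 * i for i in range(10000)]

def buckets_used(m):
    """How many of m buckets receive at least one key."""
    return len(Counter(k % m for k in keys))

# k & 1023 keeps only the low bits, which are mostly zero here, so a
# power-of-two table uses a small fraction of its buckets; a prime
# modulus mixes the high bits back in and uses all of them.
print(buckets_used(1024))  # 16 of 1024 buckets occupied
print(buckets_used(1021))  # all 1021 buckets occupied
```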


> If you have repetitive data in, and repetitive data out, no modulus is going to help you.

Um, no. A prime modulus will often help in this case. That's the whole point.


Do you have any statistical evidence?


In your test code, change your loop to:

    for r in [3, 5, 17, 257]:
        for i in xrange(n):
            num = r * i
            table_p[num%prime] += 1
            table_c[num%composite] += 1

Both here and on your blog, you keep falling back on the observation that modulo prime makes no difference when you have a uniform distribution of inputs, like you get after applying a good hash function.

Nobody who is telling you that you're wrong disputes that. The fundamental point is that it doesn't matter, because as a library writer providing a hashtable, you don't control the hash function. Having a simple way to mitigate the ill effects of users choosing bad hash functions is a good thing. Why do you keep ignoring this point?



