
The article writer (Nathan) used uniformly distributed cryptographically secure random numbers as his test data. Any decent hash for integers that doesn't throw bits away is going to produce a nice even distribution for that kind of sequence no matter what the divisor.

The problem with hash tables is bad hash functions, or hash functions which perform badly on certain data, where "perform badly" means recurrence of certain patterns, usually "latent periods", i.e. too many function results that are divisible by some number or one of some set of numbers. If those numbers have factors in common with the bucket count, then there's going to be trouble.

For example, a poor hash function might turn up even numbers twice as often as odd numbers; if the bucket count is also even, the even outputs land only in even-numbered buckets, which then carry twice the load of the odd ones, on average. A prime number of buckets guarantees that there's only one factor that can clash, rather than a bunch of them.
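To make the arithmetic concrete, here's a small sketch (the bias and the bucket counts are made up for illustration, not taken from the article's test code):

```python
from collections import Counter

# Hypothetical biased hash: two even outputs for every odd one.
outputs = [2 * i for i in range(20000)] + [2 * i + 1 for i in range(10000)]

def load_spread(m):
    """Ratio of the fullest bucket to the emptiest for bucket count m."""
    loads = Counter(x % m for x in outputs)
    return max(loads.values()) / min(loads.values())

# With an even bucket count, even outputs land only in even buckets,
# so those buckets carry twice the load; a prime count spreads evenly.
print(load_spread(16))  # 2.0: even buckets twice as full as odd ones
print(load_spread(17))  # close to 1.0: nearly uniform
```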



Hello, this is the author. (btw Nathan is a commenter, not me)

It was exactly my conclusion that the hash function and the data are more important than using mod prime.

Try to think of it this way. The ideal hash function takes your not so random data and outputs random data.

I = low-medium entropy data

H(I) = high entropy output equivalent to random numbers

bucket = (H(I) or random number) % m <--- experiment is here.

To simulate this, I used the model above, since that is exactly what defines a good hash function. If you have repetitive data in, and repetitive data out, no modulus is going to help you.
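A quick sketch of that model, with random.randrange standing in for a good hash function's output (the bucket counts are arbitrary, not the article's):

```python
import random
from collections import Counter

random.seed(0)  # reproducible stand-in for H(I): uniform 32-bit outputs
n = 100_000
hashed = [random.randrange(2**32) for _ in range(n)]

# Prime vs power-of-two bucket count: with uniform inputs, neither
# choice produces noticeably heavier buckets than the other.
for m in (1021, 1024):
    loads = Counter(h % m for h in hashed).values()
    print(m, min(loads), max(loads))
```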

So yes, the conclusion of your comment and mine are the same: Worry about your data and hash function. But many people came out of the woodwork trying to poke holes in the experiment.


Except for the last bit - a prime modulus ameliorates weaknesses in the hash function.

Most hash tables are written to support user-supplied hash functions, and users have to write their own hash functions to get value-based equality when storing custom types. So it makes a lot of sense for the hash table implementer to take weak hash functions into account.

Finally, there's a trade-off: sometimes a fast hash function with some weaknesses, plus a prime modulus, can be a better choice than a slower, more thorough hash function with, say, a power-of-two modulus (a bitwise AND).
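A sketch of how that trade-off can bite: the cheapest possible hash (identity) applied to 64-byte-aligned values, a common real-world key pattern. The numbers here are illustrative, not from the article.

```python
from collections import Counter

# Keys shaped like 64-byte-aligned addresses, "hashed" by identity.
keys = [64 * i for i in range(10000)]

def buckets_used(m):
    """How many of m buckets receive at least one key."""
    return len(Counter(k % m for k in keys))

# k & 1023 keeps only the low bits, which are mostly zero here, so a
# power-of-two table uses a small fraction of its buckets; a prime
# modulus mixes the high bits back in and uses all of them.
print(buckets_used(1024))  # 16 of 1024 buckets occupied
print(buckets_used(1021))  # all 1021 buckets occupied
```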


> If you have repetitive data in, and repetitive data out, no modulus is going to help you.

Um, no. A prime modulus will often help in this case. That's the whole point.


Do you have any statistical evidence?


In your test code, change your loop to:

    for r in [3, 5, 17, 257]:
        for i in xrange(n):
            num = r * i
            table_p[num%prime] += 1
            table_c[num%composite] += 1

Both here and on your blog, you keep falling back on the observation that modulo prime makes no difference when you have a uniform distribution of inputs, like you get after applying a good hash function.

Nobody who is telling you that you're wrong disputes that. The fundamental point is that it doesn't matter, because as a library writer providing a hashtable, you don't control the hash function. Having a simple way to mitigate the ill effects of users choosing bad hash functions is a good thing. Why do you keep ignoring this point?



