
I'm skeptical that "hashing is much faster on an array the size of a power of two".

Also, if memory is actually an issue, he should store hashes instead of full objects (assuming the byte arrays are more than 16 bytes).
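One way to sketch that suggestion, assuming occasional 64-bit collisions are acceptable for the use case, and using FNV-1a purely as a stand-in for whatever hash the code actually uses:

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: dedup by storing 8-byte hashes instead of full byte arrays.
// Caveat: distinct inputs can collide on a 64-bit hash, so this is approximate.
public class HashOnlyDedup {
    // FNV-1a 64-bit hash; any well-mixed 64-bit hash would do.
    static long fnv1a64(byte[] data) {
        long h = 0xcbf29ce484222325L;
        for (byte b : data) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;
        }
        return h;
    }

    public static void main(String[] args) {
        Set<Long> seen = new HashSet<>();   // 8 bytes per entry instead of a full array
        byte[] a = {1, 2, 3};
        byte[] b = {1, 2, 3};
        System.out.println(seen.add(fnv1a64(a)));  // true: first time this content is seen
        System.out.println(seen.add(fnv1a64(b)));  // false: duplicate content
    }
}
```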



IIRC, the real reason for this "power of 2" is fast rehashing when the table grows. Calculate a 32-bit hash and store all of it, but if your table currently has 1024 entries, only use 10 bits of it. Later, when the table grows to 2048 entries, you can just mask off 11 bits' worth instead of recomputing every hash.
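A minimal sketch of that masking trick (the hash value and class name are hypothetical):

```java
// Hypothetical sketch: growing a power-of-two table without rehashing.
public class MaskDemo {
    public static void main(String[] args) {
        int hash = 0xBEEF1234;               // stored 32-bit hash, computed once

        int oldIndex = hash & (1024 - 1);    // table of 2^10 entries: low 10 bits
        int newIndex = hash & (2048 - 1);    // after growth to 2^11: low 11 bits

        // Bit 10 is the only new bit, so each entry either stays put or
        // moves up by exactly 1024 -- no hash recomputation needed.
        System.out.println(oldIndex + " -> " + newIndex);  // 564 -> 564
    }
}
```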

But, since the hash table is not going to grow in this particular case, any perf gain from using a power of 2 will probably be far less important.


He's using a power of 2 so he can use a bitwise AND to reduce his hash to a valid array index.


It's fairly common to keep array sizes bounded at a power of two, because you can then use a bitwise mask instead of a modulus to determine the hash bucket. Depending on the performance of your hash function it can have a noticeable impact, although you're right that it's probably not a matter of 'much' faster.
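The two reduction strategies can be sketched like this (class and values are hypothetical):

```java
// Hypothetical sketch: two ways to reduce a 32-bit hash to a bucket index.
public class BucketIndex {
    // General case: works for any table size, but costs an integer division.
    static int indexByMod(int hash, int tableSize) {
        return (hash & 0x7fffffff) % tableSize;  // clear sign bit so result is non-negative
    }

    // Power-of-two case: (size - 1) is an all-ones mask for the low bits,
    // so a single AND picks the bucket.
    static int indexByMask(int hash, int powerOfTwoSize) {
        return hash & (powerOfTwoSize - 1);
    }

    public static void main(String[] args) {
        int h = 123456789;
        System.out.println(indexByMod(h, 1024));   // 277
        System.out.println(indexByMask(h, 1024));  // 277 -- same answer, cheaper operation
    }
}
```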


Using modulo with a non-power-of-two table size is going to introduce modulo bias: 2^32 isn't evenly divisible by the size, so some buckets receive one more possible hash value than others.
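A toy illustration of the bias, assuming uniform 8-bit hashes so the effect is visible (with 32-bit hashes the mechanism is the same but the skew is far smaller):

```java
// Toy demonstration of modulo bias with uniform 8-bit "hash" values.
public class ModuloBias {
    public static void main(String[] args) {
        int[] counts = new int[3];
        for (int h = 0; h < 256; h++) counts[h % 3]++;
        // 256 = 3 * 85 + 1, so bucket 0 receives one extra value.
        System.out.println(counts[0] + " " + counts[1] + " " + counts[2]);  // 86 85 85
    }
}
```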


I bet that there's no measurable difference in this code. It just bugs me to see so much unnecessary complication based on blindly imitating an optimization that maybe made sense 20 years ago.


I've noticed the difference between modulo and bitwise AND in Java code written in the last five years. If it's getting 99.9+% cache hits, is being used in an innermost loop, and jdk7 hasn't gotten around to JIT-compiling special cases of modulo, this could absolutely matter.


It was 10% faster to switch from modulo to & on a recent real world problem for me.


I believe you. There are two reasons I think paul's right in the context of hash tables.

1. If you're using a non-trivial hash function, one more modulo calculation at the end to get the actual memory address is not a big difference.

2. If you're using a trivial hash function, I bet for a lot of data sets you'll get fewer collisions with a hash table whose size is a prime number, canceling out any benefit from calculating the address slightly faster.
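Point 2 can be sketched with a deliberately trivial identity hash and strided keys (both hypothetical): keys that share a common factor with a power-of-two table size pile into a few buckets, while a prime size spreads them out.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: trivial identity hash, keys that are all multiples of 8.
public class PrimeVsPow2 {
    static int distinctBuckets(int tableSize) {
        Set<Integer> buckets = new HashSet<>();
        for (int key = 0; key < 1024; key += 8) {  // 128 strided keys
            buckets.add(key % tableSize);          // identity "hash"
        }
        return buckets.size();
    }

    public static void main(String[] args) {
        System.out.println(distinctBuckets(16));  // power of two: only 2 buckets used
        System.out.println(distinctBuckets(17));  // prime: all 17 buckets used
    }
}
```

Real hash tables (e.g. java.util.HashMap) keep the power-of-two size but mix the high bits of the hash into the low bits first to avoid exactly this pattern.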




