Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

This is what SipHash was made for: it's a cryptographic MAC that's competitive in speed with other hash functions meant for hash table use. The paper describes its design, and the weaknesses with competitors like Python's randomized hashing:

https://www.131002.net/siphash/siphash.pdf

Some code for it here:

https://www.131002.net/siphash/



Interesting, though at 140 cycles to hash 16 bytes, it's much slower than state-of-the-art hash functions, which can hash 16 bytes in just over 16 cycles. If you seed your non-cryptographic hash function with a cryptographically random number, you can get the best of both worlds: fast hashing and collision resistance against an adversary.


Not really. The SipHash page contains proof-of-concept code that breaks common good non-cryptographic randomized hashes, i.e., it generates collisions regardless of the chosen seed.

Regarding performance, SipHash is essentially as fast as Python's randomized hash, and quite faster for long inputs (>16 bytes), cf. http://bench.cr.yp.to/results-auth.html#amd64-sandy .


That code appears to rely on having direct access to the hash function; this seems very unlikely to be leaked to an attacker. More interesting would be code that gets the secret with only access to ordered iteration on the table, which is more likely to leak.

I don't know anything about Python's hash function, all of those benchmarks are still an order of magnitude slower than CityHash (for example), which peaks at 0.16 cycles/byte.

I would be interested to see SipHash put through SMHasher, which is the most comprehensive hash function test suite I know of: http://code.google.com/p/smhasher/


Seed recovery (which requires access to hash function) is for Python hash. For CityHash, MurmurHash (2 and 3) there are seed-independent collisions.

Of course, it passes SMHasher with 10 score -- it's a cryptographic PRF.


> For CityHash, MurmurHash (2 and 3) there are seed-independent collisions.

Are you referring to the "Advanced hash flooding" section of the paper, or am I missing this in another place? That analysis appears to rely on the hash function being highly linear/periodic (ie. you can always add "l" to get back the same hash value). I don't think that City, Spooky, or Murmur are nearly this linear. For example, City is based on CRC32 (ie. polynomial division).

Don't get me wrong, I'm impressed that SipHash is so fast for a cryptographic hash. But modern non-cryptographic hash functions are so fast (especially when they're CPU-accelerated, like City is on Intel thanks to the CRC32 instruction) that I don't think SipHash can claim to be the obvious choice for all hash tables.

I am interested to know though if the random seed + City/Spooky/Murmur approach really does have practical hash flooding attacks.


It's pretty quick to generate a lot of collisions, and the set of inputs you get will collide independently of the seed. Look (CityHash64):

$ time ./citycollisions 1000000 > /dev/null

real 0m0.904s

user 0m0.439s

sys 0m0.008s

$ ./citycollisions 3

128-bit key 741b848b0065aec04bb25b26ca2c3e9

CityHash64( 8e69324ad2a005ff2148534148202020, 16 ) = d3290c3d600b8add

CityHash64( 7ae31886221136ba2148534148202021, 16 ) = d3290c3d600b8add

CityHash64( a9a6b9c6888e94ff2148534148202022, 16 ) = d3290c3d600b8add

$ ./citycollisions 3

128-bit key 5e2a23015c89da98172fbcb77ad98468

CityHash64( 8e69324ad2a005ff2148534148202020, 16 ) = 59d0b041b16594b

CityHash64( 7ae31886221136ba2148534148202021, 16 ) = 59d0b041b16594b

CityHash64( a9a6b9c6888e94ff2148534148202022, 16 ) = 59d0b041b16594b


Wow, collisions against the best non-cryptographic hash functions that are independent of seed... color me impressed! This does then seem to strongly suggest that a cryptographic hash function is the only way to defend against hash flooding attacks. Props to the SipHash authors, this is some ground-breaking work.


See Downloads section here https://www.131002.net/siphash/

Edit: wait, I'm mistaken -- these are not independent of seed.


I think they are, actually! The code just uses "seed" in a slightly confusing way -- it's seeding its own algorithm with "seed" but using "key" as the seed for the hash function.

This is really impressive work.


Aha, thanks for clearing that up.


Isn't Python open source? Along with a large quantity of server-side software? I'd assume everyone already knows my hash functions, I didn't make any new ones.


He means access to [seeded] hash function output.


The strategy you propose gives immunity to the more straightforward attacks on hash tables, but in practice tends to leak information about your secret key through differences in the time taken by hash table operations. If you check out pages 14--16 of the SipHash paper, it goes into a bit more detail about this. You may also find Appendix B entertaining.


Can you name some of these speedy state-of-the-art hash functions?



But you said 16 cycles.

"For example, hashing 16 bytes takes 141 Bulldozer cycles with SipHash-2-4, against 82 and 126 for CityHash and SpookyHash, and 600 for MD5."

https://www.131002.net/siphash/siphash.pdf (p.12)


According to CityHash's benchmarks (http://code.google.com/p/cityhash/source/browse/trunk/README), they can hash 8 bytes in 6ns and 64 bytes in 9ns on a 2.67GHz machine (so 16-24 cycles).




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: