
Bit operations are fast, branches are slow.


Well, yes, but most of these branches can be if-converted into conditional moves anyway if you want.

That makes them data dependencies rather than control dependencies. There are still some control ones.
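What if-conversion does at the machine level (a cmov instead of a jump) can be imitated in source. A toy sketch of a branch-free select, with cond restricted to 0 or 1 (purely illustrative; the compiler normally does this for you):

```python
def select(cond: int, a: int, b: int) -> int:
    """Branch-free 'a if cond else b' for cond in {0, 1}.

    -cond is all-ones when cond == 1 (two's complement), so the mask
    keeps a ^ b; when cond == 0 the mask is zero and we get b back.
    """
    return b ^ ((a ^ b) & -cond)
```

With cond == 1 the XORs cancel down to a; with cond == 0 the result is b. Either way there is no branch for the predictor to guess.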

Beyond that, let me be super clear: Imagine the following six versions:

1. One window at a time processing. No early exit, no bit munging.

2. One window at a time processing. No early exit, bit munging.

3. One window at a time processing. No direct bit munging, but you exit the window early when you hit a repeated character, and skip the window forward to just past the first instance of that character.

I.e. given "hmma<something>" with n=4, you exit early at the second m and skip processing forward to just after the first m (since no window before that can be unique, as they would all contain the double m).

4. One window at a time processing, bit munging, same otherwise as #3

5. Sliding window processing, no bit munging

6. Sliding window processing, bit munging.

The speedup going from versions 1-2 to versions 3-6 is much greater than the speedup from 5 to 6 or from 3 to 4.

You could probably make #3 pretty darn competitive with #6, and #4 could probably beat #6 with SIMD (i.e. processing multiple windows in parallel really fast, possibly wasting work, versus guaranteeing you only do the minimum work needed to find a unique window).
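A sketch of version 3 along the lines described above (names mine; a plain dict stands in for the hash map):

```python
def first_unique_window_v3(s: str, n: int):
    """Early exit + skip-forward (version 3). No bit tricks; on a repeat,
    jump the window to just past the first occurrence of the repeated
    character, since no earlier window can be unique."""
    i = 0
    while i + n <= len(s):
        seen = {}
        repeat_at = -1
        for j in range(i, i + n):
            if s[j] in seen:
                repeat_at = seen[s[j]]   # first occurrence of the repeat
                break
            seen[s[j]] = j
        if repeat_at < 0:
            return i + n                 # window was fully unique
        i = repeat_at + 1                # skip the window forward
    return None
```

On "hmmabcd" with n=4 this exits the first window at the second m and resumes at index 2, exactly the skip described above.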


I don't think you are correct, certainly not with a word size of 4. For very large word sizes the early return will be a big win, but you are really underrating how expensive the hash map is compared to "bit munging".

You could roll your own perfect hash, but you would end up with a solution that looks almost identical to TFA's. If you use a stdlib hash map, you are going to be chasing pointers all over memory to do a single insert or lookup. Dereferencing a single pointer that isn't in cache costs 50-100 clock cycles on a modern system. By the time you do one insert, TFA will have XOR'd at least 32 chars into its bit mask. And your cache behavior won't be as nice as TFA's, slowing you down even more.

Given the two scenarios: 1) bit munging, no early return; 2) hash lookup, early return.

I would guess scenario 1 wins until word size gets above ~256. And obviously, we can add an early return to scenario 1 to make it unquestionably the fastest.
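That combination (scenario 1 plus an early return) might look like this sketch, assuming the per-window OR-mask approach (names mine):

```python
def first_unique_window_mask_early(s: str, n: int):
    """Bit munging + early return: build each window's mask with OR and
    bail out of the window as soon as a character's bit is already set."""
    for i in range(len(s) - n + 1):
        mask = 0
        for j in range(i, i + n):
            bit = 1 << ord(s[j])
            if mask & bit:        # repeat: this window can't be unique
                break
            mask |= bit
        else:
            return i + n          # inner loop finished: all n distinct
    return None
```

Each duplicate test is a single AND against a register-resident mask, so the early return costs nothing like a hash lookup does.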



