Sometime around 2000, I tried to hand-optimise an image processing routine in x86 assembly. Previously I'd only done 6502.
My first attempt was massively slower than the compiled code. I had to get my head around all the pipelining and other hardware optimisations going on under the hood.
When I discovered putting a NOP in a tight loop sped up the code, I realised I wasn't going to beat the compiler most of the time!
I also had that thing with the NOP happen to me once: the version of the program with the extra NOP ran 2x as fast! It took a couple of days until I finally figured out what was going on.
After much investigation, what I found was that the original code without the NOP was actually running at only half the speed it should have. Through very bad luck, the addresses of the jump targets in the inner loop were placed in a configuration where the branch predictor failed to predict the jumps (perhaps because of collisions in the internal "hash tables" used by the predictor). Any nudge to the executable's layout would get the program out of the pathological configuration: a different compiler version, a different OS, or a different CPU model all did the trick. But the most fun of course was that adding or removing a NOP also made the difference :)
Raymond Chen has an entire article on the use of NOP in Windows 95 code. In one case, they had to fix a bug with a 32-bit NOP, because using a 16-bit NOP would introduce a different bug!