Sometime around 2000, I tried to hand-optimise an image processing routine in x86 assembly. Previously I'd only done 6502.
My first attempt was massively slower than the compiled code. I had to get my head around all the pipelining and other hardware optimisations going on under the hood.
When I discovered putting a NOP in a tight loop sped up the code, I realised I wasn't going to beat the compiler most of the time!
I also had that thing with the NOP happen to me once: the version of the program with the extra NOP ran 2x as fast! It took a couple of days until I finally figured out what was going on.
After much investigation, what I found was that the original code without the NOP was actually running at only half the speed it should have. Through very bad luck, the addresses of the jump targets in the inner loop were placed in a configuration where the branch predictor failed to predict the jumps (perhaps because of collisions in the internal "hash tables" used by the predictor). Any nudge to the executable's layout would get the program out of the pathological configuration: a different compiler version, a different OS, or a different CPU model all did the trick. But the most fun of course was that adding or removing a NOP also made the difference :)
Raymond Chen has an entire article on the use of NOP in Windows 95 code. In one case, they had to fix a bug with a 32-bit NOP, because using a 16-bit NOP would introduce a different bug!