
Sometime around 2000, I tried to hand optimise an image processing routine in x86 assembly. Previously I'd only done 6502.

My first attempt was massively slower than the compiled code. I had to learn about all the pipelining that was going on, and the other hardware optimisations.

When I discovered that putting a NOP in a tight loop sped up the code, I realised I wasn't going to beat the compiler most of the time!



I also had that NOP thing happen to me once, with the version with the extra NOP running 2x as fast! It took a couple of days until I finally figured out what was going on.

After much investigation, I found that the original code without the NOP was actually running at only half the speed it should have. Due to very bad luck, the addresses of the jump targets in the inner loop were placed in a configuration where the branch predictor failed to predict the jumps (perhaps because of collisions in the internal "hash tables" used by the branch predictor). Any nudge to the executable layout would get the program out of the pathological configuration: a different compiler version, a different OS, or a different CPU model all did the trick. But the most fun, of course, was that adding or removing a NOP also made the difference :)
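
To illustrate (this is not the original code): a minimal NASM-style sketch, assuming x86-64 and the System V calling convention, with a made-up routine name. The only point is that a single one-byte NOP shifts the address of every label after it, which changes how those branch targets index the predictor's internal tables:

    section .text
    global bump_bytes

    ; bump_bytes(uint8_t *buf /* rdi */, size_t n /* rsi */):
    ; increment every byte in a buffer.
    bump_bytes:
        xor     rax, rax            ; index = 0
        nop                         ; the one-byte "nudge": adding or removing it
                                    ; moves .loop and .done by one byte
    .loop:
        cmp     rax, rsi            ; processed the whole buffer?
        jae     .done
        inc     byte [rdi + rax]    ; buf[index]++
        inc     rax
        jmp     .loop               ; branch target whose address just shifted
    .done:
        ret

Whether that one-byte shift helps or hurts depends on which predictor entries the old and new addresses happen to collide with, which is why it looks like voodoo from the outside.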


Raymond Chen has an entire article on the use of NOP in Windows 95 code. In one case, they had to fix a bug with a 32-bit NOP, because using a 16-bit NOP would introduce a different bug!

https://devblogs.microsoft.com/oldnewthing/20110112-00/?p=11...


Was it instruction alignment or something more mysterious?

My anecdote from hand optimization is that I didn't even know some instructions existed.


Really insightful comment.

It's true. C compilers are so optimized that the days of rewriting in assembly have long passed for most of us.


It's the dance between compilers and chip designs that unmakes hand work.



