Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

I looked up nop sleds but don't see why they're used here. Explanation?


My best guess is to properly align instructions along a certain boundary, but I'm pulling that out of my hat.

You can put bcopy as a function call right before memmove, and then you don't need one function to call the other, which would cause a stack push. Maybe instructions need to be at addresses = 0 mod 16, so that's the "closest" you can get it. And spinning over ~12 NOPs might be faster than incrementing the PC by ~9.


Yes, there's been a recommendation floating around for a long time that branch/call targets should be 16-byte aligned. However, my experience is that modern x86 is basically almost completely insensitive to alignment in general (there are some small exceptions, like avoiding crossing cacheline boundaries in a loop.) It wouldn't surprise me if the NOPs only helped on earlier models, had no effect or even a negative effect on more recent CPUs, and were just left in out of tradition.


Depends on what you consider "modern". Core2 timings changed wildly based on code alignment (experienced myself and documented by others here: http://x264dev.multimedia.cx/archives/51).


Yeah, looks like it's for function alignment padding. It's a pretty common thing at the end of functions to have the next function start on a specific boundary. (even if the first function doesn't fall into the other)

I haven't tested, but I'd bet good money that 12 NOPs would be faster than a jmp.


You can do an unconditional jump every 1 or 2 cycles, depending on the chip, whereas no chip I know of can execute more than 4 nops per cycle. Therefore I would say the jump is probably marginally faster than 12 nops.

Smart toolchains will turn those 12 bytes into 2 multi-byte nops, e.g., a 9-byte one and a 3-byte one.


What does a 9byte NOP look like?


https://github.com/sbcl/sbcl/blob/master/src/compiler/x86-64... has

0x66 0x0f 0x1f 0x84 0x00 0x00 0x00 0x00 0x00

That's a size override prefix, followed by the dedicated NOP instruction (0x0f 0x1f), and finally 6 bytes to encode an effective address with offset.


Multi-byte nops have compatibility issues on some of the more obscure 32-bit x86 CPUs, unfortunately: https://sourceware.org/bugzilla/show_bug.cgi?id=13675


Right… you have to check cpuid for the long nop feature. I believe 0x66 0x90 is compatible (but slow, I would expect) with older CPUs.


    nop word ptr [eax+eax+0] ; 66 0f 1f 84 00 00 00 00 00


Do JMPs invalidate caches? That was the story I was telling myself where 12 NOPs would be faster than JMPing; I don't know if it's true.


Loops can be implemented with JMPs, and it would be Very Bad if every iteration of a loop invalidated caches. (In fact, it would be Very Bad if just about anything common invalidated caches, given how important they are to modern CPU performance.)


I don't know what exactly you mean by that, but I'm going with no. Unconditional jumps do interact with the uops cache in recent Intel chips, but they do so by terminating the current uops cache line---which is generally desirable.


Long nops would be marginally quicker.




Consider applying for YC's Summer 2026 batch! Applications are open till May 4

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: