I looked up nop sleds but don't see why they're used here. Explanation?

danielweber · on Dec 1, 2014

My best guess is to properly align instructions along a certain boundary, but I'm pulling that out of my hat.

You can put bcopy as a function call right before memmove, and then you don't need one function to call the other, which would cause a stack push. Maybe instructions need to be at addresses = 0 mod 16, so that's the "closest" you can get it. And spinning over ~12 NOPs might be faster than incrementing the PC by ~9.
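
To make the "addresses = 0 mod 16" guess concrete, here is a minimal sketch of how much filler a toolchain would need to emit to reach the next boundary. The function name and the sample offsets are made up for illustration; the 16-byte boundary is the assumption from the comment above, not something confirmed for this particular binary.

```python
def padding_to_boundary(address: int, boundary: int = 16) -> int:
    """Bytes of filler needed so the next instruction starts at a multiple of `boundary`."""
    if boundary <= 0 or boundary & (boundary - 1):
        raise ValueError("boundary must be a positive power of two")
    # Python's % yields a non-negative result for a positive modulus, so
    # -address % boundary is the distance to the next boundary (0 if already aligned).
    return -address % boundary


# Hypothetical end-of-function offsets, chosen only for illustration.
for end in (0x10, 0x14, 0x1F):
    print(hex(end), "->", padding_to_boundary(end), "padding bytes")
```

A function that ends at an offset of 4 mod 16 would need 12 bytes of padding, which is consistent with the roughly 12 NOPs being discussed.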

userbinator · on Dec 2, 2014

Yes, there's been a recommendation floating around for a long time that branch/call targets should be 16-byte aligned. However, my experience is that modern x86 is almost completely insensitive to alignment in general (with some small exceptions, such as loops that cross cache-line boundaries). It wouldn't surprise me if the NOPs only helped on earlier models, had no effect or even a negative effect on more recent CPUs, and were just left in out of tradition.

derf_ · on Dec 2, 2014

Depends on what you consider "modern". Core2 timings changed wildly based on code alignment (experienced myself and documented by others here: http://x264dev.multimedia.cx/archives/51).

cjubb39 · on Dec 1, 2014

Yeah, looks like it's function alignment padding. It's pretty common to pad the end of a function so that the next function starts on a specific boundary (even if the first function doesn't fall through into the next).

I haven't tested, but I'd bet good money that 12 NOPs would be faster than a jmp.

pbsd · on Dec 1, 2014

You can do an unconditional jump every 1 or 2 cycles, depending on the chip, whereas no chip I know of can execute more than 4 nops per cycle, so 12 single-byte nops take at least 3 cycles. Therefore I would say the jump is probably marginally faster than 12 nops.

Smart toolchains will turn those 12 bytes into 2 multi-byte nops, e.g., a 9-byte one and a 3-byte one.
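
As a rough sketch of that splitting step: a greedy split that uses the largest NOP available until the gap is filled. The 9-byte maximum is an assumption taken from the example in this comment; real assemblers pick their own maximum and may choose a different breakdown. The function name is hypothetical.

```python
def split_padding(total: int, max_nop: int = 9) -> list[int]:
    """Greedily split `total` padding bytes into NOP lengths of at most `max_nop`."""
    if total < 0 or max_nop < 1:
        raise ValueError("total must be >= 0 and max_nop must be >= 1")
    sizes = []
    while total > 0:
        size = min(total, max_nop)  # take the longest NOP that still fits
        sizes.append(size)
        total -= size
    return sizes


print(split_padding(12))  # [9, 3]
```

Two instructions instead of twelve means fewer decode slots spent on padding, which is the point of preferring long NOPs.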

0x0 · on Dec 1, 2014

What does a 9-byte NOP look like?

pkhuong · on Dec 1, 2014

https://github.com/sbcl/sbcl/blob/master/src/compiler/x86-64... has

0x66 0x0f 0x1f 0x84 0x00 0x00 0x00 0x00 0x00

That's an operand-size override prefix (0x66), followed by the dedicated multi-byte NOP opcode (0x0f 0x1f), and finally 6 bytes (ModRM, SIB, and a 32-bit displacement) to encode an effective address with offset.
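
For anyone who wants to see the fields pulled apart, here is a small sketch that slices those 9 bytes into prefix, opcode, ModRM, SIB, and displacement. It only handles this one encoding shape; the function name and the dictionary keys are made up for illustration and this is not a general x86 decoder.

```python
NOP9 = bytes.fromhex("66 0f 1f 84 00 00 00 00 00")


def describe_long_nop(code: bytes) -> dict:
    """Split the 9-byte NOP (66 0f 1f + ModRM + SIB + disp32) into its fields."""
    if len(code) != 9 or code[0] != 0x66 or code[1:3] != b"\x0f\x1f":
        raise ValueError("expected the 9-byte form starting with 66 0f 1f")
    modrm, sib = code[3], code[4]
    return {
        "prefix": code[0],                # 0x66, operand-size override
        "opcode": code[1:3].hex(),        # two-byte NOP opcode
        "mod": modrm >> 6,                # 2 means a 32-bit displacement follows
        "reg": (modrm >> 3) & 0b111,      # opcode extension field
        "rm": modrm & 0b111,              # 4 means a SIB byte follows
        "scale": 1 << (sib >> 6),         # index multiplier
        "index": (sib >> 3) & 0b111,      # register number 0 is eax (rax in 64-bit mode)
        "base": sib & 0b111,              # register number 0 is eax (rax in 64-bit mode)
        "disp32": int.from_bytes(code[5:9], "little", signed=True),
    }


for name, value in describe_long_nop(NOP9).items():
    print(name, value)
```

The ModRM byte 0x84 decodes to mod=2, rm=4, and the SIB byte 0x00 gives base and index register 0 with scale 1, so the operand is an address of the form [eax+eax*1+0] that is never actually accessed.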

makomk · on Dec 2, 2014

Multi-byte nops have compatibility issues on some of the more obscure 32-bit x86 CPUs, unfortunately: https://sourceware.org/bugzilla/show_bug.cgi?id=13675

pkhuong · on Dec 2, 2014

Right… you have to check cpuid for the long nop feature. I believe 0x66 0x90 is compatible (but slow, I would expect) with older CPUs.

pbsd · on Dec 1, 2014

    nop word ptr [eax+eax+0] ; 66 0f 1f 84 00 00 00 00 00

danielweber · on Dec 2, 2014

Do JMPs invalidate caches? That was the story I was telling myself where 12 NOPs would be faster than JMPing; I don't know if it's true.

jlebar · on Dec 2, 2014

Loops can be implemented with JMPs, and it would be Very Bad if every iteration of a loop invalidated caches. (In fact, it would be Very Bad if just about anything common invalidated caches, given how important they are to modern CPU performance.)

pbsd · on Dec 2, 2014

I don't know what exactly you mean by that, but I'm going with no. Unconditional jumps do interact with the uops cache in recent Intel chips, but they do so by terminating the current uops cache line, which is generally desirable.

pkhuong · on Dec 1, 2014

Long nops would be marginally quicker.