My best guess is to properly align instructions along a certain boundary, but I'm pulling that out of my hat.
You can put bcopy as a function right before memmove and let it fall through, so one function doesn't have to call the other, which would cause a stack push. Maybe instructions need to be at addresses that are 0 mod 16, so that's the "closest" you can get it. And running through ~12 NOPs might be faster than jumping the PC forward by ~9.
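To make the "0 mod 16" idea concrete, here is a minimal sketch of the padding arithmetic. The function name and the example addresses are made up for illustration; nothing here is taken from the actual libc build.

```python
def padding_to_boundary(address: int, alignment: int = 16) -> int:
    """Return how many padding bytes are needed so that
    address + padding is a multiple of alignment.

    `address` is a hypothetical end-of-function address; `alignment`
    is assumed to be a power of two (16 in the comment above).
    """
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a positive power of two")
    # Python's % always returns a non-negative result for a positive
    # modulus, so this is the distance up to the next boundary.
    return -address % alignment


if __name__ == "__main__":
    # Hypothetical: the previous function ends at 0x1004, so the next
    # 16-byte boundary is 0x1010 and 12 bytes of padding are needed.
    print(padding_to_boundary(0x1004))  # prints 12
    # Already aligned: no padding needed.
    print(padding_to_boundary(0x1000))  # prints 0
```

With a code end address that is 4 mod 16, this gives the 12 bytes of filler being discussed.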
Yes, there's been a recommendation floating around for a long time that branch/call targets should be 16-byte aligned. However, my experience is that modern x86 is almost completely insensitive to alignment in general (there are some small exceptions, like avoiding crossing cacheline boundaries in a loop). It wouldn't surprise me if the NOPs only helped on earlier models, had no effect or even a negative effect on more recent CPUs, and were just left in out of tradition.
Depends on what you consider "modern". Core2 timings changed wildly based on code alignment (experienced myself and documented by others here: http://x264dev.multimedia.cx/archives/51).
Yeah, looks like it's function alignment padding. It's pretty common to pad the end of a function so that the next function starts on a specific boundary (even if the first function doesn't fall through into the next).
I haven't tested, but I'd bet good money that 12 NOPs would be faster than a jmp.
You can do an unconditional jump every 1 or 2 cycles, depending on the chip, whereas no chip I know of can execute more than 4 NOPs per cycle. Therefore I would say the jump is probably marginally faster than 12 NOPs.
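The back-of-the-envelope arithmetic behind that estimate can be sketched as follows. The 4-per-cycle figure and the 1 to 2 cycle jump cost are the numbers from the comment above; the function name is made up, and the model ignores branch prediction, fetch, and decode effects entirely.

```python
def nop_cycles(count: int, per_cycle: int = 4) -> int:
    """Rough lower bound on cycles to get through `count` NOP
    instructions if at most `per_cycle` can execute each cycle.

    This is a throughput-only estimate, not a measurement.
    """
    if count < 0 or per_cycle <= 0:
        raise ValueError("count must be >= 0 and per_cycle must be > 0")
    # Integer ceiling division: ceil(count / per_cycle).
    return -(-count // per_cycle)


if __name__ == "__main__":
    # 12 single-byte NOPs at 4 per cycle take at least 3 cycles,
    # versus the 1 to 2 cycles quoted for an unconditional jump.
    print(nop_cycles(12))  # prints 3
```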
Smart toolchains will turn those 12 bytes into 2 multi-byte nops, e.g., a 9-byte one and a 3-byte one.
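A minimal sketch of how such a split could work, assuming a greedy strategy and a maximum NOP length of 9 bytes (the longest entry in Intel's recommended multi-byte NOP table). Real assemblers may pick different lengths or use longer prefixed forms; the function name is hypothetical.

```python
from typing import List


def split_padding(total: int, max_nop: int = 9) -> List[int]:
    """Greedily split `total` padding bytes into NOP instruction
    lengths, each at most `max_nop` bytes long.

    The 9-byte cap is an assumption for illustration, not a statement
    about what any particular toolchain emits.
    """
    if total < 0 or max_nop <= 0:
        raise ValueError("total must be >= 0 and max_nop must be > 0")
    sizes = []
    while total > 0:
        size = min(total, max_nop)  # take the longest NOP that fits
        sizes.append(size)
        total -= size
    return sizes


if __name__ == "__main__":
    # 12 bytes of padding become one 9-byte NOP and one 3-byte NOP.
    print(split_padding(12))  # prints [9, 3]
```

The point is that the CPU then sees 2 instructions instead of 12, even though the padding occupies the same number of bytes.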
Loops can be implemented with JMPs, and it would be Very Bad if every iteration of a loop invalidated caches. (In fact, it would be Very Bad if just about anything common invalidated caches, given how important they are to modern CPU performance.)
I don't know what exactly you mean by that, but I'm going with no. Unconditional jumps do interact with the uop cache in recent Intel chips, but they do so by terminating the current uop cache line, which is generally desirable.