Wow, thanks for pointing this out. I just tried it out for myself, and indeed "rep movsb" is consistently (and sometimes significantly) faster than the standard C memcpy for aligned copies larger than 16 KB or so on my Intel Core i5 (for unaligned copies, it seems to be on par). For smaller sizes it is slightly slower or on par. There is no noticeable difference between "rep movsb" and "rep movsq".
Apparently, libc hasn't caught up to those micro-architecture changes yet :/
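For anyone who wants to try the same comparison, this is roughly the kind of wrapper I mean. It's a minimal sketch, assuming x86-64 and GCC or Clang inline asm; the name copy_rep_movsb is just something I made up, and it falls back to plain memcpy on other targets.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of any libc): copy n bytes using "rep movsb".
 * Assumes x86-64 with GCC/Clang extended inline asm; elsewhere it simply
 * calls memcpy so the code still compiles. Like memcpy, it assumes the
 * source and destination regions do not overlap. */
static void *copy_rep_movsb(void *dst, const void *src, size_t n)
{
    void *ret = dst;
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    /* RDI = destination, RSI = source, RCX = byte count.
     * All three are modified by the instruction, hence the "+" constraints.
     * The ABI guarantees the direction flag is clear on function entry,
     * so the copy runs forward. */
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
#else
    memcpy(dst, src, n);
#endif
    return ret;
}
```

Time it against memcpy on aligned buffers well above 16 KB, and repeat each copy enough times to get stable numbers. Your results will depend on the CPU and the libc version.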
glibc is generally pretty good, but does lag commercial libc implementations somewhat when it comes to microarchitectural optimization. It's also not unheard of for Linux distros to include rather old versions of glibc. I have no idea if that's the problem in your case, but it's worth checking that you have the latest.