EDIT: Blergh, confused LOOP with REP, but keeping the below comment so the rest of the thread still makes sense.
FWIW LOOP isn't the worst thing in the world once you have dedicated silicon for it anyway, generating micro-ops in the instruction decode pathway. It's just a pretty cute run-length encoding scheme for the instruction stream.
It's slow as sin, though. Just emulating it with more common instructions (dec/jnz) is something like 4x faster on most modern Intel CPUs. For some insane reason, it emits 8 uops on Skylake.
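For anyone wondering what "emulating it" means here, this is a rough C model of the architectural behaviour of `loop label` next to the usual `dec rcx` / `jnz label` replacement. It is only a sketch of the semantics and says nothing about timing or uop counts. The names `cpu_state`, `step_loop`, `step_dec_jnz` and `count_iterations` are made up for illustration, and only ZF is modelled out of the flags that `dec` writes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, minimal register model: just the counter and the zero flag. */
typedef struct {
    uint64_t rcx; /* loop counter (LOOP uses RCX in 64-bit mode) */
    bool zf;      /* zero flag; the only flag modelled here */
} cpu_state;

/* loop label: decrement RCX, branch if the result is nonzero.
   LOOP does not modify any flags. Returns true if the branch is taken. */
bool step_loop(cpu_state *s) {
    s->rcx -= 1;
    return s->rcx != 0;
}

/* dec rcx ; jnz label: DEC writes ZF (among other flags),
   JNZ branches while ZF is clear. Returns true if the branch is taken. */
bool step_dec_jnz(cpu_state *s) {
    s->rcx -= 1;
    s->zf = (s->rcx == 0);
    return !s->zf;
}

/* Run an empty loop body until the step function falls through,
   and report how many times the body would have executed. */
uint64_t count_iterations(cpu_state *s, bool (*step)(cpu_state *)) {
    assert(s && step);
    uint64_t n = 0;
    do {
        n++; /* loop body would go here */
    } while (step(s));
    return n;
}
```

Both variants run the body the same number of times for a nonzero starting count. The visible difference is that the dec/jnz version clobbers flags, which only matters if the loop body needs them preserved across the branch. A starting count of 0 wraps around in both cases, same as on real hardware, so don't try that one at home.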
There is a reason why LOOP is slow: back in the 90s it was deliberately kept slow because software used it for timing loops. Making it faster would have broken existing software.
"IIRC LOOP was used in some software for timing loops; there was (important) software that did not work on CPUs where LOOP was too fast (this was in the early 90s or so). So CPU makers learned to make LOOP slow."
"(My opinion: Intel is probably still making it slow on purpose, and hasn't bothered to rewrite their microcode for it for a long time. Modern CPUs are probably too fast for anything using loop in a naive way to work correctly.)"
It's also very fast on AMD (not any slower than the equivalent dec/jnz), so use it if you want your software to run faster on AMD and slower on Intel...
Sure, it doesn't matter anymore because anyone who cares is going through the vector unit to do bulk transfers. But there were issues with memory transfers that had an unaligned base or length for the longest time, well through x86_64's original design.