The loopXX instructions do not use the CPU's LSD (Loop Stream Detector), while a cmp/jnz construct takes advantage of it. This speeds up some small loops. Also, the Intel manuals list some rules for the instructions inside a cmp/jnz loop, such as no mismatched push/pop.
My guess is that the push/pop rule comes down to virtual stack pointer update prediction latency.
To expand on that, Intel CPUs have long had a separate piece of hardware dedicated to a "virtual" stack, which speeds up push/pop instructions. If pushes and pops are not mismatched, then all stack operations can stay entirely within it, and there's no need to update the "real" stack pointer or the stack entries upon leaving the loop.
LOOP is a bit of a weird case. I've seen it benchmark both slower and faster than dec/jnz depending on the surrounding instructions.
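For anyone who wants to check this on their own machine, here is a rough sketch in C of the two loop forms side by side. It assumes GCC or Clang on x86-64 (it falls back to a plain C loop elsewhere), and the names count_with_loop, count_with_dec_jnz and seconds_for are made up for this example. It only shows the shape of such a test, not a careful benchmark: the loop body is a single add, and as noted above the surrounding instructions can change the outcome.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Runs n iterations using the LOOP instruction (counter must live in rcx).
   Returns the number of iterations actually executed. */
static uint64_t count_with_loop(uint64_t n) {
    uint64_t acc = 0;
    if (n == 0) return 0;            /* a zero counter would wrap around */
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    __asm__ volatile(
        "1:\n\t"
        "add $1, %[acc]\n\t"         /* loop body: one add */
        "loop 1b\n\t"                /* rcx -= 1; jump if rcx != 0 */
        : [acc] "+r"(acc), "+c"(n)
        :
        : "cc");
#else
    while (n--) acc++;               /* portable fallback, not LOOP */
#endif
    return acc;
}

/* Same loop, but with an explicit dec/jnz pair and any register as counter. */
static uint64_t count_with_dec_jnz(uint64_t n) {
    uint64_t acc = 0;
    if (n == 0) return 0;
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    __asm__ volatile(
        "1:\n\t"
        "add $1, %[acc]\n\t"         /* loop body: one add */
        "dec %[n]\n\t"               /* counter -= 1, sets ZF */
        "jnz 1b\n\t"                 /* jump while counter != 0 */
        : [acc] "+r"(acc), [n] "+r"(n)
        :
        : "cc");
#else
    while (n--) acc++;               /* portable fallback */
#endif
    return acc;
}

/* Times one call of fn(n) with clock() and checks the iteration count. */
static double seconds_for(uint64_t (*fn)(uint64_t), uint64_t n) {
    clock_t start = clock();
    uint64_t got = fn(n);
    clock_t end = clock();
    assert(got == n);                /* both variants must run n iterations */
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling seconds_for(count_with_loop, n) and seconds_for(count_with_dec_jnz, n) with a large n (say, a few hundred million) and comparing the two numbers gives a first impression. Repeat the runs and vary the loop body before drawing conclusions.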