And even on platforms that allow unaligned access, isn't there usually a performance penalty when the access straddles two cache lines? Still best avoided if possible.
I don't often deal with CPU-coupled caches, but I could see a system where that happens.
Cortex-M4, which doesn't have integrated caches, breaks each unaligned 32-bit load into up to three aligned loads of between 8 and 32 bits, depending on address % 4. Cortex-M7 performs each 32-bit load as either one or two aligned 32-bit loads, again depending on address % 4.
I agree. Especially in cases like this algorithm, where there is only one memory stream, it's often worth peeling off up to 3 or 7 leading accesses (for 4-byte or 8-byte alignment, respectively) before the big aligned loop.
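To make that concrete, here is a minimal sketch of the pattern for the 4-byte case. It is not the algorithm from this thread: `sum_bytes` is a hypothetical stand-in that just adds up every byte in a buffer, chosen because the result doesn't depend on endianness. The prologue does at most 3 byte accesses to reach a 4-byte boundary, the main loop then only issues aligned word loads, and the epilogue picks up at most 3 trailing bytes.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical example: sum every byte of a buffer using aligned 32-bit loads. */
uint32_t sum_bytes(const uint8_t *p, size_t n)
{
    uint32_t sum = 0;

    /* Prologue: at most 3 byte accesses until p is 4-byte aligned. */
    while (n > 0 && ((uintptr_t)p & 3u) != 0) {
        sum += *p++;
        n--;
    }

    /* Main loop: p is 4-byte aligned here, so each word is one aligned load. */
    while (n >= 4) {
        uint32_t w;
        /* memcpy avoids strict-aliasing problems; compilers typically
           lower a fixed 4-byte copy to a single load instruction. */
        memcpy(&w, p, sizeof w);
        sum += (w & 0xFFu) + ((w >> 8) & 0xFFu) + ((w >> 16) & 0xFFu) + (w >> 24);
        p += 4;
        n -= 4;
    }

    /* Epilogue: at most 3 trailing bytes. */
    while (n > 0) {
        sum += *p++;
        n--;
    }

    return sum;
}
```

For 8-byte alignment the same shape applies with a mask of 7 and `uint64_t` words, which is where the "up to 7" comes from. With two or more memory streams this gets harder, since aligning one pointer doesn't generally align the others.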