And even on platforms that allow unaligned access, isn't there usually a perform...

bobmcnamara · on July 4, 2024

I don't often deal with CPU-coupled caches, but I could see a system where that happens.

The Cortex-M4, which has no integrated caches, splits each 32-bit load into 1 to 3 aligned loads of 8 to 32 bits each, depending on address % 4. The Cortex-M7 performs each 32-bit load as either one or two aligned 32-bit loads, again depending on address % 4.
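
To make the address % 4 dependence concrete, here is a rough C model of the split counts. It is only a sketch: the function names are made up, and the greedy "largest naturally aligned piece first" rule in the M4 model is my assumption about how the split could work, not something taken from Arm documentation.

```c
#include <stdint.h>

/* Hypothetical M4-style model: count the naturally aligned accesses
 * (1, 2 or 4 bytes) needed to cover a 4-byte load starting at addr,
 * greedily taking the largest aligned piece that still fits.
 * ASSUMPTION: the greedy rule is illustrative, not documented behavior. */
static unsigned m4_split_count(uintptr_t addr)
{
    unsigned remaining = 4;
    unsigned count = 0;

    while (remaining > 0) {
        unsigned size;
        if (addr % 4 == 0 && remaining >= 4)
            size = 4;
        else if (addr % 2 == 0 && remaining >= 2)
            size = 2;
        else
            size = 1;
        addr += size;
        remaining -= size;
        count++;
    }
    return count;
}

/* Hypothetical M7-style model: one aligned 32-bit load if the address is
 * aligned, otherwise the two aligned words that straddle it. */
static unsigned m7_split_count(uintptr_t addr)
{
    return (addr % 4 == 0) ? 1u : 2u;
}
```

Under this model the M4 count is 1, 3, 2, 3 for address % 4 of 0, 1, 2, 3, which lines up with the "1-3 aligned loads of between 8 and 32 bits" range.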

I agree. Especially in cases like this algorithm, where there is only one memory stream, it's often worth unrolling up to 3 or 7 leading accesses (enough to reach 4- or 8-byte alignment) before the big aligned loop.
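
A minimal sketch of that shape in C, using a plain byte sum as a stand-in since it also has a single memory stream. `sum_bytes` is a hypothetical example, not the algorithm from the article, and it targets 4-byte alignment, so the head takes at most 3 byte accesses.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical example: sum all bytes in p[0..n). */
uint32_t sum_bytes(const uint8_t *p, size_t n)
{
    uint32_t sum = 0;

    /* Head: at most 3 byte accesses until p is 4-byte aligned. */
    while (n > 0 && ((uintptr_t)p % 4) != 0) {
        sum += *p++;
        n--;
    }

    /* Body: the big aligned loop, one 32-bit load per iteration.
     * memcpy sidesteps strict-aliasing problems; compilers typically
     * turn it into a single word load. */
    while (n >= 4) {
        uint32_t w;
        memcpy(&w, p, sizeof w);
        sum += (w & 0xFF) + ((w >> 8) & 0xFF) + ((w >> 16) & 0xFF) + (w >> 24);
        p += 4;
        n -= 4;
    }

    /* Tail: at most 3 leftover bytes. */
    while (n > 0) {
        sum += *p++;
        n--;
    }
    return sum;
}
```

The head loop costs a few extra byte loads once, and after that every load in the body is aligned, so no load gets split.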