
And even on platforms that allow unaligned access, isn't there usually a performance penalty when the access straddles two cache lines? Still best avoided if possible.
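
To make "straddles two cache lines" concrete, here is a minimal sketch of the check. The 64-byte line size and the names `CACHE_LINE_BYTES` and `straddles_cache_line` are assumptions for illustration; the real line size depends on the CPU.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed line size; common on current x86 and many ARM cores, but not universal. */
#define CACHE_LINE_BYTES 64u

/* Hypothetical helper: true if the access [addr, addr + size) touches two
 * different cache lines, i.e. its first and last byte fall in different lines. */
bool straddles_cache_line(uintptr_t addr, size_t size)
{
    if (size == 0) {
        return false; /* an empty access touches no line at all */
    }
    return (addr / CACHE_LINE_BYTES) != ((addr + size - 1) / CACHE_LINE_BYTES);
}
```

With these assumptions, a 4-byte load at address 62 covers bytes 62..65 and straddles, while the same load at address 60 stays inside one line. A naturally aligned load whose size is a power of two no larger than the line can never straddle.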


I don't often deal with CPU-coupled caches, but I could see a system where that happens.

Cortex-M4, which doesn't have integrated caches, breaks each 32-bit load into 1-3 aligned loads of between 8 and 32 bits, depending on address % 4. Cortex-M7 performs each 32-bit load as either one or two aligned 32-bit loads, also depending on address % 4.

I agree - especially in cases like this algorithm, where there is only one memory stream, it's often worth unrolling up to 3 or 7 accesses before the big aligned loop.
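
A rough sketch of that pattern, assuming a single input stream of bytes and a 4-byte alignment target (so at most 3 leading byte accesses; aligning to 8 bytes would allow up to 7). The function name `sum_bytes` and the byte-sum workload are made up for illustration and are not the algorithm under discussion.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical example: sum all bytes in buf[0..len). */
uint32_t sum_bytes(const uint8_t *buf, size_t len)
{
    uint32_t sum = 0;

    /* Head: 0-3 single-byte accesses until buf is 4-byte aligned. */
    while (len > 0 && ((uintptr_t)buf & 3u) != 0) {
        sum += *buf++;
        len--;
    }

    /* Body: every load in this loop is from a 4-byte-aligned address. */
    while (len >= 4) {
        uint32_t w;
        memcpy(&w, buf, sizeof w); /* compilers typically lower this to a single load */
        /* Adding the four bytes gives the same result on either endianness. */
        sum += (w & 0xFFu) + ((w >> 8) & 0xFFu) + ((w >> 16) & 0xFFu) + (w >> 24);
        buf += 4;
        len -= 4;
    }

    /* Tail: 0-3 leftover bytes. */
    while (len > 0) {
        sum += *buf++;
        len--;
    }
    return sum;
}
```

This works cleanly because there is only one stream to align. With two streams at different offsets, aligning one generally leaves the other unaligned, so the trade-off is less clear.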



