8 words to an L1 line on the original iPhone ARM, so yes, you'd fit it into a smaller table. You'll still face the memory latency issue if you're using this outside a tight loop. (Cache is only 4-way set associative. But at least I & D cache are separate)
But really, if you spend that much time thinking about the performance, you really shouldn't have that abstracted into a function. Calling that costs cycles and blows out one entry in your return stack - which makes a more costly mispredict later on likely.
Why yes, yes I do miss fiddling around with low level details. :)
If you're talking to a dev with a background in firmware dev, I'd say you'd be hard pressed to find any body who'll put a single asm instruction into a separate function if it's speed-critical.
Sure, the compiler might (and probably will) inline, but it means any change in your tool chain or a whim of the inline heuristic can cause serious performance regressions that are completely avoidable.
When it comes to performance, the embedded mindset will always be "belt and suspenders" ;)
But really, if you spend that much time thinking about the performance, you really shouldn't have that abstracted into a function. Calling that costs cycles and blows out one entry in your return stack - which makes a more costly mispredict later on likely.
Why yes, yes I do miss fiddling around with low level details. :)