If I could choose, I would like everything to run at the max turbo frequency all the time, yeah.
Still, and despite writing this post, which will make a lot of people express something similar to what you wrote, I consider myself an AVX-512 fan, not the other way around. It's the most important ISA extension in, well, I'm not sure how long: a long time (probably AVX and AVX2 combined would have a similar impact).
It introduces a whole ton of very powerful stuff: full-width shuffles down to byte granularity with awesome performance, masking of nearly every operation (often for free), compress and expand operations, and a longer list at [1]. And that's only from an integer angle (what I care about).
Yeah, it's taken AVX-512 a while to get traction (the fact that generation after generation of new chips have just been Skylake client derivatives with no AVX-512 hasn't helped), but I hope we are reaching a turning point.
These transitions are something you have to deal with if you want max performance, and I think we'll come up with better models for how to make the "global" decision of whether you should be using AVX-512.
The never-ending Skylake is/was a real problem. Intel was slowly adding features in a manner where it made sense to target the last n generations, but then all that came to a perpetual stop, and suddenly we have this new extension that you can only really use on the very latest and most expensive chips, with virtually no backwards compatibility.
The instructions are sufficiently different from AVX2 that any appropriate use is not as simple as sticking it behind a gate and falling back to a smaller block size; it basically requires a completely separate (re)write to take proper advantage of.
> The instructions are sufficiently different from AVX2 that any appropriate use is not as simple as sticking it behind a gate and falling back to a smaller block size; it basically requires a completely separate (re)write to take proper advantage of.
I'd say yeah, you often need a rewrite of the core loop to take full advantage, but you can still more or less write AVX-style code in AVX-512 if you want, and take advantage of the width increase.
The main difference, I think, for most code is the way the comparison operators now compare into a mask register. It would have been nice if they had also extended the existing compare-into-SIMD-register (0/-1 result) instructions, to ease porting.
> it basically requires a completely separate (re)write to properly take advantage of.
Why? At a higher level of abstraction, you can dispatch SIMD instructions at the max width available. At least, that's how I work with vectorized code. Still see gains on AVX-512.
---
[1] https://branchfree.org/2019/05/29/why-ice-lake-is-important-...