I used to write software rasterizers in a past life. AVX-512 is a straight-up, observed 8x performance gain over SSE: a well-written software rasterizer is a very clever thread-scheduling algorithm wrapped around long lists of FMA-load/store sequences.
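To make the "long lists of FMA sequences" concrete, here's a minimal scalar sketch of a rasterizer's inner loop (hypothetical names, not code from this thread): evaluating one triangle edge function E(x, y) = A·x + B·y + C across a row of pixels. Each iteration is an independent multiply-add, which is exactly the shape a compiler or hand-written SIMD code turns into 4-wide (SSE), 8-wide (AVX), or 16-wide (AVX-512) FMA ops.

```c
/* Evaluate an edge function A*x + B*y + C across n pixels of one row,
 * starting at x0. A rasterizer runs three of these per triangle and
 * ANDs the sign bits to get a coverage mask; each lane is one pixel. */
static void edge_row(float A, float B, float C,
                     float y, float x0, int n, float *out)
{
    for (int i = 0; i < n; i++)
        out[i] = A * (x0 + (float)i) + B * y + C;  /* one FMA per lane */
}
```

With AVX-512 this loop processes 16 pixels per instruction instead of 4 with SSE, which is where the observed speedup on ALU-bound rasterization comes from.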
I know that ray-tracers, databases, and other 'big data' workloads all benefit comparably from AVX-512. The IO-request depth on these new parts is such that, for ALU-heavy work with a good IO spread, memory latency is completely hidden---fetch time is still ~250-300 cycles, but you just no longer see it.
The only thing I miss is 'addsets', which was dropped from LRBni; that instruction was 'the rasterizer function', and it now takes a fairly involved sequence to replicate in AVX-512.
Pining for other lost things: if we had the LRBni up- and down-convert instructions, a software texture sampler would be a lot more feasible.
You only get those performance gains if you've got the cache and memory bandwidth to feed the wider units - which is often lacking.
E.g. AVX (8-wide) was added with Sandy Bridge, but it wasn't all that usable until Intel doubled the cache bandwidth with Ivy Bridge.
With ray tracing, the performance increase going from SSE (4-wide) to AVX (8-wide) for BVH intersection was only ~25-30%, instead of the theoretical 100%. You're generally limited by memory bandwidth.
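A minimal scalar sketch of why BVH traversal stops scaling with SIMD width (hypothetical names, assuming the standard slab test - not code from this thread): a packet tracer runs several of these tests in lockstep, but every step immediately needs the next node's bounds from memory, so going 8-wide mostly widens the ALU work while the loads stay the bottleneck.

```c
#include <stdbool.h>

typedef struct { float org[3], inv_dir[3]; } Ray;   /* inv_dir = 1/dir */
typedef struct { float lo[3], hi[3]; } AABB;

/* Ray-vs-AABB slab test: intersect the ray's parametric interval with
 * each axis slab. Cheap ALU work; the expensive part in a real tracer
 * is fetching b (the BVH node) from memory on every traversal step. */
static bool ray_aabb(const Ray *r, const AABB *b, float tmax)
{
    float tmin = 0.0f;
    for (int a = 0; a < 3; a++) {
        float t0 = (b->lo[a] - r->org[a]) * r->inv_dir[a];
        float t1 = (b->hi[a] - r->org[a]) * r->inv_dir[a];
        if (t0 > t1) { float tmp = t0; t0 = t1; t1 = tmp; }
        if (t0 > tmin) tmin = t0;
        if (t1 < tmax) tmax = t1;
        if (tmin > tmax) return false;
    }
    return true;
}
```

Vectorizing this across 8 rays makes the arithmetic ~8x faster, but the per-node memory traffic is unchanged - hence gains in the 25-30% range rather than 100%.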
Wider SIMD units may provide a linear performance increase for many workloads, but they have a super-linear impact on the up front cost of the chip, especially when you take into account the opportunity cost of those transistors: they could have been dedicated to something that might have also helped non-numerical workloads. Wide SIMD is great to have, but it doesn't come free, or else we wouldn't have GPUs.
Yes and no. If you write the assembly yourself you can use them, but the compiler's code generation likely won't (though I haven't checked for ~2 months).
Most things that can be vectorized will be placed in SSE rather than AVX. Also, GCC generally sucks at optimizing for SSE, or at determining when code should use SSE as opposed to the standard registers.
Generally speaking, the LLVM backend does better SSE and vectorization code generation, though some think it uses SSE too much/incorrectly.
So, TL;DR: no.
The new wide registers are VERY new. They are barely supported; my processor only has AVX2, and one of these days I still have to set up perf properly, because all of its fault codes aren't properly baked into the kernel yet (as of 3.17).
(Sorry for the lack of references)
Also, the biggest drawback of AVX is that the registers don't hold their state across context switches :/