I used to write software rasterizers in a past life. AVX-512 is a straight-up, observed 8x performance gain over SSE: a well-written software rasterizer is a very clever thread-scheduling algorithm wrapped around long lists of FMA-load/store sequences.
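To make the "long lists of FMA sequences" concrete, here's a minimal scalar sketch of a rasterizer's inner loop (hypothetical names, not code from this thread): evaluating one triangle edge function E(x, y) = A·x + B·y + C across a row of pixels. Each iteration is an independent multiply-add, which is exactly the shape a compiler or hand-written SIMD code turns into 4-wide (SSE), 8-wide (AVX), or 16-wide (AVX-512) FMA ops.

```c
/* Evaluate an edge function A*x + B*y + C across n pixels of one row,
 * starting at x0. A rasterizer runs three of these per triangle and
 * ANDs the sign bits to get a coverage mask; each lane is one pixel. */
static void edge_row(float A, float B, float C,
                     float y, float x0, int n, float *out)
{
    for (int i = 0; i < n; i++)
        out[i] = A * (x0 + (float)i) + B * y + C;  /* one FMA per lane */
}
```

With AVX-512 this loop processes 16 pixels per instruction instead of 4 with SSE, which is where the observed speedup on ALU-bound rasterization comes from.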
I know that ray-tracers, databases, and other 'big data' workloads all benefit comparably from AVX-512. The IO-request depth on these new parts is such that, for ALU-heavy work with a good IO spread, memory latency is completely hidden---fetch time is still ~250-300 cycles, but you just no longer see it.
The only thing I miss is 'addsets', which was dropped from LRBni; that instruction was 'the rasterizer function', and it now takes a fairly involved sequence to replicate in AVX-512.
Pining for other lost things: if we had the LRBni up- and down-convert instructions, a software texture sampler would be a lot more feasible.
You only get those performance gains if you've got the cache and memory bandwidth to feed the wider units - which is often lacking.
E.g. AVX (8-wide) was added with Sandy Bridge, but it wasn't all that usable until Intel doubled the cache bandwidth with Ivy Bridge.
With ray tracing, the performance increase going from SSE (4-wide) to AVX (8-wide) for BVH intersection was only ~25-30%, instead of the theoretical 100%. You're generally limited by memory bandwidth.
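A minimal scalar sketch of why BVH traversal stops scaling with SIMD width (hypothetical names, assuming the standard slab test - not code from this thread): a packet tracer runs several of these tests in lockstep, but every step immediately needs the next node's bounds from memory, so going 8-wide mostly widens the ALU work while the loads stay the bottleneck.

```c
#include <stdbool.h>

typedef struct { float org[3], inv_dir[3]; } Ray;   /* inv_dir = 1/dir */
typedef struct { float lo[3], hi[3]; } AABB;

/* Ray-vs-AABB slab test: intersect the ray's parametric interval with
 * each axis slab. Cheap ALU work; the expensive part in a real tracer
 * is fetching b (the BVH node) from memory on every traversal step. */
static bool ray_aabb(const Ray *r, const AABB *b, float tmax)
{
    float tmin = 0.0f;
    for (int a = 0; a < 3; a++) {
        float t0 = (b->lo[a] - r->org[a]) * r->inv_dir[a];
        float t1 = (b->hi[a] - r->org[a]) * r->inv_dir[a];
        if (t0 > t1) { float tmp = t0; t0 = t1; t1 = tmp; }
        if (t0 > tmin) tmin = t0;
        if (t1 < tmax) tmax = t1;
        if (tmin > tmax) return false;
    }
    return true;
}
```

Vectorizing this across 8 rays makes the arithmetic ~8x faster, but the per-node memory traffic is unchanged - hence gains in the 25-30% range rather than 100%.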
Wider SIMD units may provide a linear performance increase for many workloads, but they have a super-linear impact on the up front cost of the chip, especially when you take into account the opportunity cost of those transistors: they could have been dedicated to something that might have also helped non-numerical workloads. Wide SIMD is great to have, but it doesn't come free, or else we wouldn't have GPUs.
Yes and no. If you write the assembly yourself you can use them, but the compiler's code generation likely won't (though I haven't checked for ~2 months).
Most things that can be vectorized will be placed in SSE rather than AVX. Also, GCC generally sucks at optimizing for SSE, or at determining when code should use SSE as opposed to the standard registers.
Generally speaking, the LLVM backend does better SSE and vectorization code generation, though some think it uses SSE too much/incorrectly.
So, TL;DR: no.
The new wide registers are VERY new. They are barely supported; my processor only has AVX2, and one of these days I still have to set up perf properly, because all of its fault codes aren't properly baked into the kernel yet (as of 3.17).
(Sorry for the lack of references)
Also, the biggest drawback of AVX is that the registers don't hold their state across context switches :/