GPUs do deal with memory in ways other than just parallelism! That's why GPUs tend to offer huge register files (up to 256 architectural registers per thread on RDNA1!) and local memories (up to 64KB of LDS per workgroup on RDNA1). This means lots of work can be done purely in registers and LDS, and trips to global memory are far rarer than on CPUs, where global memory contains everything except the 16 or so architectural registers.
Even then, global memory is an issue. And not just because of latency, but also bandwidth! That's why RDNA2 and Ada both added a ton of last-level cache, not to better hide latency (although it's a welcome addition), but mostly as a 'bandwidth amplifier'.
I think register files, shared memory and even texture memory patterns are basically the GPU's way of having fully programmable caches, as opposed to cache being a grey box like on CPUs. Overall I think for HPC code it works out better. Looking at vectorized and cache-optimized CPU code gives me the shivers...
Mostly due to the above, I think Intel's years-long push to have people just run their CPU-optimized OpenMP codes on Xeon Phi was the downfall of that architecture - it just couldn't win against a data-parallel-first framework like CUDA, mainly because there the data-parallel compute thread can be programmed as exactly that, rather than trying to coax the compiler into behaving exactly as you want (and in the end giving up and dropping down to assembly-like SSE instructions).
I wonder why Intel didn't take the OpenCL route farther. Even plain-vanilla, and certainly with some vendor extensions, it could have made it quite easy for programmers to get the hardware to play nicely.
Also, on an unrelated note: IIANM I believe Intel now sorta-kinda offers the ability to set some cache aside to be close to fully programmable. But I forgot exactly what the mechanism for that is called.
You mean for CPUs? Check out ISPC, it's essentially a lightweight data-parallel C for CPUs (that has been ported to Intel GPUs too!), and I think it also works fine on some non-x86 architectures. I'm not sure if there are any mature CPU OpenCL runtimes.
Intel also still supports OpenCL as a first-class target on their GPUs, which is more than I can say for Nvidia or AMD.
OpenCL was a decent effort, but without Nvidia on board it was IMO too little too late - by the time it became a thing, Nvidia was already entrenched in the accelerator-compute software space. It was even worse that Intel/AMD didn't offer a good story for also targeting CPU vector units with it - that would have created a strong enough ecosystem to target.