GPUs do deal with memory in ways other than just parallelism! That's why GPUs tend to offer huge register files (up to 256 architectural registers per thread on RDNA1!) and local memories (up to 64KB of LDS per workgroup on RDNA1). This means lots of work can be done purely in registers and LDS, and trips to global memory are far rarer than on CPUs, where global memory contains everything except the 16 or so architectural registers.
Even then, global memory is an issue. And not just because of latency, but also bandwidth! That's why RDNA2 and Ada both added a ton of last-level cache, not to better hide latency (although it's a welcome addition), but mostly as a 'bandwidth amplifier'.
I think register files, shared memory and even texture memory patterns are basically the GPU's way of having fully programmable caches, as opposed to cache being a grey box like on CPUs. Overall I think for HPC code it works out better. Looking at vectorized and cache-optimized CPU code gives me the shivers...
Mostly due to the above, I think Intel's years-long push to have people just run their CPU-optimized OpenMP codes on Xeon Phi was the downfall of that architecture - it just couldn't win against a data-parallel-first framework like CUDA, mainly because there the data-parallel compute thread can be programmed as exactly that, rather than trying to coax the compiler into behaving exactly as you want (and in the end giving up and dropping down to assembly-like SSE instructions).
I wonder why Intel didn't take the OpenCL route farther. Even plain-vanilla, and certainly with some vendor extensions, it could have made it quite easy for programmers to get the hardware to play nicely.
Also, on an unrelated note: IIANM I believe Intel now sorta-kinda offers the ability to set some cache aside to be close to fully programmable. But I forgot exactly what the mechanism for that is called.
You mean for CPUs? Check out ISPC, it's essentially a lightweight data-parallel C for CPUs (that has been ported to Intel GPUs too!), and I think it also works fine on some non-x86 architectures. I'm not sure if there are any mature CPU OpenCL runtimes.
Intel also still supports OpenCL as a first-class target on their GPUs, which is more than I can say for Nvidia or AMD.
OpenCL was a decent effort, but without Nvidia on board it was IMO too little too late - by the time it became a thing, Nvidia was already entrenched in the accelerator-compute software space. It was even worse that Intel/AMD didn't offer a good story for also targeting CPU vector units with it - that would have created a strong enough ecosystem to target.