Funny, I feel the same way about Triton. Performant Triton looks like CUDA (but with tiles!) except it's ten times harder to debug since it doesn't have the tooling NVIDIA provides.
If I had to run on AMD I'd rather deal with their hipify tooling.
Performant Triton programs are usually simpler and shorter than their CUDA equivalents. That alone makes them easier to write, and I would argue it helps with debugging too, because the programming model gives you far more guarantees about how your code executes. That said, some of the tooling is notably poor (cuda-gdb support, for example).
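To make "shorter" concrete, this is roughly what an entire Triton kernel looks like (a minimal sketch along the lines of the standard vector-add tutorial; the names and BLOCK_SIZE are just illustrative):

```python
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one tile of BLOCK_SIZE elements.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard the ragged tail

    # Tile-level loads and stores; the compiler chooses the vectorization
    # and coalescing strategy. There is no per-thread index arithmetic.
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)
```

The launch is a one-liner from Python, something like add_kernel[(triton.cdiv(n, 1024),)](x, y, out, n, BLOCK_SIZE=1024), and that is essentially the whole program.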
Agree on shorter, disagree on simpler. The hard part of understanding GPU code is knowing why the algorithms are structured the way they are: for example, why we do a split-k decomposition for a matrix multiplication, or why we load this particular data into shared memory at this particular time, with some overlapping subset kept in registers.
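To spell out what I mean by split-k, here is a rough NumPy sketch of just the decomposition, not a GPU implementation (num_splits is an illustrative parameter):

```python
import numpy as np

def matmul_split_k(A, B, num_splits=4):
    # A: (M, K), B: (K, N). Split the reduction (K) dimension into
    # num_splits chunks; each chunk yields a partial (M, N) product.
    # On a GPU, each (output tile, K-chunk) pair gets its own block,
    # which is how split-k recovers parallelism when there are too
    # few output tiles to fill the machine.
    K = A.shape[1]
    chunk = (K + num_splits - 1) // num_splits
    partial = np.zeros((A.shape[0], B.shape[1]), dtype=np.result_type(A, B))
    for k0 in range(0, K, chunk):
        partial += A[:, k0:k0 + chunk] @ B[k0:k0 + chunk, :]
    # In a real split-k matmul this final accumulation is a second
    # reduction kernel (or atomics); here it is just a running sum.
    return partial
```

Knowing when that extra reduction pays for itself is the part no DSL writes for you.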
Getting rid of the for loop over an array index doesn't make the hard parts any easier to understand, and losing the developer performance and debugging tooling is absolutely not worth the tradeoff.
For me, I'd rather deal with JAX or Numba, and if that still wasn't enough, I would jump straight to CUDA.
It's possible I'm an old fogey with bias, though. It's true that I've spent a lot more time with CUDA than with the new DSLs on the block.
I don't think it's possible to write high-performance code without understanding how the hardware works. I just think that staring at code that coalesces your loads or swizzles your layouts for the hundredth time is a waste of screen space. Let the compiler do it, and when it gets it wrong, you can bust out the explicit code you were going to write in CUDA anyway.
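Concretely, in Triton a tiled load is written against the logical layout, and the compiler decides the vectorization, coalescing, and any shared-memory swizzling. A rough sketch (the strides and block sizes are illustrative):

```python
import triton
import triton.language as tl

@triton.jit
def copy_tile(in_ptr, out_ptr, M, N, stride_m, stride_n,
              BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    # The tile is described by its logical strides; how the loads get
    # vectorized and coalesced (and whether shared memory is swizzled)
    # is the compiler's problem, not spelled out in the kernel.
    ptrs = in_ptr + rm[:, None] * stride_m + rn[None, :] * stride_n
    mask = (rm[:, None] < M) & (rn[None, :] < N)
    tile = tl.load(ptrs, mask=mask)
    tl.store(out_ptr + rm[:, None] * stride_m + rn[None, :] * stride_n,
             tile, mask=mask)
```

When the generated code is slower than what I'd write by hand, that's exactly the point where I drop down to explicit CUDA.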