I don't get the point of Helion compared to alternatives like Gluon.
For best performance I would presume one needs low-level access to hardware knobs. And these kernel primitives are written once and reused. So what is the point of a DSL that dumbs things down as a wrapper around Triton?
Funny, I feel the same way about Triton. Performant Triton looks like CUDA (but with tiles!) except it's ten times harder to debug since it doesn't have the tooling NVIDIA provides.
If I had to run on AMD I'd rather deal with their hipify tooling.
Performant Triton programs are usually simpler and shorter than their CUDA equivalents. This alone makes them easier to write, and I would argue that it helps with debugging too, because the programming model provides a lot more guarantees about how your code executes. That said, some of the tooling is notably poor (such as cuda-gdb support).
Agree on shorter, disagree on simpler. The hard part of understanding GPU code is knowing why the algorithms are the way they are. For example, why we do a split-k decomposition in a matrix multiplication, or why we load this particular data into shared memory at this particular time, with some overlapping subset going into registers.
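To make the split-k example concrete, here is a minimal pure-Python sketch, not real GPU code. The function names and the `splits` parameter are made up for illustration. The idea is that when M and N are small and K is large, output tiles alone don't give you enough independent work to fill the GPU, so you also cut the K dimension into slices, compute a partial product per slice, and reduce the partials at the end.

```python
def matmul(a, b):
    """Reference matmul on lists of lists: C[i][j] = sum over p of A[i][p] * B[p][j]."""
    m, k, n = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]


def split_k_matmul(a, b, splits):
    """Same result, but the K dimension is cut into `splits` slices (hypothetical helper)."""
    m, k, n = len(a), len(b), len(b[0])
    chunk = (k + splits - 1) // splits  # ceil(k / splits)
    partials = []
    for s in range(splits):
        lo, hi = s * chunk, min((s + 1) * chunk, k)
        # On a GPU, each K slice would go to a different thread block,
        # which is where the extra parallelism comes from.
        partials.append([[sum(a[i][p] * b[p][j] for p in range(lo, hi))
                          for j in range(n)] for i in range(m)])
    # Final reduction over the per-slice partial products.
    return [[sum(part[i][j] for part in partials) for j in range(n)]
            for i in range(m)]


a = [[1, 2, 3, 4], [5, 6, 7, 8]]
b = [[1, 0], [0, 1], [1, 1], [2, 3]]
assert split_k_matmul(a, b, splits=2) == matmul(a, b)
```

The code is the easy part. Knowing when the extra reduction pass pays for itself is the part no syntax helps with.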
Getting rid of the for loop over an array index doesn't make it easier to understand the hard parts. Losing the developer perf and debug tooling is absolutely not worth the tradeoff.
For me, I'd rather deal with JAX or Numba, and if that still wasn't enough, I would jump straight to CUDA.
It's possible I'm an old fogey with bias, though. It's true that I've spent a lot more time with CUDA than with the new DSLs on the block.
I don’t think it is possible to write high performance code without understanding how the hardware works. I just think staring at code that coalesces your loads or swizzles your layouts for the hundredth time is a waste of screen space. Let the compiler do it, and when it gets it wrong you can bust out the explicit code you were going to write in CUDA anyway.
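For anyone who hasn't hit the coalescing issue, a toy model in Python shows what is at stake. The numbers are assumptions for the sketch (a 32-thread warp, 4-byte elements, 128-byte memory segments) and the function name is made up; real hardware is more nuanced than this.

```python
def segments_touched(stride, threads=32, elem_bytes=4, segment_bytes=128):
    """Count distinct memory segments one warp touches when thread t reads element t * stride.

    Simplified model: assumes a 32-thread warp and 128-byte segments.
    """
    return len({(t * stride * elem_bytes) // segment_bytes for t in range(threads)})


# Adjacent threads read adjacent floats: the whole warp fits in one segment.
assert segments_touched(stride=1) == 1
# Each thread reads 128 bytes apart: every thread lands in its own segment.
assert segments_touched(stride=32) == 32
```

Under this model the strided pattern moves 32 times as much memory for the same useful data, which is exactly the kind of bookkeeping I'd rather the compiler handled by default.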
What's the point of Triton compared to Gluon? What's the point of PyTorch compared to Triton?
One of the main values of Triton is that it significantly expanded the scope of folks who can write kernels - I think Helion could expand the scope even more.
For best performance I would presume one needs low-level access to hardware knobs. And these kernel primitives are written once and reused. So what is the point of a DSL that dumbs things down as a wrapper around Triton?