The programming model is that all threads in a warp / thread block run the same instruction (barring masking for branch divergence). SIMD instructions at the level of an individual thread are a rarity, because SIMD on a GPU is implemented across the threads of a warp rather than within each thread. Per-thread SIMD does exist, but only within 32-bit words and only for limited use cases, since the idiomatic way to do SIMD on a GPU is to have all of the threads execute the same instruction:
Note that I am using the Nvidia PTX documentation here. I have barely looked at the AMD RDNA documentation, so I cannot cite it without doing a bunch of reading.
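As a concrete illustration of that limited within-a-32-bit-word SIMD: PTX has "SIMD video" instructions such as vadd2, which add two 16-bit lanes packed into a single 32-bit register. A plain-C sketch of the same SIMD-within-a-register idea (my own illustration, not actual GPU code):

```c
#include <stdint.h>

/* SIMD-within-a-register sketch: add two pairs of 16-bit lanes packed
 * into 32-bit words, roughly what PTX's vadd2 does. Each lane wraps
 * modulo 2^16 independently. */
static uint32_t vadd2(uint32_t a, uint32_t b) {
    /* Add the low and high lanes separately so a carry out of the
     * low lane cannot spill into the high lane. */
    uint32_t lo = (a + b) & 0x0000FFFFu;          /* low 16-bit lane  */
    uint32_t hi = ((a >> 16) + (b >> 16)) << 16;  /* high 16-bit lane */
    return hi | lo;
}
```

Doing this in plain registers is exactly why it is limited: the lanes are stuck inside one 32-bit word, whereas warp-level SIMD gives you a full register per "lane" (thread).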
That does sound like it would be a pretty big limitation. But there appear to be plenty of vector instructions for 32-bit integers in RDNA2 and RDNA3 [0] [1]. They're named V_*_U32 or V_*_I32 (e.g., V_ADD3_U32), even including things like a widening multiply-add, V_MAD_U64_U32. The only thing missing is integer division, which is apparently emulated using floating-point instructions.
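For what it's worth, the usual shape of that floating-point emulation is a reciprocal estimate followed by an exact integer fix-up. A C sketch of the general technique (the function name and exact sequence are my own illustration, not AMD's actual emitted code):

```c
#include <stdint.h>

/* Sketch of emulating unsigned 32-bit division with floating-point:
 * estimate the quotient via a reciprocal, then correct any small
 * error exactly in integer arithmetic. Requires b != 0. */
static uint32_t udiv_via_float(uint32_t a, uint32_t b) {
    double r = 1.0 / (double)b;              /* reciprocal estimate   */
    uint32_t q = (uint32_t)((double)a * r);  /* approximate quotient  */
    /* Exact fix-up: nudge q until q == floor(a / b). */
    while ((uint64_t)q * b > a) q--;         /* estimate was too high */
    while (((uint64_t)q + 1) * b <= a) q++;  /* estimate was too low  */
    return q;
}
```

The fix-up loop is what makes the result bit-exact despite the rounding in the reciprocal; hardware sequences refine the estimate (e.g., Newton-Raphson) instead of looping, but the idea is the same.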
https://docs.nvidia.com/cuda/parallel-thread-execution/index...