The programming model is that all threads in a warp / thread block run the same instruction (barring masking for branch divergence). SIMD instructions at the level of an individual thread are a rarity, because SIMD on a GPU is implemented across the threads of a warp rather than within each thread. Per-thread SIMD does exist, but only within 32-bit words and only for limited use cases, since the idiomatic way to do SIMD on a GPU is to have all of the threads execute the same instruction:
Note that I am using the Nvidia PTX documentation here. I have barely looked at the AMD RDNA documentation, so I cannot cite it without doing a bunch of reading.
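As a concrete illustration of that limited within-a-32-bit-word SIMD: PTX has "SIMD video" instructions such as vadd2, which add two 16-bit lanes packed into a single 32-bit register. A plain-C sketch of the same SIMD-within-a-register idea (my own illustration, not actual GPU code):

```c
#include <stdint.h>

/* SIMD-within-a-register sketch: add two pairs of 16-bit lanes packed
 * into 32-bit words, roughly what PTX's vadd2 does. Each lane wraps
 * modulo 2^16 independently. */
static uint32_t vadd2(uint32_t a, uint32_t b) {
    /* Add the low and high lanes separately so a carry out of the
     * low lane cannot spill into the high lane. */
    uint32_t lo = (a + b) & 0x0000FFFFu;          /* low 16-bit lane  */
    uint32_t hi = ((a >> 16) + (b >> 16)) << 16;  /* high 16-bit lane */
    return hi | lo;
}
```

Doing this in plain registers is exactly why it is limited: the lanes are stuck inside one 32-bit word, whereas warp-level SIMD gives you a full register per "lane" (thread).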
That does sound like it would be a pretty big limitation. But there appear to be plenty of vector instructions for 32-bit integers in RDNA2 and RDNA3 [0] [1]. They're named V_*_U32 or V_*_I32 (e.g., V_ADD3_U32), even including things like a widening multiply-add, V_MAD_U64_U32. The only thing missing is integer division, which is apparently emulated using floating-point instructions.
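For what it's worth, the usual shape of that floating-point emulation is a reciprocal estimate followed by an exact integer fix-up. A C sketch of the general technique (the function name and exact sequence are my own illustration, not AMD's actual emitted code):

```c
#include <stdint.h>

/* Sketch of emulating unsigned 32-bit division with floating-point:
 * estimate the quotient via a reciprocal, then correct any small
 * error exactly in integer arithmetic. Requires b != 0. */
static uint32_t udiv_via_float(uint32_t a, uint32_t b) {
    double r = 1.0 / (double)b;              /* reciprocal estimate   */
    uint32_t q = (uint32_t)((double)a * r);  /* approximate quotient  */
    /* Exact fix-up: nudge q until q == floor(a / b). */
    while ((uint64_t)q * b > a) q--;         /* estimate was too high */
    while (((uint64_t)q + 1) * b <= a) q++;  /* estimate was too low  */
    return q;
}
```

The fix-up loop is what makes the result bit-exact despite the rounding in the reciprocal; hardware sequences refine the estimate (e.g., Newton-Raphson) instead of looping, but the idea is the same.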
https://docs.nvidia.com/cuda/parallel-thread-execution/index...