Take a look at Taichi on GitHub. This Python library does not seem to be very popular or well known, maybe because it was developed in China. But Taichi is simple, compiles directly down to kernels for GPU backends such as CUDA, Metal and Vulkan, and comes with batteries included. It beats the fastest Mojo implementation of the Mandelbrot set by a factor of about 260.
https://github.com/taichi-dev/taichi