You're approaching this from a developer's point of view.
Users absolutely don't care whether their prompt response was generated by a CUDA kernel or by some poorly documented Apple-specific silicon that a poor team in Cupertino almost lost their sanity to while porting the model.
And haven't they already spent quite a bit of money on their PyTorch-like MLX framework?
> Users absolutely don't care whether their prompt response was generated by a CUDA kernel or by some poorly documented Apple-specific silicon
They most certainly will. If you run a GPT-4o-class model on an iPhone with MLX, it will suck. Users will tell you it sucks, and they won't do so in developer-specific terms.
The entire point of this thread is that Apple can't make users happy with their Neural Engine. They require a stopgap cloud solution to make up for the lack of local power on iPhone.
> And haven't they already spent quite a bit of money on their PyTorch-like MLX framework?
As well as the Accelerate framework, Metal Performance Shaders, and previously OpenCL. Apple can't decide where to focus their efforts, least of all in a way that threatens CUDA as a platform.
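To be fair, MLX does earn the "PyTorch-like" label at the API level. A minimal sketch, assuming the mlx package is installed (Apple Silicon only); TinyNet is a made-up toy model, not anything from Apple's examples:

    import mlx.core as mx
    import mlx.nn as nn

    # Toy model; the Module/Linear API deliberately mirrors torch.nn.
    class TinyNet(nn.Module):
        def __init__(self):
            super().__init__()
            self.fc = nn.Linear(4, 2)   # (input_dims, output_dims), as in torch

        def __call__(self, x):          # MLX uses __call__ where torch uses forward()
            return nn.relu(self.fc(x))

    x = mx.random.normal((1, 4))        # mx arrays stand in for torch tensors
    print(TinyNet()(x))                 # unified memory: no .to("cuda") dance

None of which helps when the bottleneck is the silicon itself rather than the software stack.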