Yeah, and I think people forget all the time that inference (usually batch_size=1) is memory-bandwidth bound, while training (with its typically enormous batch sizes) is usually compute bound.
And while the Mac Studio has a lot of memory bandwidth compared to most desktop CPUs, it isn't comparable to consumer GPUs (the 3090 has a bandwidth of ~936 GB/s), let alone those with HBM.
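The bandwidth-vs-compute split falls out of a quick roofline estimate: for a dense layer, arithmetic intensity (FLOPs per byte of weights moved) scales roughly with batch size, so batch_size=1 sits far below a GPU's machine balance while a big training batch sits above it. Here's a minimal back-of-envelope sketch, assuming fp16 weights, ignoring activation traffic, and using ballpark 3090 numbers (~71 TFLOP/s fp16 tensor throughput, the ~936 GB/s bandwidth mentioned above):

```python
def arithmetic_intensity(n, batch, bytes_per_param=2):
    """FLOPs per byte for an n x n dense layer at a given batch size.

    Assumes weight traffic dominates (activations ignored) and fp16
    params (2 bytes each) -- a rough sketch, not a profiler.
    """
    flops = 2 * n * n * batch            # one multiply-accumulate per weight per sample
    bytes_moved = n * n * bytes_per_param  # read the weight matrix once
    return flops / bytes_moved

# Machine balance for ballpark 3090-like numbers: ~71e12 FLOP/s / 936e9 B/s
balance = 71e12 / 936e9  # ~76 FLOP/byte

for b in (1, 8, 256):
    ai = arithmetic_intensity(4096, b)
    bound = "compute" if ai > balance else "bandwidth"
    print(f"batch={b:4d}  intensity={ai:6.0f} FLOP/byte  -> {bound}-bound")
```

With these assumptions, intensity is just the batch size: batch 1 gives 1 FLOP/byte (hopelessly bandwidth-bound), while batch 256 clears the ~76 FLOP/byte balance and becomes compute-bound, which is exactly why big batches matter for training throughput.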
I really don't hear about anyone training on anything besides NVIDIA GPUs. There are too many useful features like mixed-precision training, and don't even get me started on software issues.