Inferencing is memory bandwidth bound. Add more GPUs on a batch size 1 inference problem and watch it run no faster than the memory bandwidth of a single GPU. It does not scale across the number of GPUs. If it could, you would see clusters of Nvidia hardware outperforming Cerebras’ hardware. That is currently a fantasy.
> Memory bandwidth for inferencing does not scale with the number of GPU
It does