In my experience llama.cpp doesn't take as much advantage of parallelism as it could. I tested this on an HPC cluster: increasing the thread count certainly increased CPU usage, but it did not meaningfully improve tok/s past 6-8 cores. Same behavior with whisper.cpp. :( I wonder if there's another backend that scales better.
I'm guessing the problem is that you're constrained by memory bandwidth rather than compute, and that this is inherent to the algorithm, not an artifact of any one implementation. During token generation every weight has to be streamed from RAM once per token, so throughput is roughly capped at memory bandwidth divided by model size: a ~4 GB 4-bit-quantized 7B model on ~50 GB/s of DDR tops out around 12 tok/s no matter how many cores you throw at it. A handful of cores is usually enough to saturate that bandwidth, which would explain the plateau you saw at 6-8.
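One way to check whether that's what's happening on your cluster: measure raw memory bandwidth at different thread counts and see if it flattens at the same point your tok/s did. A minimal sketch (assuming gcc/clang with OpenMP; the STREAM-style triad kernel and array sizes are my own arbitrary choices, not anything from llama.cpp):

    /* Rough bandwidth-vs-threads probe (illustrative sketch, not llama.cpp code).
       Build: cc -O2 -fopenmp bw.c -o bw */
    #include <omp.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define N (1u << 26)   /* 64M doubles per array (~512 MiB), well past any cache */
    #define REPS 5

    int main(void) {
        double *a = malloc((size_t)N * sizeof *a);
        double *b = malloc((size_t)N * sizeof *b);
        if (!a || !b) return 1;
        for (size_t i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }

        for (int t = 1; t <= omp_get_max_threads(); t *= 2) {
            omp_set_num_threads(t);
            double best = 0.0;
            for (int r = 0; r < REPS; r++) {
                double t0 = omp_get_wtime();
                /* STREAM-style triad: 2 loads + 1 store per element,
                   so memory traffic = 3 * N * sizeof(double) bytes */
                #pragma omp parallel for
                for (size_t i = 0; i < N; i++)
                    a[i] = a[i] + 0.5 * b[i];
                double gbs = 3.0 * N * sizeof(double) / (omp_get_wtime() - t0) / 1e9;
                if (gbs > best) best = gbs;
            }
            printf("%2d threads: %6.1f GB/s\n", t, best);
        }
        free(a); free(b);
        return 0;
    }

If the GB/s curve flattens around the same 6-8 threads where your tok/s did, you're hitting the bandwidth wall, and a different backend won't change that; pinning threads to one NUMA node or moving to hardware with more memory channels can.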