
LLM inference is bottlenecked by memory bandwidth. You'll probably get identical speed with cheaper CPUs.
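
Rough back-of-the-envelope (my own illustrative numbers, assuming a ~4 GB quantized model and dual-channel DDR5-5600):

    # Sketch: every generated token has to stream (roughly) the whole set of
    # weights from RAM, so peak memory bandwidth caps tokens/s.
    # All numbers below are assumptions, not measurements.
    model_gb = 4.0                          # e.g. ~7B params at ~4-bit quantization
    bandwidth_gbs = 2 * 8 * 5600e6 / 1e9    # dual-channel DDR5-5600, ~89.6 GB/s peak
    print(bandwidth_gbs / model_gb)         # ~22 tokens/s upper bound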


I'd like to see some benchmarks. For one thing, I suspect you'd at least want an X3D model on the AMD side, for the larger cache. But for another, at least according to top, llama.cpp does seem to saturate all of the cores during inference. (I didn't experiment much, though; the multi-CCD X3D CPUs don't put "3D V-Cache" on every core, so it's possible that limiting inference to just the V-Cache cores would be beneficial.)

For me it's OK though, since I want faster compile times anyway, so it's worth the money. To me local LLMs are just a curiosity.

edit: Interesting information here. https://old.reddit.com/r/LocalLLaMA/comments/14ilo0t/extensi...

> RAM speed does not matter. The processing time is identical with DDR-6000 and DDR-4000 RAM.

You'd really expect DDR5-6000 to be advantageous. I think AMD Ryzen 7xxx can take advantage of memory speeds up to at least 5600. Does it perhaps not wind up bottlenecking on memory? Maybe quantization plays a role...
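
For reference, the theoretical gap is big (my own arithmetic, dual-channel assumed):

    # Peak dual-channel bandwidth scales linearly with transfer rate, so purely
    # bandwidth-bound generation should be ~1.5x faster on DDR5-6000.
    ddr5_4000 = 2 * 8 * 4000e6 / 1e9    # ~64 GB/s
    ddr5_6000 = 2 * 8 * 6000e6 / 1e9    # ~96 GB/s
    print(ddr5_6000 / ddr5_4000)        # 1.5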


The big cache is irrelevant for this use case. You're memory bandwidth bound, with a substantial portion of the model read for each token, so even a 128 MB cache doesn't help.
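
To put rough numbers on it (again assuming a ~4 GB quantized model):

    # Even a 128 MB L3 holds only a small fraction of the weights streamed for
    # every token, so the hit rate on weight reads stays low. Assumed sizes.
    cache_bytes = 128e6
    model_bytes = 4e9                    # ~7B model at ~4-bit quantization
    print(cache_bytes / model_bytes)     # ~0.03, i.e. ~3% of the per-token traffic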


>> RAM speed does not matter. The processing time is identical with DDR-6000 and DDR-4000 RAM.

That's referring specifically to prompt processing, which uses a batch-processing optimization that normal token-by-token inference doesn't get. The processed prompt can also be cached, so you only need to process it again if it changes. Normal inference does benefit from faster RAM.
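
A rough sketch of why the batching matters (illustrative numbers, not measurements): prefill amortizes each weight read over the whole prompt, so it tends to be compute-bound rather than bandwidth-bound.

    # Prompt processing pushes the whole prompt through in one batched pass,
    # so the weights are streamed once for many tokens; generation re-reads
    # the weights for each new token. Assumed sizes below.
    model_bytes = 4e9                                  # ~4 GB quantized model
    prompt_tokens = 512                                # assumed prompt length
    per_token_prefill = model_bytes / prompt_tokens    # ~8 MB of weight traffic per token
    per_token_decode = model_bytes                     # ~4 GB per generated token
    print(per_token_decode / per_token_prefill)        # 512x less RAM pressure in prefill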


Yep, get the fastest memory you can.

I wish there were affordable platforms with quad-channel DDR5.
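
Rough numbers (assuming DDR5-5600 either way):

    # Four channels doubles peak bandwidth over two, which should roughly
    # double bandwidth-bound generation speed. Illustrative figures only.
    dual = 2 * 8 * 5600e6 / 1e9    # ~89.6 GB/s
    quad = 4 * 8 * 5600e6 / 1e9    # ~179.2 GB/s
    print(quad / dual)             # 2.0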


The larger cache on those 3D V-Cache CPUs should play some sort of role, though.

I can only speculate that it would help mitigate the latency from loose timings on a fast memory overclock, among other things.



