A Framework Desktop exposes 96GB of RAM for inference and costs a few thousand USD.

michaelanckaert · 2025-09-24T10:36:01 1758710161

You need memory on the GPU, not in the system itself (unless you have unified memory, as on Apple's M-series chips). So we're talking about cards like the H200, which have 141GB of memory and cost between 25k and 40k.

Borealid · 2025-09-24T10:45:14 1758710714

Did you casually glance at how the hardware in the Framework Desktop (Strix Halo) works before commenting?

michaelanckaert · 2025-09-24T11:15:03 1758712503

I didn't glance at it, I read it :-) The architecture is a 'unified memory bus', so yes, the GPU has access to that memory.

My comment was a bit unfortunate as it implied I didn't agree with yours, sorry for that. I simply want to clarify that there's a difference between 'GPU memory' and 'system memory'.

The Framework Desktop is a nice deal. I wouldn't buy the Ryzen AI+ myself; from what I read, it maxes out at about 60 tokens/sec, which is low for my use cases.

ramon156 · 2025-09-24T11:23:13 1758712993

These don't run 200B models at all; results show they can run 13B at best. 70B is ~3 tok/s according to someone on Reddit.
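For anyone wanting to sanity-check which model sizes could even fit, here is a rough sketch of the weight-footprint arithmetic. The 8 GB overhead for KV cache, activations and the OS is a made-up allowance, and the function names are just for illustration; real usage depends on context length and runtime.

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9


def fits(params_billion: float, bits_per_weight: float,
         memory_gb: float, overhead_gb: float = 8.0) -> bool:
    """True if weights plus an assumed overhead fit in memory_gb.

    overhead_gb is a hypothetical allowance, not a measured value.
    """
    return weight_footprint_gb(params_billion, bits_per_weight) + overhead_gb <= memory_gb


if __name__ == "__main__":
    memory_gb = 96  # the figure quoted at the top of the thread
    for size in (13, 70, 200):
        for bits in (16, 8, 4):
            gb = weight_footprint_gb(size, bits)
            verdict = "fits" if fits(size, bits, memory_gb) else "does not fit"
            print(f"{size}B @ {bits}-bit: ~{gb:.0f} GB of weights, {verdict} in {memory_gb} GB")
```

This only says whether a model fits, not how fast it runs.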

Borealid · 2025-09-24T12:16:12 1758716172

I don't know where you've got those numbers, but they're wrong.

https://www.reddit.com/r/LocalLLaMA/comments/1n79udw/inferen... seems comparable to the Framework Desktop and reputable - they didn't just quote a number, they showed benchmark output.

I get far more than 3 t/s for a 70B model on normal non-unified RAM, so that figure is implausibly low for a unified memory architecture like Strix Halo.

mhast · 2025-09-25T05:38:16 1758778696

It depends on the model.

It's typically ok for MoE models, but if you try to run something non-MoE the speed will plummet. In that same thread there are people getting 50 tok/s on MoE models and 5 on non-MoE. (https://www.reddit.com/r/LocalLLaMA/comments/1n79udw/comment...)

And while it has unified memory, the memory is quite slow: about 250 GB/s, compared to 500+ GB/s for an M4 Max or 1800 GB/s for a 5090. So it's fast for a CPU, but pretty slow for a GPU.
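A back-of-envelope way to see why bandwidth matters so much, and why MoE does better: during decoding, each generated token has to read roughly every active weight once, so bandwidth divided by bytes read per token gives a ceiling on tokens/sec. The sketch below assumes 4-bit weights and a hypothetical MoE with 5B active parameters; it ignores KV cache reads, compute limits, and whether the model fits on the device at all, so real numbers will be lower.

```python
def decode_upper_bound_tok_s(bandwidth_gb_s: float,
                             active_params_billion: float,
                             bits_per_weight: float) -> float:
    """Ceiling on decode speed if every active weight is read once per token."""
    gb_read_per_token = active_params_billion * bits_per_weight / 8
    return bandwidth_gb_s / gb_read_per_token


# Bandwidth figures as quoted in the comment above (GB/s).
BANDWIDTH_GB_S = {"Strix Halo": 250, "M4 Max": 500, "5090": 1800}

# Active parameters in billions. The MoE entry is a hypothetical example.
MODELS = {"dense 70B": 70, "MoE with 5B active (hypothetical)": 5}

if __name__ == "__main__":
    bits = 4  # assumed quantization
    for device, bandwidth in BANDWIDTH_GB_S.items():
        for name, active in MODELS.items():
            bound = decode_upper_bound_tok_s(bandwidth, active, bits)
            print(f"{device:>10} | {name}: at most ~{bound:.0f} tok/s")
```

Under these assumptions a dense 70B model tops out at around 7 tok/s on 250 GB/s, while the hypothetical MoE has a ceiling of about 100 tok/s, which is in line with the gap people report in that thread.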

(That said, there are not a lot of cheap options for running large models locally. They all have significant compromises.)