
I don't know where you've got those numbers, but they're wrong.

https://www.reddit.com/r/LocalLLaMA/comments/1n79udw/inferen... seems comparable to the Framework Desktop and reputable - they didn't just quote a number, they showed benchmark output.

I get far more than 3 t/s for a 70B model on normal non-unified RAM, so 3 t/s is implausibly low for a unified memory architecture like Halo.



It depends on the model.

It's typically OK for MoE models, but if you try to run something non-MoE the speed will plummet. In that same thread there are people getting 50 tok/s on MoE models and 5 tok/s on non-MoE ones. (https://www.reddit.com/r/LocalLLaMA/comments/1n79udw/comment...)

And while it has unified memory, that memory is quite slow: 250 GB/s, compared to 500+ GB/s for an M4 Max or 1800 GB/s for a 5090. So it's fast for a CPU, but pretty slow for a GPU.

(That said, there are not a lot of cheap options for running large models locally. They all have significant compromises.)
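
The MoE vs. non-MoE gap and the bandwidth figures above fit a common rule of thumb: when decoding is memory-bound, tokens per second is at most memory bandwidth divided by the bytes of weights read per token. A rough sketch of that arithmetic follows. The 4-bit quantization (0.5 bytes per parameter) and the 5B-active MoE model are my assumptions for illustration, not numbers from the thread, and the estimate ignores compute, KV cache traffic, and whether the model fits in memory at all (a 4-bit 70B model would not fit in a 5090's VRAM).

```python
def decode_tokens_per_sec(bandwidth_gb_s, active_params_b, bytes_per_param):
    """Rough upper bound on decode speed for a memory-bound model.

    Assumes every active weight is read from memory once per generated token.
    active_params_b is in billions, so params * bytes gives GB per token.
    """
    gb_read_per_token = active_params_b * bytes_per_param
    return bandwidth_gb_s / gb_read_per_token


# Bandwidth figures as quoted in the comment above (GB/s).
BANDWIDTH_GB_S = {"Halo": 250, "M4 Max": 500, "5090": 1800}

# Assumption: roughly 4-bit quantized weights.
BYTES_PER_PARAM = 0.5

for name, bw in BANDWIDTH_GB_S.items():
    # Dense 70B model: all 70B parameters are read for every token.
    dense = decode_tokens_per_sec(bw, 70, BYTES_PER_PARAM)
    # Hypothetical MoE model with 5B active parameters per token.
    moe = decode_tokens_per_sec(bw, 5, BYTES_PER_PARAM)
    print(f"{name}: dense 70B <= {dense:.1f} tok/s, MoE (5B active) <= {moe:.1f} tok/s")
```

Under those assumptions the ceiling for a dense 70B model at 250 GB/s is in the single digits, while a MoE model that only touches a few billion parameters per token has a ceiling an order of magnitude higher. Real-world numbers land below these bounds.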



