
I hate to say it, but reasoning models simply aren't suited for edge computing. I just ran some tests on this model, and even at 4-bit weight quantisation it blows past 10 GB of VRAM after only ~1000 tokens of reasoning. So even on a dedicated ML edge device like a $250 Jetson, you'll run out of memory before the model formulates a real answer. You need a high-end GPU to make full use of it for short answers, and an enterprise-grade system to support longer contexts. And with reasoning turned off, I don't see any meaningful improvement over older models.

So this is primarily useful for enterprises that want to run on-prem on a limited budget, and maybe for high-end enthusiasts.
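For anyone who wants to sanity-check the memory math, here's a back-of-envelope sketch. The parameter count and layer/head numbers below are assumed placeholders, not this model's actual config; plug in the real values from the model card:

    # Rough VRAM estimate: quantised weights + fp16 KV cache.
    # All hyperparameters here are assumed placeholders.
    n_params    = 14e9  # assumed parameter count
    weight_bits = 4     # 4-bit weight quantisation
    n_layers    = 36    # assumed transformer depth
    n_kv_heads  = 8     # assumed GQA key/value heads
    head_dim    = 128   # assumed per-head dimension

    weights_gib = n_params * weight_bits / 8 / 2**30

    def kv_gib(n_tokens, bytes_per_elem=2):  # fp16 = 2 bytes/element
        # K and V each hold n_layers * n_kv_heads * head_dim values per token
        per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
        return per_token * n_tokens / 2**30

    for tokens in (1_000, 8_000, 32_000):
        total = weights_gib + kv_gib(tokens)
        print(f"{tokens:>6} reasoning tokens -> ~{total:.1f} GiB (plus activations/overhead)")

The weight term is fixed, but the KV term grows linearly with every reasoning token, which is what eats the headroom on a 10-16 GB device.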



You should use flash attention with KV cache quantization. I routinely run Qwen 3 14B with the full 128K context and it fits in under 24 GB of VRAM. On my Pixel 8, I've successfully run Qwen 3 4B with an 8K context (again with flash attention and KV cache quantization).
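If it helps anyone reproduce that setup, here's a minimal sketch via llama-cpp-python. I'm assuming a recent build whose constructor exposes the flash_attn and KV-cache type kwargs, and the model path is a placeholder:

    # pip install llama-cpp-python (built with GPU acceleration)
    import llama_cpp

    llm = llama_cpp.Llama(
        model_path="qwen3-14b-q4_k_m.gguf",  # placeholder path to a 4-bit GGUF
        n_ctx=131072,                        # full 128K context window
        n_gpu_layers=-1,                     # offload all layers to the GPU
        flash_attn=True,                     # llama.cpp needs this for a quantized V cache
        type_k=llama_cpp.GGML_TYPE_Q8_0,     # 8-bit K cache, ~half the fp16 footprint
        type_v=llama_cpp.GGML_TYPE_Q8_0,     # 8-bit V cache
    )

    out = llm("Explain KV cache quantization in one sentence.", max_tokens=64)
    print(out["choices"][0]["text"])

At q8_0 the cache is roughly half the fp16 size, and q4_0 halves it again at some quality cost; that's where the headroom for a 128K context comes from.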


>On my Pixel 8, I've successfully used Qwen 3 4B

How many tokens/s? I can't imagine this runs at any practical speed.



