
I hate to say it, but reasoning models simply aren't suited for edge computing. I just ran some tests on this model, and even at 4-bit weight quantisation it blows past 10 GB of VRAM after only ~1000 tokens of reasoning. So even on a dedicated ML edge device like a $250 Jetson, you'll run out of memory before the model formulates a real answer. You need a high-end GPU to make full use of it for short answers, and an enterprise-grade system to support longer contexts. And with reasoning turned off, I don't see any meaningful improvement over older models.

So this is primarily useful for enterprises that want to run on-prem on a limited budget, and maybe for high-end enthusiasts.
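For anyone who wants to sanity-check the memory math, here's a back-of-envelope sketch. The parameter count and layer/head numbers below are assumed placeholders, not this model's actual config; plug in the real values from the model card:

    # Rough VRAM estimate: quantised weights + fp16 KV cache.
    # All hyperparameters here are assumed placeholders.
    n_params    = 14e9  # assumed parameter count
    weight_bits = 4     # 4-bit weight quantisation
    n_layers    = 36    # assumed transformer depth
    n_kv_heads  = 8     # assumed GQA key/value heads
    head_dim    = 128   # assumed per-head dimension

    weights_gib = n_params * weight_bits / 8 / 2**30

    def kv_gib(n_tokens, bytes_per_elem=2):  # fp16 = 2 bytes/element
        # K and V each hold n_layers * n_kv_heads * head_dim values per token
        per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
        return per_token * n_tokens / 2**30

    for tokens in (1_000, 8_000, 32_000):
        total = weights_gib + kv_gib(tokens)
        print(f"{tokens:>6} reasoning tokens -> ~{total:.1f} GiB (plus activations/overhead)")

The weight term is fixed, but the KV term grows linearly with every reasoning token, which is what eats the headroom on a 10-16 GB device.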



You should use flash attention with KV cache quantization. I routinely run Qwen 3 14B with the full 128K context and it fits in under 24 GB of VRAM. On my Pixel 8, I've successfully run Qwen 3 4B with an 8K context (again with flash attention and KV cache quantization).
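If it helps anyone reproduce that setup, here's a minimal sketch via llama-cpp-python. I'm assuming a recent build whose constructor exposes the flash_attn and KV-cache type kwargs, and the model path is a placeholder:

    # pip install llama-cpp-python (built with GPU acceleration)
    import llama_cpp

    llm = llama_cpp.Llama(
        model_path="qwen3-14b-q4_k_m.gguf",  # placeholder path to a 4-bit GGUF
        n_ctx=131072,                        # full 128K context window
        n_gpu_layers=-1,                     # offload all layers to the GPU
        flash_attn=True,                     # llama.cpp needs this for a quantized V cache
        type_k=llama_cpp.GGML_TYPE_Q8_0,     # 8-bit K cache, ~half the fp16 footprint
        type_v=llama_cpp.GGML_TYPE_Q8_0,     # 8-bit V cache
    )

    out = llm("Explain KV cache quantization in one sentence.", max_tokens=64)
    print(out["choices"][0]["text"])

At q8_0 the cache is roughly half the fp16 size, and q4_0 halves it again at some quality cost; that's where the headroom for a 128K context comes from.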


>On my Pixel 8, I've successfully used Qwen 3 4B

How many tokens/s? I can't imagine this runs at any practical speed.



