
You should use flash attention with KV cache quantization. I routinely use Qwen 3 14B with the full 128k context and it fits in under 24 GB VRAM. On my Pixel 8, I've successfully used Qwen 3 4B with 8K context (again with flash attention and KV cache quantization).
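The parent doesn't say which runtime, so as one concrete possibility (an assumption, not necessarily what they run), here is roughly how those settings look with llama-cpp-python; the equivalent llama.cpp CLI flags are -fa, -ctk/-ctv, and -c. Parameter names can shift between versions, so treat this as a sketch.

    import llama_cpp

    llm = llama_cpp.Llama(
        model_path="Qwen3-14B-Q4_K_M.gguf",  # hypothetical GGUF file name
        n_ctx=131072,                        # full 128k context window
        flash_attn=True,                     # enable flash attention
        type_k=llama_cpp.GGML_TYPE_Q8_0,     # quantize the K cache to q8_0
        type_v=llama_cpp.GGML_TYPE_Q8_0,     # quantize the V cache to q8_0
        n_gpu_layers=-1,                     # offload all layers to the GPU
    )

    out = llm("Summarize flash attention in one sentence:", max_tokens=64)
    print(out["choices"][0]["text"])

Note that in llama.cpp, quantizing the V cache generally requires flash attention to be enabled, which is why the two options go together.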


>On my Pixel 8, I've successfully used Qwen 3 4B

How many tokens/s? I can't imagine that this would run in any practical way.



