The trick with llama.cpp and our dynamic quants is that you can actually offload the model to RAM or even an SSD! If your GPU VRAM + RAM + SSD together exceed the model size (say ~90GB for the dynamic 2-bit quant), then it'll run well!
I.e. you can actually run it on a local desktop or even your laptop now! You don't need a 90GB GPU, for example; a 24GB GPU plus 64GB to 128GB of RAM will do.
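As a rough sketch of what partial offload looks like in practice (assuming the llama-cpp-python bindings and a hypothetical GGUF file name, not our exact setup): you just tell llama.cpp how many layers to keep on the GPU, and the rest live in system RAM, with mmap letting the OS page weights in from SSD as needed:

  # Sketch only: partial GPU offload via the llama-cpp-python bindings.
  # The model path and layer count are placeholders -- tune n_gpu_layers
  # so the layers kept on the GPU fit in your VRAM; the remaining layers
  # stay in system RAM, and mmap'd weights can be paged in from SSD.
  from llama_cpp import Llama

  llm = Llama(
      model_path="dynamic-2bit-quant.gguf",  # hypothetical GGUF file name
      n_gpu_layers=20,   # layers offloaded to the GPU (e.g. for a 24GB card)
      n_ctx=8192,        # context length
      use_mmap=True,     # memory-map weights so the OS can page from disk
  )

  out = llm("Explain quantization in one paragraph.", max_tokens=128)
  print(out["choices"][0]["text"])

The same idea applies to the llama.cpp CLI: raise or lower the number of GPU layers until VRAM is full but not overflowing.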
Speeds are around 3 to 5 tokens per second, so still OK! I write more about improving speed on local devices here: https://docs.unsloth.ai/basics/qwen3-how-to-run-and-fine-tun...