The trick with llama.cpp and our dynamic quants is that you can actually offload the model to RAM or even an SSD! If your GPU VRAM + RAM + SSD together exceed the model size (say ~90GB for the dynamic 2-bit quant), then it'll run well!
I.e. you can actually run it on a local desktop or even your laptop now! You don't need a 90GB GPU, for example; a 24GB GPU plus 64GB to 128GB of RAM will do.
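As a rough sketch of what partial offload looks like in practice (assuming the llama-cpp-python bindings and a hypothetical GGUF file name, not our exact setup): you just tell llama.cpp how many layers to keep on the GPU, and the rest live in system RAM, with mmap letting the OS page weights in from SSD as needed:

  # Sketch only: partial GPU offload via the llama-cpp-python bindings.
  # The model path and layer count are placeholders -- tune n_gpu_layers
  # so the layers kept on the GPU fit in your VRAM; the remaining layers
  # stay in system RAM, and mmap'd weights can be paged in from SSD.
  from llama_cpp import Llama

  llm = Llama(
      model_path="dynamic-2bit-quant.gguf",  # hypothetical GGUF file name
      n_gpu_layers=20,   # layers offloaded to the GPU (e.g. for a 24GB card)
      n_ctx=8192,        # context length
      use_mmap=True,     # memory-map weights so the OS can page from disk
  )

  out = llm("Explain quantization in one paragraph.", max_tokens=128)
  print(out["choices"][0]["text"])

The same idea applies to the llama.cpp CLI: raise or lower the number of GPU layers until VRAM is full but not overflowing.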
Speeds are around 3 to 5 tokens per second, so still OK! I write more about improving speed on local devices here: https://docs.unsloth.ai/basics/qwen3-how-to-run-and-fine-tun...