With code modifications, it should be possible to run this on a very modest machine as long as you're happy for performance to suck. Transformer models typically need to read all of the weights for every 'word' of output, so if your model is 20GB and you don't have enough RAM or VRAM but do have an SSD that reads 1GB/sec, expect an output speed of about 3 words per minute.
However, code changes are necessary to achieve that, although they won't be crazy complex.
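For concreteness, a back-of-the-envelope sketch of that arithmetic (the 20GB model and 1GB/s SSD are just the numbers above):

    # Rough bandwidth-bound estimate: every generated token streams all the weights once.
    model_size_gb = 20      # weights that don't fit in RAM/VRAM
    ssd_read_gb_s = 1.0     # SSD sequential read throughput

    seconds_per_token = model_size_gb / ssd_read_gb_s    # 20 s per token
    print(60 / seconds_per_token, "tokens per minute")   # -> 3.0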
There is a neat potential speedup here for the case where the bandwidth to your model weights is the limiting factor.
If you have a guess at what the model will output, you can verify that guess very cheaply, since the verification can be done in parallel.
That means there is the possibility of keeping a highly quantized small model in RAM and only consulting the big model from time to time. You might be able to get a 10x speedup this way if your small model agrees 90% of the time.
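A minimal sketch of that idea in Python, assuming hypothetical small_model/big_model interfaces and greedy decoding (the published variants accept or reject drafts probabilistically, but the structure is the same): draft a few tokens cheaply, then check them all in one pass of the big model.

    def speculative_step(big_model, small_model, tokens, k=8):
        # 1. Draft k tokens cheaply with the small in-RAM model (greedy for simplicity).
        draft = list(tokens)
        for _ in range(k):
            draft.append(small_model.next_token(draft))

        # 2. A single pass of the big model predicts the next token at every
        #    position of the draft, so its weights are streamed from disk once
        #    instead of once per generated token.
        preds = big_model.next_token_at_each_position(draft)  # preds[i] follows draft[:i+1]

        # 3. Keep drafted tokens while the big model agrees; at the first
        #    disagreement, substitute the big model's choice and stop.
        out = list(tokens)
        for i in range(len(tokens), len(draft)):
            if draft[i] == preds[i - 1]:
                out.append(draft[i])
            else:
                out.append(preds[i - 1])
                break
        else:
            out.append(preds[-1])  # all drafts accepted: the big model supplies one bonus token
        return out

Every accepted draft token is one fewer full read of the big model's weights, which is where the speedup comes from when bandwidth is the bottleneck.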
It looks like a description of Speculative Sampling. There's a recent paper from DeepMind about this in the context of LLMs [0], although it's not a completely new idea, of course.
The potential speedup according to their paper is closer to 2x than 10x, however.
They are saying you can run it on a CPU by doing this:
> However, code changes are necessary to achieve that, although they won't be crazy complex.
This is technically true. It will be very slow though.
However, give it 6 months and I think we might see an order of magnitude increase in speed on CPUs. This will still be too slow to be very useful though.
Should be like an order of magnitude faster than trying to run it from an NVMe still, no? I've run some small Flan models from RAM and it was fine, but yeah, it's not exactly realtime.
I just tried this on the 7B model. Steady-state single-threaded CPU performance of 23 seconds/token on a Ryzen 5800X (I'm not sure why it's only using a single thread... usually these libraries automatically use more) and 14GB of RAM. It used more than double that amount of RAM while loading the model, and the first token took 183 seconds (potentially it's doing more work to parse the prompt that I'm not measuring properly).
Why 3 words per minute as opposed to per second? Is that a typo? If you have enough RAM (but not VRAM), does it basically become limited by the PCIe lanes? So for the 112GB model with a Gen 5 GPU (64 GB/s of PCIe bandwidth), that would be roughly 2 seconds per word, right?
Yep. It's expensive to spin up an A100 80GB instance, but not THAT expensive. Oracle's cloud offering (the first thing to show up in a Google search; I know you probably won't use them, and it seems extra expensive) is $4.00 per hour. If you are motivated to screw around with this stuff, there are definitely options.
You can (slowly) run BLOOM 3b and 7b1 on the free (trial) tier of Google Cloud Compute if you use the low_cpu_mem_usage parameter of from_pretrained.
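Something along these lines, if anyone wants to reproduce it (the model ID and prompt are just examples; newer transformers versions may also need the accelerate package installed for this option):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "bigscience/bloom-7b1"  # or "bigscience/bloom-3b"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # low_cpu_mem_usage avoids holding a second temporary copy of the weights while loading
    model = AutoModelForCausalLM.from_pretrained(model_id, low_cpu_mem_usage=True)

    inputs = tokenizer("The capital of France is", return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=20)
    print(tokenizer.decode(output[0], skip_special_tokens=True))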