With code modifications, it should be possible to run this on a very modest machine as long as you're happy for performance to suck. Transformer models typically need to read all of the weights for every 'word' of output, so if your model is 20GB and you don't have enough RAM or VRAM but do have an SSD that reads 1GB/sec, expect an output speed of about 3 words per minute.
However, code changes are necessary to achieve that, although they won't be crazy complex.
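For concreteness, a back-of-the-envelope sketch of that arithmetic (the 20GB model and 1GB/s SSD are just the numbers above):

    # Rough bandwidth-bound estimate: every generated token streams all the weights once.
    model_size_gb = 20      # weights that don't fit in RAM/VRAM
    ssd_read_gb_s = 1.0     # SSD sequential read throughput

    seconds_per_token = model_size_gb / ssd_read_gb_s    # 20 s per token
    print(60 / seconds_per_token, "tokens per minute")   # -> 3.0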
There is a neat potential speedup here for the case where the bandwidth to your model weights is the limiting factor.
If you have a guess at what the model will output, you can verify that guess very cheaply, since the verification can be done in parallel.
That means there is the possibility of keeping a highly quantized small model in RAM and only consulting the big model from time to time. You might be able to get a 10x speedup this way if your small model agrees 90% of the time.
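A minimal sketch of that idea in Python, assuming hypothetical small_model/big_model interfaces and greedy decoding (the published variants accept or reject drafts probabilistically, but the structure is the same): draft a few tokens cheaply, then check them all in one pass of the big model.

    def speculative_step(big_model, small_model, tokens, k=8):
        # 1. Draft k tokens cheaply with the small in-RAM model (greedy for simplicity).
        draft = list(tokens)
        for _ in range(k):
            draft.append(small_model.next_token(draft))

        # 2. A single pass of the big model predicts the next token at every
        #    position of the draft, so its weights are streamed from disk once
        #    instead of once per generated token.
        preds = big_model.next_token_at_each_position(draft)  # preds[i] follows draft[:i+1]

        # 3. Keep drafted tokens while the big model agrees; at the first
        #    disagreement, substitute the big model's choice and stop.
        out = list(tokens)
        for i in range(len(tokens), len(draft)):
            if draft[i] == preds[i - 1]:
                out.append(draft[i])
            else:
                out.append(preds[i - 1])
                break
        else:
            out.append(preds[-1])  # all drafts accepted: the big model supplies one bonus token
        return out

Every accepted draft token is one fewer full read of the big model's weights, which is where the speedup comes from when bandwidth is the bottleneck.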
It looks like a description of Speculative Sampling. There's a recent paper from DeepMind about this in the context of LLMs [0], although it's not a completely new idea, of course.
The potential speedup according to their paper is closer to 2x than 10x, however.
They are saying you can run it on a CPU by doing this:
> However, code changes are necessary to achieve that, although they won't be crazy complex.
This is technically true. It will be very slow though.
However, give it 6 months and I think we might see an order of magnitude increase in speed on CPUs. This will still be too slow to be very useful though.
Should be like an order of magnitude faster than trying to run it from an NVMe still, no? I've run some small Flan models from RAM and it was fine, but yeah, it's not exactly realtime.
I just tried this on the 7B model. Steady-state single-threaded CPU performance of 23 seconds/token on a Ryzen 5800X (I'm not sure why it's only using a single thread... usually these libraries automatically use more) and 14GB of RAM. It used more than double that amount of RAM while loading the model, and the first token took 183 seconds (potentially it's doing more work to parse the prompt that I'm not measuring properly).
Why 3 words per minute as opposed to per second? Is that a typo? If you have enough RAM (but not VRAM), does it basically become limited by the PCIe lanes? So for the 112GB model with a Gen 5 GPU (64 GB/s of PCIe bandwidth), that would be roughly 2 seconds per word, right?
Yep. It's expensive to spin up an A100 80GB instance, but not THAT expensive. Oracle's cloud offering (the first thing to show up in a Google search; I know you probably won't use them, and it seems extra expensive) is $4.00 per hour. If you are motivated to screw around with this stuff, there are definitely options.
You can (slowly) run BLOOM 3b and 7b1 on the free (trial) tier of Google Cloud Compute if you use the low_cpu_mem_usage parameter of from_pretrained.
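Something along these lines, if anyone wants to reproduce it (the model ID and prompt are just examples; newer transformers versions may also need the accelerate package installed for this option):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "bigscience/bloom-7b1"  # or "bigscience/bloom-3b"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # low_cpu_mem_usage avoids holding a second temporary copy of the weights while loading
    model = AutoModelForCausalLM.from_pretrained(model_id, low_cpu_mem_usage=True)

    inputs = tokenizer("The capital of France is", return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=20)
    print(tokenizer.decode(output[0], skip_special_tokens=True))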