Hell, I'd love to be able to buy a $30k server to run these models. I think running BLOOM required something more along the lines of a $200k server.


With code modifications, it should be possible to run this on a very modest machine, as long as you're happy for performance to suck. Transformer models typically need to read all of the weights for every 'word' they output, so if your model is 20GB and you don't have enough RAM or VRAM, but you do have an SSD that reads at 1GB/sec, expect an output speed of around 3 words per minute (back-of-the-envelope sketch below).

However, code changes are necessary to achieve that, although they won't be crazy complex.
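
For anyone who wants the arithmetic spelled out, a quick sketch. The 50 GB/s and 1000 GB/s figures are only ballpark assumptions for system RAM and GPU VRAM bandwidth, and it assumes every weight is streamed once per generated token (roughly true for non-batched decoding):

    # Tokens/sec when generation is bandwidth-bound: the whole model has to be
    # streamed past the compute once per generated token.
    GB = 10**9

    def tokens_per_second(model_bytes, bandwidth_bytes_per_sec):
        return bandwidth_bytes_per_sec / model_bytes

    print(tokens_per_second(20 * GB, 1 * GB) * 60)   # SSD at 1 GB/s      -> 3 tokens/min
    print(tokens_per_second(20 * GB, 50 * GB))       # desktop RAM-ish    -> 2.5 tokens/s
    print(tokens_per_second(20 * GB, 1000 * GB))     # GPU VRAM, roughly  -> 50 tokens/s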


There is a neat potential speedup here for the case where the bandwidth to your model weights is the limiting factor.

If you have a guess at what the model will output, you can verify that the guess is correct very cheaply, since the verification can be done in parallel.

That means there is the possibility of keeping a highly quantized small model in RAM and only streaming the big model's weights from time to time, in batched verification passes. You might be able to get a 10x speedup this way if your small model agrees 90% of the time.
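
For illustration, a minimal greedy sketch of the trick, assuming `small` and `big` are placeholder callables that return next-token logits for every position of their input. This accept-if-the-argmax-matches variant is simpler than the sampling-correct algorithm, but shows the shape of it:

    import numpy as np

    def speculative_step(big, small, tokens, k=4):
        # 1. Draft k tokens cheaply (and sequentially) with the small model.
        draft = list(tokens)
        for _ in range(k):
            draft.append(int(np.argmax(small(draft)[-1])))

        # 2. One big-model pass scores every drafted position at once, so its
        #    weights are only streamed once for up to k accepted tokens.
        big_logits = big(draft)  # big_logits[i] predicts the token after draft[i]
        out = list(tokens)
        for i in range(len(tokens), len(draft)):
            best = int(np.argmax(big_logits[i - 1]))
            out.append(best)
            if best != draft[i]:  # first disagreement: keep the big model's token, stop
                break
        return out

    # Toy demo: both "models" just predict (previous token + 1) mod 10.
    toy = lambda ids: [np.eye(10)[(t + 1) % 10] for t in ids]
    print(speculative_step(toy, toy, [1, 2, 3]))  # -> [1, 2, 3, 4, 5, 6, 7]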


This is an interesting concept, could you share a paper or some writeup about this?


It looks like a description of Speculative Sampling. There's a recent paper from DeepMind about this in the context of LLMs [0], although it's not a completely new idea, of course.

The potential for speedup according to their paper is closer to 2x than 10x however.

0: https://arxiv.org/abs/2302.01318


The most time/cost-optimal solution is probably to buy 32 or 64 gigs of RAM. That'll still be slow, but most people are already halfway there.


Doesn't it need to be GPU ram?


They are saying you can run it on a CPU by doing this:

> However, code changes are necessary to achieve that, although they won't be crazy complex.

This is technically true. It will be very slow though.

However, give it 6 months and I think we might see an order of magnitude increase in speed on CPUs. This will still be too slow to be very useful though.


That will be very, VERY slow. PCIe bandwidth is way too slow.


Should still be like an order of magnitude faster than trying to run it from an NVMe drive, no? I've run some small FLAN models from RAM and it was fine, but yeah, it's not exactly realtime.


I just tried this on the 7B model. Steady-state single-threaded CPU performance of 23 seconds/token on a Ryzen 5800X (I'm not sure why it's only using a single thread... usually these libraries automatically use more) and 14GB of RAM. It used more than double that amount of RAM while loading the model, and the first token took 183 seconds (potentially it's doing more work to parse the prompt that I'm not measuring properly).
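
If anyone wants to reproduce this, here's the rough shape of what I ran. The checkpoint path is a placeholder for your local converted weights, and CPU dtype support varies, so treat it as a sketch:

    import os, time
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    torch.set_num_threads(os.cpu_count())   # make sure PyTorch isn't stuck on one core

    path = "path/to/7B-hf"                  # placeholder: local converted checkpoint
    tok = AutoTokenizer.from_pretrained(path)
    model = AutoModelForCausalLM.from_pretrained(
        path,
        torch_dtype=torch.bfloat16,         # ~half the RAM of fp32; CPU bf16 support varies
        low_cpu_mem_usage=True,             # avoids the 2x RAM spike while loading
    )

    inputs = tok("The capital of France is", return_tensors="pt")
    t0 = time.time()
    out = model.generate(**inputs, max_new_tokens=20)
    print((time.time() - t0) / 20, "seconds/token")
    print(tok.decode(out[0]))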


Why 3 words per minute as opposed to per second? Is that a typo? If you have enough RAM (but not VRAM), does it basically become limited by the PCIe lanes? So for the 112GB model on a Gen 5 GPU (64 GB/s of PCIe bandwidth), that would be roughly 2 seconds per word, right?


If the read speed is 1GB/s, then it takes 20s to infer across a 20GB model. That's 3 tokens a minute.


Yeah I'm not sure what kind of math I was subscribed to yesterday, thanks


:-) I keep saying that we don't have to stop AI from hallucinating, we only need to bring the rate to below human level.


True, and that's why there is a project that is using volunteered, distributed GPUs to run BLOOM/BLOOMZ: https://github.com/bigscience-workshop/petals, http://chat.petals.ml.


I've tried this but compared to ChatGPT... let's just say it's in a different league.


No need to spend $30k, use Azure or AWS.


Yep. It’s expensive to spin up an A100 80GB instance, but not THAT expensive. Oracle's cloud offering (the first thing to show up in a Google search; I know you probably won’t use them, and it seems extra expensive) is $4.00 per hour. If you are motivated to screw around with this stuff, there are definitely options.


> If you are motivated to screw around with this stuff, there are definitely options.

Erm, for inference, that is. Training is definitely out of the question for individuals, I believe (unless you use much smaller models?).


The GCP spot price for an A100 80GB GPU is only $1.25/hour, and they give you $300 of credit when you open a new account.


Unless it's for something you want to happen whenever and you don't mind it being dumped mid-process, shouldn't we look at on-demand prices, not spot prices?


It’s fine for testing it out or serving short queries. Your data can still remain on a persistent volume if the VM gets yanked.


I can see some issues with uploading a leaked model to a cloud provider.


You can - slowly - run BLOOM 3b and 7b1 on the free (trial) tiers of Google Cloud Compute if you use the low_cpu_mem_usage parameter of from_pretrained.
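
Roughly like this, in case it saves anyone a search (bloom-3b shown because it fits the trial-tier RAM more comfortably than 7b1):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "bigscience/bloom-3b"
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(
        name,
        low_cpu_mem_usage=True,  # stream weights in instead of materialising a second full copy
    )

    inputs = tokenizer("The BLOOM language model is", return_tensors="pt")
    print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))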



