
How can you calculate required VRAM from precision and parameter count?


Realistically you probably just want to look at the file size on Hugging Face and add ~2 GB for OS/Firefox tabs, plus a bit for context (depends, but let's say 1-2 GB).

The direct parameter-count conversion math tends to be much less reliable than one would expect once quants are involved.

e.g.

7B @ Q8 = 7.1 GB [0]

30B @ Q8 = 34.6 GB [1]
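
A minimal sketch of that heuristic in Python (the overhead figures are just the rough assumptions above, not measured values):

    # Rough total-VRAM estimate from the GGUF file size, per the heuristic above.
    # os_overhead_gb and context_headroom_gb are assumed values, not measurements.
    def vram_estimate_gb(file_size_gb, os_overhead_gb=2.0, context_headroom_gb=1.5):
        return file_size_gb + os_overhead_gb + context_headroom_gb

    print(vram_estimate_gb(7.1))   # 7B @ Q8  -> ~10.6 GB
    print(vram_estimate_gb(34.6))  # 30B @ Q8 -> ~38.1 GB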

Btw, you can also roughly estimate expected output speed if you know the device's memory throughput: each generated token has to stream all of the weights, so bandwidth divided by model size gives an upper bound on tokens per second. Note that this doesn't work for MoEs, which only read the active experts per token.
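
For example, a rough bandwidth-bound sketch (the bandwidth figure is hypothetical, and this is only an upper bound for dense models):

    # Decode speed is roughly bandwidth-bound: each generated token streams all
    # of the weights once, so bandwidth / model size bounds tokens per second.
    # Doesn't hold for MoE models, which only read the active experts per token.
    def tokens_per_second(bandwidth_gb_s, model_size_gb):
        return bandwidth_gb_s / model_size_gb

    print(tokens_per_second(1000, 7.1))  # ~140 tok/s upper bound for the Q8 7B above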

I also recently discovered that in CPU mode llama.cpp memory-maps the model file; for some models it loads less than a quarter of it into RAM.

[0] https://huggingface.co/TheBloke/Llama-2-7B-GGUF/tree/main

[1] https://huggingface.co/TheBloke/LLaMA-30b-GGUF/tree/main


Rule of thumb is parameter_count * precision in bytes (i.e. bits / 8). Precision can be 32, 16, 8, or 4 bits. 32-bit is sometimes used in training (though less now), and rarely in inference. For a while now, "full" precision has been 16-bit (fp16, bf16); fp8 is 8-bit, int4 is 4-bit, and so on. Anything below "full" precision is also known as quantised: fp8 is a quantised version of the "full" model.
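
Napkin-math version of that rule of thumb (parameter count in billions, precision in bits):

    # Weight memory = parameter_count * bits / 8 bytes.
    def weight_memory_gb(params_billion, bits):
        return params_billion * bits / 8  # billions of params * bytes/param ~= GB

    for bits in (16, 8, 4):
        print(f"7B @ {bits}-bit: ~{weight_memory_gb(7, bits):.1f} GB")
    # 7B @ 16-bit: ~14.0 GB, 7B @ 8-bit: ~7.0 GB, 7B @ 4-bit: ~3.5 GB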

So quick napkin math gives you the VRAM needed just to load the model: 7B can be ~14 GB at full precision, ~7 GB in fp8, and ~3.5 GB in 4-bit (AWQ, int4, q4_k_m, etc.). But that's just to load the model into VRAM. You also need some headroom to run inference, and there are a lot of things to consider there too: you need to run a forward pass over the required context, you'll usually keep a KV cache to speed up decoding, you might run multiple sessions in parallel, and so on.
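
A hedged sketch of the KV-cache part, assuming a Llama-2-7B-like config (32 layers, 32 KV heads, head_dim 128, fp16 cache); models with GQA or a quantised cache will come in smaller:

    # KV cache per token = 2 (K and V) * layers * kv_heads * head_dim * bytes/element.
    def kv_cache_gb(context_len, n_layers=32, n_kv_heads=32, head_dim=128, bytes_per_el=2):
        per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_el
        return context_len * per_token_bytes / 1e9

    print(kv_cache_gb(4096))  # ~2.1 GB for a 4k context, per sequence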

Context length is important to take into account because images take a lot of tokens. So what you could do with a 7b LLM at full precision on a 16GB VRAM GPU might not be possible with a VLM, because the context of your query might not fit into the remaining 2GB.
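
Rough check of whether a given context fits in the leftover VRAM, reusing the ~0.5 MB/token figure from the Llama-2-7B-like sketch above (image token counts depend entirely on the vision encoder):

    # How many tokens of KV cache fit in the VRAM left after loading the weights?
    # Ignores activation memory for the forward pass, so treat it as optimistic.
    def max_context_tokens(free_vram_gb, per_token_mb=0.5):
        return int(free_vram_gb * 1000 / per_token_mb)

    print(max_context_tokens(2.0))  # ~4000 tokens of headroom left on a 16 GB card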


A float16 is 2 bytes. 7B * 2 bytes = 14GB. I can't say if that's an accurate number, but that's almost certainly how tonii141 calculated it.


Oh, so FP16 means FloatingPoint16? I'm glad to learn something today, thanks!



