
Something I haven't figured out: should I think about these memory requirements as comparable to the baseline memory an app uses, or like per-request overhead? If I needed to process 10 prompts at once, do I need 10x those memory figures?


It’s like a database, I imagine - so the answer is probably that you’re unlikely to need that much memory per request, and instead you run out of cores to handle requests?

You need to load the model data so the graphics cards - where the compute is - can use it to answer queries. But you don’t need a separate copy of the data for each GPU core, and, though it’s slower, cards can share RAM. Even with parallel cores, your server can only process so many queries at a time before it runs out of compute resources. Each query isn’t instant either: GPT-4 answers stream in real time yet still take a minute or so. Plus, the way the cores work, it likely takes more than one core to answer a given question - likely hundreds of cores computing probabilities in parallel or something.
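If it helps to see the shape of the “one copy of the data, many queries” idea, here’s a minimal sketch assuming PyTorch; the single linear layer is just a stand-in for a real model:

    import torch

    model = torch.nn.Linear(4096, 4096)    # weights loaded once into memory
    batch = torch.randn(10, 4096)          # 10 concurrent "requests" stacked into one batch
    out = model(batch)                     # one forward pass serves the whole batch
    print(out.shape)                       # torch.Size([10, 4096])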

I don’t actually know any of the details myself, but I did do some CUDA programming back in the day. The expensive part is often that the GPU doesn’t share memory with the CPU: to get any value at all from the GPU when processing data at speed, you have to transfer all the data to GPU RAM before doing anything with the GPU cores…
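For example, a rough sketch of that host-to-device transfer, assuming PyTorch and a CUDA-capable GPU (the sizes are made up):

    import time
    import torch

    data = torch.randn(4096, 4096)          # ~64 MB of fp32 data sitting in CPU RAM
    torch.cuda.synchronize()
    start = time.time()
    gpu_data = data.to("cuda")              # copy over PCIe into GPU RAM
    torch.cuda.synchronize()                # wait for the async copy to finish
    print(f"transfer took {time.time() - start:.4f}s")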

Things probably change quite a bit with a system-on-a-chip design, where memory and CPU/GPU cores are closer together, of course. The slow part of offloading CPU work to the GPU always seemed to be transferring data to the GPU, which is why some have suggested embedding the GPU directly on the motherboard in the CPU’s place and putting the CPU and USB on the graphics card instead.

Come to think of it, an easier way to frame it: how much work can you do in parallel on your laptop before you need another computer to scale the workload? It’s probably like that. Requests likely take different amounts of computation - some words might be easier to compute than others, maybe the data is local and faster to access, or the probability is effectively 100%, or something. I bet it’s been easier to toss more cloud machines at the problem than to work out how it might scale more efficiently, too.


Does that mean an iGPU would be better than a dGPU? A beefier version than today’s, though.


Sort of. The problem with most integrated GPUs is that they don’t have as many dedicated processing cores, and the RAM, shared with the system, is often slower than on dedicated graphics cards. Also, with the exception of system-on-a-chip designs, traditional integrated graphics reserved a chunk of memory for graphics use and still had to copy to/from it. I believe with newer system-on-a-chip designs we’ve seen graphics APIs, e.g. on macOS, that can work with data in a zero-copy fashion. But in the trade-off between a few larger integrated graphics cores and the thousands of cores on a dedicated card, lots of small cores tends to scale better. So there’s a limit to how far two dozen beefy cores can take you vs tens of thousands of dedicated tiny gfx cores.

The theoretical best approach would be to integrate lots of GPU cores on the motherboard alongside very fast memory/storage combos such as Optane, but reality is very different, because we also want portable, replaceable parts and have to worry about silly things like the cooling trade-off between placing components closer together for data efficiency and spacing them far enough apart that the metal doesn’t melt from the power demands in such a small space. And whenever someone says “this is the best graphics card,” someone inevitably comes up with a newer arrangement of transistors that is even faster.


You need roughly model size + (n * (prompt + generated text)), where n is the number of parallel users/requests.
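As a back-of-envelope sketch in Python (all the numbers below are hypothetical, e.g. a 7B-parameter model in fp16 at ~2 bytes per parameter):

    model_size_gb = 14.0      # weights, loaded once
    per_request_gb = 0.5      # assumed prompt + generated-text state per request
    n = 10                    # parallel users/requests

    total_gb = model_size_gb + n * per_request_gb
    print(f"~{total_gb:.1f} GB total")   # one copy of the weights plus per-request state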


It should be noted that the last part has a pretty large factor attached to it that also scales with model size, because to run transformers efficiently you cache some of the intermediate activations from the attention block (the KV cache).

The factor is basically 2 * number of layers * embedding dimension, in values (e.g. fp16) stored per token - the 2 covering keys and values.
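A quick sketch of what that works out to, using hypothetical 7B-class dimensions (32 layers, 4096 embedding size, fp16):

    layers = 32
    d_model = 4096
    bytes_per_value = 2                                       # fp16

    bytes_per_token = 2 * layers * d_model * bytes_per_value  # keys + values per token
    context = 4096                                            # prompt + generated tokens
    print(f"{bytes_per_token * context / 2**20:.0f} MiB of cache per request")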



