You are almost right: \* The routing happens in every feedforward layer (32 of t...

miven · on Dec 11, 2023

I wonder what would be the most efficient tactic for offloading select layers of such a model to a GPU within a memory-constrained system

As far as I understand usually layer offloading in something like llama.cpp loads the first few consecutive layers to VRAM (the remainder being processed in the CPU) such that you don't have too much back and forth between the CPU and GPU.

I feel like such an approach would lead to too much wasted potential in terms of GPU work when applied to a SMoE model, but on the other hand offloading non-consecutive layers and bouncing between the two processing units too often may be even slower...

michaelt · on Dec 11, 2023

As I understand things, these LLMs are mostly constrained by memory bandwidth. A respectable desktop CPU like the Intel Core i9-13900F has a memory bandwidth of 89.6 GB/s [1]

An nvidia 4090 has a memory bandwidth of 1008 GB/s [2] i.e. 11x as much.

Using these together is like a parcel delivery which goes 10 miles by formula 1 race car, then 10 miles on foot. You don't want the race car or the handoff to go wrong, but in terms of the total delivery time they're insignificant compared to the 10 miles on foot.

I'm not sure there's much potential for cleverness here, unless someone trains a model specifically targeting this use case.

[1] https://www.intel.com/content/www/us/en/products/sku/230502/... [2] https://www.notebookcheck.net/NVIDIA-GeForce-RTX-4090-GPU-Be...

bloopernova · on Dec 11, 2023

Kind of exciting to think how much faster memory may soon become. Especially with Apple M series for AMD and Intel to compete with for AI workloads.

65a · on Dec 11, 2023

FWIW, server parts from Intel and AMD are already pretty fast, e.g. octo-channel Sapphire Rapids does something on the order of 300GB/s: https://www.ixpug.org/images/docs/ISC23/McCalpin_SPR_BW_limi...

Jackson__ · on Dec 11, 2023

>if you are VRAM constrained

So this is a perfect model architecture for the alternate realities where nvidia decided to scale up VRAM instead of compute first? I'll let them know over trans-dimensional text message.

Also if quantization scales similar per 7b expert as seen in dense LLMs, i.e. the bigger the model, the lower the perplexity loss, this could be the worst performing model at <=4bits compared to anything else currently available :(

-A very sad 24gb 3090 user.

sebzim4500 · on Dec 12, 2023

MoE is a great architecture if you are running the model at scale. When you put different layers on different machines, the VRAM used for the parameters doesn't matter that much but the inference compute really does.

That's why the SOTA proprietary models are probably all MoE (GPT-3.5/4, palm, gemini, etc.) but until recently no open models were.

gorbypark · on Dec 11, 2023

Yeah, now that you say it, it does make sense that all of the params would need to be loaded into VRAM (otherwise it's be really slow swapping between models all the time). I guess the tokens per second would be super fast when comparing inference on a 12B and 45B model, though.