
Not sure about DeepSeek R1, but you are right with regard to previous MoE architectures.

It doesn't reduce memory usage, as each subsequent token might require a different expert, but it reduces per-token compute/bandwidth usage. If you place experts on different GPUs and run batched inference, you would see these benefits.
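
A rough back-of-the-envelope (the 671B total / ~37B active split is the figure DeepSeek quotes for V3; the FP8 byte count is just for illustration):

  # Sketch: why MoE cuts per-token compute/bandwidth but not memory.
  # DeepSeek-V3 reports ~671B total parameters but only ~37B activated per token.
  total_params    = 671e9   # every expert has to sit in (V)RAM somewhere
  active_params   = 37e9    # only the routed experts touch a given token
  bytes_per_param = 1       # assume FP8 weights, purely for illustration

  print(f"weights resident in memory: ~{total_params * bytes_per_param / 1e9:.0f} GB")
  print(f"weights read per token:     ~{active_params * bytes_per_param / 1e9:.0f} GB")
  # Memory footprint scales with *total* params; per-token bandwidth/compute
  # scales with *active* params -- hence sharding experts across GPUs and batching.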



Is there a concept of an expert that persists across layers? I thought each layer was essentially independent in terms of the "experts". I suppose you could look at what part of each layer was most likely to trigger together and segregate those by GPU though.

I could be very wrong on how experts work across layers though, I have only done a naive reading of it so far.


  I suppose you could look at what part of each layer was most likely to trigger together and segregate those by GPU though
Yes, I think that's what they describe in section 3.4 of the V3 paper. Section 2.1.2 talks about "token-to-expert affinity". I think there's a layer which calculates these affinities (between a token and an expert) and then sends the computation to the GPUs with the right experts.
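
My mental model of that affinity/gating step, as a toy sketch in Python (names and shapes are mine, not from the paper; V3 routes each token to 8 of 256 routed experts):

  import numpy as np

  # Toy gating: score each token's hidden state against a per-expert centroid,
  # keep the top-k experts, and treat that as the dispatch plan.
  n_tokens, d_model, n_experts, top_k = 4, 16, 256, 8
  rng = np.random.default_rng(0)

  hidden    = rng.standard_normal((n_tokens, d_model))
  centroids = rng.standard_normal((n_experts, d_model))   # one per expert

  affinity    = hidden @ centroids.T                 # (n_tokens, n_experts)
  top_experts = np.argsort(-affinity, axis=1)[:, :top_k]

  for t, experts in enumerate(top_experts):
      print(f"token {t} -> experts {sorted(experts.tolist())}")
  # In a deployment, each expert id maps to a GPU (or group of GPUs), so this
  # table is effectively "which GPUs does this token's FFN run on".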

This doesn't sound like it would work if you're running just one chat, as you need all the experts loaded at once if you want to avoid spending lots of time loading and unloading models. But at scale with batches of requests it should work. There's some discussion of this in 2.1.2 but it's beyond my current ability to comprehend!


Ahh got it, thanks for the pointer. I am surprised there is enough correlation there to allow an entire GPU to be specialized. I'll have to dig into the paper again.


It does. They have 256 experts per MLP layer, and some shared ones. The minimal deployment for decoding (a.k.a. token generation) they recommend is 320 GPUs (H800). It's all in the DeepSeek v3 paper, which everyone should read rather than speculating.
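
A quick even-spread calculation to get a feel for the scale (the layer/expert counts are from my memory of the V3 config, so double-check them against the paper; the actual decoding deployment is more elaborate, with redundant and shared experts):

  # Rough expert-placement arithmetic for V3 (even spread, which is not
  # exactly what the paper describes).
  n_layers       = 61     # transformer layers in V3
  dense_layers   = 3      # the first few layers use a dense FFN, not MoE
  routed_experts = 256    # routed experts per MoE layer (plus shared ones)
  gpus           = 320    # minimal decoding deployment mentioned above

  total_experts = (n_layers - dense_layers) * routed_experts
  print(f"{total_experts} routed experts, ~{total_experts / gpus:.0f} per GPU")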


Got it. I'll review the paper again for that portion. However, it still sounds like the end result is not VRAM savings but efficiency and speed improvements.


Yeah, if you look at the DeepSeek v3 paper more closely, the saving on each axis is understandable. Combined, they reach some magic number people can talk about (10x!): FP8: ~1.6 to 2x faster than BF16 / FP16; MLA: cuts KV cache size by 4x (I think); MTP: converges 2x to 3x faster; DualPipe: maybe ~1.2 to 1.5x faster.

If you look deeper, many of these are only applicable to training (we already do FP8 for inference, MTP is there to improve training convergence, and DualPipe is for overlapping communication / compute, mostly for training purposes too). The efficiency improvement on inference is, IMHO, overblown.
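
Just composing the training-side factors above gives roughly that headline range (very rough, since these factors don't multiply cleanly in practice):

  # Rough composition of the factors listed above; they don't really compose
  # this neatly, but it shows where a "10x" headline can come from.
  fp8      = (1.6, 2.0)   # vs BF16/FP16 throughput
  mtp      = (2.0, 3.0)   # faster convergence => fewer training steps
  dualpipe = (1.2, 1.5)   # communication/compute overlap

  low  = fp8[0] * mtp[0] * dualpipe[0]
  high = fp8[1] * mtp[1] * dualpipe[1]
  print(f"combined training speedup: ~{low:.1f}x to ~{high:.1f}x")
  # ~3.8x to ~9.0x, plus the ~4x KV-cache saving from MLA on the memory side.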


  we already do FP8 for inference
Yes, but for a given model size, DeepSeek claims that a model trained natively in FP8 will work better than a model quantized to FP8 after training. If that's true, then for a given quality, a native FP8 model will be smaller and have cheaper inference.


I don't think an entire GPU is specialised, nor that a single token will use the same expert at every layer. I think of it as a gather-scatter operation at each layer.

Let's say you have an inference batch of 128 chats. At layer `i` you take the hidden states, compute their routing, and scatter them (along with the KV for those layers) among the GPUs, each one handling different experts. The attention and FF happen on those GPUs (since the model params live there), and the results get gathered again.

You might be able to avoid the gather by performing the routing on each of the GPUs, but I'm generally guessing here.
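
For what it's worth, here's a toy single-process version of that per-layer scatter/gather (expert weights, shapes, and the softmax-over-top-k weighting are made up for illustration; real systems do the dispatch with all-to-all collectives across GPUs):

  import numpy as np

  # Toy per-layer dispatch: route each token to its top-k experts, group the
  # work by expert (stand-in for "by GPU"), run the expert FFNs, and gather
  # the weighted results back into place.
  n_tokens, d_model, n_experts, top_k = 128, 32, 8, 2
  rng = np.random.default_rng(0)

  hidden    = rng.standard_normal((n_tokens, d_model))
  centroids = rng.standard_normal((n_experts, d_model))
  expert_w  = rng.standard_normal((n_experts, d_model, d_model)) * 0.02

  scores  = hidden @ centroids.T                          # routing affinities
  top     = np.argsort(-scores, axis=1)[:, :top_k]        # chosen experts
  weights = np.exp(scores[np.arange(n_tokens)[:, None], top])
  weights /= weights.sum(axis=1, keepdims=True)           # softmax over top-k

  output = np.zeros_like(hidden)
  for e in range(n_experts):                              # "scatter" by expert
      tok_idx, slot = np.nonzero(top == e)
      if tok_idx.size == 0:
          continue
      expert_out = hidden[tok_idx] @ expert_w[e]          # would run on GPU e
      output[tok_idx] += weights[tok_idx, slot, None] * expert_out  # "gather"

  print(output.shape)   # (128, 32): every token got its mix of expert outputs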


  If you place experts on different GPUs
Right, this is described in the Deepseek V3 paper (section 3.4 on pages 18-20).



