Hacker News

Got it. I’ll review the paper again for that portion. However, it still sounds like the end result is not VRAM savings but efficiency and speed improvements.


Yeah, if you look at the DeepSeek v3 paper more closely, the saving on each axis is individually understandable. Combined, they reach some magic number people can talk about (10x!): FP8: ~1.6 to 2x faster than BF16 / FP16; MLA: cuts KV cache size by 4x (I think); MTP: converges 2x to 3x faster; DualPipe: maybe ~1.2 to 1.5x faster.
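A quick back-of-envelope sketch, taking the per-axis ranges above at face value (these are the comment's rough multipliers, not measured numbers, and MLA's cache saving is a memory factor rather than a throughput one, so I leave it out of the product):

```python
# Multiply the quoted per-axis training speedups to see where
# a headline number in the high single digits could come from.
fp8 = (1.6, 2.0)       # FP8 vs BF16/FP16 throughput
mtp = (2.0, 3.0)       # MTP convergence speedup
dualpipe = (1.2, 1.5)  # communication/compute overlap

low = fp8[0] * mtp[0] * dualpipe[0]
high = fp8[1] * mtp[1] * dualpipe[1]
print(f"combined: {low:.1f}x to {high:.1f}x")  # combined: 3.8x to 9.0x
```

The upper end of the product lands near the "10x" people quote, which is the point: no single trick gets you there, the factors compound.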

If you look deeper, many of these are only applicable to training (we already do FP8 for inference, MTP is meant to improve training convergence, and DualPipe overlaps communication and compute, mostly for training purposes too). The efficiency improvement on inference is IMHO overblown.


  we already do FP8 for inference
Yes, but for a given model size, DeepSeek claims that a model trained natively in FP8 will work better than a model quantized to FP8 after training. If that's true then, for a given quality, a native FP8 model can be smaller and have cheaper inference.
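To make the memory side of that concrete, here's an illustrative sketch assuming 1 byte per parameter for FP8 and 2 bytes for BF16 (a simplification: it ignores quantization scales, activations, and KV cache; the parameter count is hypothetical):

```python
# Rough weight-memory comparison for the same parameter count.
params_b = 70            # hypothetical model size, billions of params
bf16_gb = params_b * 2   # BF16: 2 bytes/param
fp8_gb = params_b * 1    # FP8:  1 byte/param
print(f"BF16 weights: {bf16_gb} GB, FP8 weights: {fp8_gb} GB")
# BF16 weights: 140 GB, FP8 weights: 70 GB
```

So the weights halve either way; the claimed advantage of native FP8 training is that you get that halving without the quality loss a post-training quantization would incur.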




