I was about to post something similar. While the research is interesting, it doe...

timschmidt · 2025-04-06T10:50:16 1743936616

> it doesn’t offer any advantages over 3- or 4-bit quantization.

"zero-shot accuracy retention at 4- and 3-bit compression to be on par with or better than state-of-the-art methods, while maintaining performance comparable to FP16 baselines."

My reading is that it gets FP16-comparable accuracy at Q3 or Q4 size and memory bandwidth, which is a huge advantage.
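To put rough numbers on the size argument, here is a back-of-the-envelope sketch. It counts only the raw weight bits and ignores any per-block overhead (seeds, scales, and so on), so treat the results as lower bounds. The function name `weight_gigabytes` is made up for illustration.

```python
def weight_gigabytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal gigabytes (10**9 bytes).

    Assumption: only raw weight bits are counted; per-block metadata
    such as seeds or scales is ignored.
    """
    return n_params * bits_per_weight / 8 / 1e9


for name, n_params in [("8B", 8e9), ("70B", 70e9)]:
    # Compare FP16 against 4-bit and 3-bit storage for the same model.
    sizes = {bits: weight_gigabytes(n_params, bits) for bits in (16, 4, 3)}
    print(name, sizes)
```

Since token generation is typically limited by how fast the weights can be read from memory, a 4x smaller weight footprint cuts the bytes read per token by about the same factor.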

kingsleyopara · 2025-04-06T11:02:55 1743937375

For zero-shot accuracy from Table 3:

* LLaMA 3 8B: baseline 72.26, 4-bit 71.31, 3-bit 62.79

* LLaMA 3 70B: baseline 79.51, 4-bit 78.06, 3-bit 74.68

These results seem comparable to modern quantization methods—for example, the ~4-bit results for smaller LLaMA models listed here: https://ai.meta.com/blog/meta-llama-quantized-lightweight-mo...
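The numbers quoted above are easier to compare as a fraction of the baseline. A quick sketch, using only the figures from this comment (the `retention` helper is hypothetical):

```python
# Zero-shot accuracy figures as quoted above from Table 3.
table3 = {
    "LLaMA 3 8B": {"baseline": 72.26, "4-bit": 71.31, "3-bit": 62.79},
    "LLaMA 3 70B": {"baseline": 79.51, "4-bit": 78.06, "3-bit": 74.68},
}


def retention(scores: dict) -> dict:
    """Return each quantized score as a percentage of the baseline score."""
    base = scores["baseline"]
    return {
        name: round(100 * value / base, 1)
        for name, value in scores.items()
        if name != "baseline"
    }


for model, scores in table3.items():
    print(model, retention(scores))
```

By this measure 4-bit keeps roughly 98-99% of baseline accuracy on both models, while 3-bit keeps about 87% on the 8B model and about 94% on the 70B model.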

timschmidt · 2025-04-06T11:17:07 1743938227

I don't see any comparable numbers on the page you linked. It seems to only have numbers for 1B and 3B parameter models. The comparisons to AWQ and OmniQuant in Table 3 seem quite favorable, with SeedLM showing 10%-50% better performance.

Also seems like the techniques may be possible to combine.

_0ffh · 2025-04-06T17:27:05 1743960425

As a rule of thumb, the bigger the model is, the more gracefully it degrades under quantisation. So you may assume the performance loss for an 8B model would be lower than for a 3B model. (I know that doesn't make up for the missing numbers in the link, just fyi.)

jsenn · 2025-04-06T13:49:26 1743947366

I think the main advantage is that you can compute the extra parameters (the PRNG seeds) from the network weights alone, whereas most other quantization methods require simulating the quantization procedure at training time (Quantization-Aware Training) or setting them from a calibration dataset (Post-Training Quantization).
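A toy sketch of what "computing the seeds from the weights alone" can look like: for each block of weights, try a range of seeds, generate a pseudo-random basis from each, and keep the seed whose basis reconstructs the block best. This is not the paper's implementation. It assumes NumPy's `default_rng` as the generator and unquantized least-squares coefficients, and `compress_block`, `decompress_block`, `n_seeds`, and `rank` are names made up for illustration.

```python
import numpy as np


def compress_block(w: np.ndarray, n_seeds: int = 256, rank: int = 4):
    """Pick the seed whose pseudo-random basis best reconstructs w.

    Assumption: a Gaussian basis from NumPy's generator stands in for
    whatever generator a real implementation would use, and the
    coefficients are left unquantized.
    """
    best = None
    for seed in range(n_seeds):
        # The basis is fully determined by the seed, so it never needs storing.
        U = np.random.default_rng(seed).standard_normal((w.size, rank))
        # Least-squares coefficients t minimizing ||U @ t - w||.
        t, *_ = np.linalg.lstsq(U, w, rcond=None)
        err = float(np.linalg.norm(U @ t - w))
        if best is None or err < best[2]:
            best = (seed, t, err)
    return best  # (seed, coefficients, reconstruction error)


def decompress_block(seed: int, t: np.ndarray, block_size: int) -> np.ndarray:
    """Regenerate the basis from the seed and rebuild the weight block."""
    U = np.random.default_rng(seed).standard_normal((block_size, t.size))
    return U @ t


# Usage: only the weights themselves are needed, no calibration data.
w = np.random.default_rng(123).standard_normal(8)
seed, t, err = compress_block(w)
w_hat = decompress_block(seed, t, w.size)
print(seed, err)
```

The search only ever looks at `w`, which is the point being made above: no training loop and no calibration set are involved, just a per-block fit.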

hedgehog · 2025-04-06T17:59:55 1743962395

This technique has three significant advantages over popular low-bit quantization: 1) it retains more accuracy, 2) it does not require calibration data, 3) it's easier to implement in hardware.