Hacker News | theanonymousone's comments

I hope over-16s will be next.

In OpenRouter, there is an "int4" tag for the Moonshot provider of Kimi K2.7 Code. Isn't that too low, particularly coming from the very developer of the model? Is that a mistake? How is it in their direct API offering?

The model is natively quantized (i.e. it was trained that way in the first place, so this is not a post-training quantization which degrades performance).

Isn't it only partially quantized? I thought there were some dense parts, but most of it is int4?

Often in MoE models the experts are quantized while the shared portions, being a much smaller part of the network with greater impact, are kept at higher or full precision. Not familiar with the Kimi QAT approach specifically but it's likely they do this.

But the huggingface link mentions BF16, F16, and I32?

Not every weight is quantized. For example, weights that don't take much space or are highly important are left in higher precision. State-of-the-art weight quantization is never done uniformly (i.e. applied to all weights in the same way).

I don't believe safetensors has a native int4 dtype, so they packed 4 int4s into a bf16 in this checkpoint.
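To make the packing idea concrete, here is a minimal sketch of storing four signed 4-bit values in one 16-bit word and reading them back. The function names and the nibble order (lowest nibble first) are assumptions for illustration only; the actual container dtype and layout used in the checkpoint may well differ.

```python
def pack_int4(values):
    """Pack signed 4-bit integers (-8..7) into 16-bit words, four per word."""
    if len(values) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    words = []
    for i in range(0, len(values), 4):
        word = 0
        for j, v in enumerate(values[i:i + 4]):
            if not -8 <= v <= 7:
                raise ValueError(f"{v} does not fit in int4")
            # Keep the two's-complement nibble; lowest nibble comes first (assumed order).
            word |= (v & 0xF) << (4 * j)
        words.append(word)
    return words


def unpack_int4(words):
    """Inverse of pack_int4: recover the signed 4-bit values."""
    values = []
    for word in words:
        for j in range(4):
            nibble = (word >> (4 * j)) & 0xF
            # Nibbles 8..15 represent the negative values -8..-1.
            values.append(nibble - 16 if nibble >= 8 else nibble)
    return values


weights = [-8, -1, 0, 7, 3, -4, 5, 2]
packed = pack_int4(weights)
restored = unpack_int4(packed)
```

Eight int4 values end up in two 16-bit words, which is why a tensor can be listed under a wider dtype even though the logical weights are 4-bit.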

"You're absolutely right"?

"You hit the nail on the head" LOL

For me it is a tool I make available to an LLM so that it can give correct answers to a certain category of questions, instead of hallucinating nonsense.

P.S. I was casually searching for "sandboxed Python" for an experiment I'm working on, and reached this article that was published "today". Very nice coincidence! Thanks.

I fully agree with the title (with reservation for "dialects"), and I believe the same can be said for JSON and Markdown, among possibly others.

Have you seen 8-bit quantisation matter a lot? The "consensus" in r/LocalLLaMA is that the loss is tolerable down to 4 bits.


Absolutely. The difference between Q6 and Q8 is not as immediately noticeable, but if I test by starting from a blank-slate context and giving it the same complicated task with a Q4 vs a Q8 GGUF file loaded, the difference is apparent. The Q4 will struggle or do 'stupid' things with even simple Bash or Python. Q4 might not be as noticeable for purely conversational, one-on-one text interaction with an LLM, but when you dig into something that's more esoteric in a training dataset than a chat conversation, there is absolutely a big gap.

I think some of the folks in the local LLM social media communities are using them for things like company-hosted customer service chatbots, or purely English text writing where Q4 will probably not cause a problem. For more discrete technical work I stick pretty much exclusively to Q8.
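As a rough illustration of why bit width matters, the sketch below round-trips the same random weights through 4-bit and 8-bit symmetric round-to-nearest quantization and compares the reconstruction error. This is a deliberately simplified scheme with one scale for the whole list, not the block-wise GGUF Q4/Q8 formats discussed above, and the Gaussian weights are synthetic.

```python
import random


def quantize_dequantize(weights, bits):
    """Symmetric round-to-nearest quantization with a single shared scale."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4 bits, 127 for 8 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Quantize to an integer level, clamp, then map back to a float.
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]


def rms_error(original, restored):
    """Root-mean-square difference between two equally long lists."""
    total = sum((a - b) ** 2 for a, b in zip(original, restored))
    return (total / len(original)) ** 0.5


random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(4096)]  # synthetic weights
err4 = rms_error(weights, quantize_dequantize(weights, 4))
err8 = rms_error(weights, quantize_dequantize(weights, 8))
print(f"RMS error: 4-bit {err4:.4f}, 8-bit {err8:.4f}")
```

With 7 levels per side instead of 127, the 4-bit step size is about 18 times coarser, so its rounding error is correspondingly larger. How much that error hurts a real model on a given task is a separate question that only evaluation can answer.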


Thanks a lot. How about Q8 vs FP16/BF16? Have you checked them too?


I have not spent a lot of time running FP16 'full precision' versions of things, but as the other commenter says, there's not much difference. There's a really wide array of benchmarks and tests from third parties unrelated to the trainers of the AI models, showing at most a two percent difference in score and capability between BF16 and Q8.


A Q8 quant shows very little fall-off in terms of KLD (KL divergence) against the lab's 16-bit release. If you have the memory for a BF16 KV-cache (which is usually easier to stomach), then Q8 is very close. But even a Q8 quant model with a Q8 KV-cache is very close.

Smaller quants for the model start to fall off, but more importantly, smaller KV-cache quants fall off much faster, so avoid anything below Q8 there.
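For readers unfamiliar with the KLD metric mentioned above, here is a minimal sketch of how it can be computed for a single token position: convert both models' logits to probabilities and take KL(reference || quantized). The logit values are made up for illustration, and real evaluations average this over many token positions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; q is floored at eps to avoid division by zero."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical next-token logits over a 4-token vocabulary.
reference_logits = [2.0, 1.0, 0.1, -1.0]   # stand-in for the 16-bit model
quantized_logits = [1.9, 1.1, 0.0, -0.8]   # stand-in for a quantized model

p_ref = softmax(reference_logits)
p_quant = softmax(quantized_logits)
kld = kl_divergence(p_ref, p_quant)
print(f"KL divergence: {kld:.6f} nats")
```

A value of zero means the quantized model predicts exactly the same distribution as the reference; larger values mean its predictions have drifted further.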


It’s not a general rule, and it depends highly on the model and the quantisation used. Don’t guess: Unsloth sometimes publishes graphs in their tutorials showing error rate vs. file size. Sometimes Q4 is great; other times I go for Q6.


My question as well. Isn't Tencent a very well-known company? Maybe the mystery is in the model itself?


This is a big deal when/if it's working, to me at least. Where can I contribute?


https://github.com/evmar/theseus

Looks like just enough was supported to run minesweeper. Impressive though.


Isn't this link a duplicate? Or am I having déjà vu?

