I actually tried the 4-bit quants (Q4_K_M) and was a bit unimpressed. Switching to Q6_K made a huge difference, but it doesn't fit on my 3090, so it was very slow. And testing on Perplexity's website, which I presume runs fp16, seemed even better, although that might be mostly due to sampler/prompt differences.
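For what it's worth, if the Q6_K file doesn't fit entirely in the 3090's VRAM, llama.cpp can offload only part of the layers to the GPU and keep the rest on the CPU. Something like the following, where the model path, layer count and sampler values are just placeholders to illustrate the flags:

    ./main -m ./models/model-Q6_K.gguf -ngl 30 -c 4096 \
        --temp 0.7 --top-p 0.9 --repeat-penalty 1.1 \
        -p "Your prompt here"

Lowering -ngl until it stops running out of memory is usually still much faster than pure CPU inference, and pinning down the sampler flags matters before comparing against a hosted fp16 service.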
A lot of things are getting fixed if you look at the issues in ggerganov's repo.
To say anything as general as 'the new file format is broken' just means you either don't understand the project basics or aren't following the commits closely.
So? That doesn't mean the format isn't broken at the moment we're using it. I didn't say it wouldn't be fixed in the future. The reality is that the current 4-bit GGUF quants are giving us subpar results compared to other quantization methods. Telling me that "I don't understand the basics" isn't helpful; telling me the exact flags we should be using, or that it's being fixed, would be.
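For anyone following along, going from Q4_K_M up to a higher-bit quant is just a re-quantization of the f16 GGUF with the tool bundled in llama.cpp (file names below are placeholders):

    ./quantize ./models/model-f16.gguf ./models/model-Q6_K.gguf Q6_K

That at least takes the quant level out of the equation when comparing GGUF against other quantization methods.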
Inference with llama.cpp is not trivial and I can't summarise all the parameters in one post. What I'm saying is that, in my opinion, it is wrong to assume that switching from one format to the other is what's causing the degradation.
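If anyone wants to check this objectively rather than by feel, llama.cpp ships a perplexity tool that can be run against each quant of the same model (the model path and test file below are placeholders; wiki.test.raw is the wikitext-2 split used in the repo's examples):

    ./perplexity -m ./models/model-Q4_K_M.gguf -f wiki.test.raw -ngl 30

Comparing the scores for Q4_K_M, Q6_K and f16 of the same model shows how much of the difference comes from the quantization itself rather than from the file format or the samplers.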
llama.cpp underwent some major changes in the last few weeks, and following the commits it took a few days to stabilise. Try it now; it works like a charm. And compared to other inference engines such as tinygrad, it is much more versatile in the options for how it can be run.