Hacker News

Disclaimer: I did not test this yet.

I don't want to make big generalizations, but one thing I've noticed with Chinese models, especially Kimi, is that they do very well on benchmarks but fail on vibe testing. They feel a little over-fitted to the benchmarks and less to real use cases.

I hope it's not the same here.





K2 Thinking has immaculate vibes. Minimal sycophancy and a pleasant writing style while being occasionally funny.

If it had vision and was better on long context I'd use it so much more.


This used to happen with benchmarks on phones: manufacturers would tweak Android so benchmarks ran faster.

I guess that's kinda how it goes for any system that's trained to do well on benchmarks: it does well on them but is rubbish at everything else.


Yes, they turned off all energy-saving measures when benchmarking software was detected, which defeated the point of the benchmarks: your phone is useless if it's very fast but the battery only lasts an hour.
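The trick described above boils down to a simple allowlist check: if the foreground app matches a known benchmark package, switch to maximum performance. A toy sketch in Python; the package names and function names are illustrative, not taken from any real firmware:

```python
# Toy illustration of benchmark detection in a CPU frequency governor.
# Package names are made up for illustration, not from any real firmware.
BENCHMARK_PACKAGES = {"com.example.antutu", "com.example.geekbench"}

def choose_governor(foreground_package: str) -> str:
    """Return the CPU governor to use for the given foreground app."""
    if foreground_package in BENCHMARK_PACKAGES:
        # Disable all energy-saving measures while a benchmark is running,
        # so the device scores far better than it performs in daily use.
        return "performance"
    # Normal apps get the battery-friendly default.
    return "powersave"
```

The same shape of cheat applies to LLM benchmarks: special-case the inputs you are measured on, and the score stops reflecting everyday behavior.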

This was a bad problem with earlier Chinese (Qwen and Kimi K1 in particular) models, but the original DeepSeek delivered and GLM4.6 delivers. They don't diversify training as much as American labs so you'll find more edge cases and the interaction experience isn't quite as smooth, but the models put in work.

I would assume that a huge amount of effort goes into frontier models just to make them nicer to interact with, as that is likely one of the main things driving user engagement.

Weird, I have gone local for the last two years. I use Chinese models 90% of the time: Kimi K2 Thinking, DeepSeek v3 Terminus, Qwen3 and GLM4.6. I'm not vibe testing them but really putting them to use, and they keep up great.

My experience with DeepSeek and Kimi is quite the opposite: smarter than the benchmarks would imply.

Whereas the benchmark gains from new OpenAI, Grok and Claude models don't seem to be accompanied by vibe improvements.


What is "Vibe testing"?

They mean capturing things that benchmarks don't. You can use Claude and GPT-5 back to back in a field they score nearly identically on, and you will still notice several differences. That is the "vibe".
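A minimal blind A/B harness makes that back-to-back comparison concrete. A sketch, where ask_claude and ask_gpt5 are placeholder functions standing in for real API calls, not any actual SDK:

```python
import random

def ask_claude(prompt: str) -> str:
    """Placeholder standing in for a real Claude API call."""
    return f"[model A] answer to: {prompt}"

def ask_gpt5(prompt: str) -> str:
    """Placeholder standing in for a real GPT-5 API call."""
    return f"[model B] answer to: {prompt}"

def blind_pair(prompt: str, rng: random.Random) -> dict:
    """Return both answers under shuffled labels so the human rater
    can't tell which model produced which response."""
    answers = [("claude", ask_claude(prompt)), ("gpt5", ask_gpt5(prompt))]
    rng.shuffle(answers)
    return {"A": answers[0], "B": answers[1]}
```

Shuffling the labels matters: brand preference is strong, and an unblinded rater will "vibe test" the logo rather than the output.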

I would assume that it is testing how well and appropriately the LLM responds to prompts.

This is why I stopped bothering to check out these models and, funnily enough, Grok.


