
> GPT-4 and GPT-o1 still beat everyone else by a significant margin on tasks that require in-depth reasoning

I haven't seen examples of this. Do you know where I could find some?



Here's a fairly simple test that I throw at any model that claims to be "GPT-4 level": https://news.ycombinator.com/item?id=42262661

For more complicated stuff, I did some experiments using LLMs to drive high-level AI decisions in video games. Basically, the model gets a data schema and a question like "what do you do next?", and it can query that schema to retrieve whatever information it thinks it needs to give the best answer. GPT-4, and o1 especially, are consistently the best performers there, both in the richness of the queries they produce and in how they make use of the results.
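To make that concrete, here's a minimal sketch of the kind of loop I mean. Everything in it is made up for illustration: call_llm stands in for whatever chat-completion API you use, and the schema, toy game state, and scripted replies are placeholders, not the ones from my experiments.

    import json

    GAME_SCHEMA = {
        "units": "list of {id, type, hp, position}",
        "resources": "dict of {gold, wood}",
        "visible_enemies": "list of {id, type, position}",
    }

    def call_llm(prompt: str) -> str:
        # Stand-in for a real chat-completion call; replace with your API.
        # The scripted replies below just keep the demo runnable end to end.
        if "RESULT:" in prompt:
            return "ACTION: send idle workers to gather wood"
        return "QUERY: resources"

    def query_game_state(field: str) -> str:
        # Stand-in: look the requested field up in a toy game state.
        state = {
            "resources": {"gold": 120, "wood": 30},
            "units": [{"id": 1, "type": "worker", "hp": 40, "position": [3, 5]}],
            "visible_enemies": [],
        }
        return json.dumps(state.get(field, "unknown field"))

    def decide_next_action(max_queries: int = 5) -> str:
        transcript = [
            "You control a faction in a strategy game.",
            "State schema: " + json.dumps(GAME_SCHEMA),
            "Reply QUERY:<field> to inspect state, or ACTION:<command> once"
            " you have enough information to decide what to do next.",
        ]
        for _ in range(max_queries):
            reply = call_llm("\n".join(transcript)).strip()
            if reply.startswith("ACTION:"):
                return reply[len("ACTION:"):].strip()
            if reply.startswith("QUERY:"):
                field = reply[len("QUERY:"):].strip()
                transcript += [reply, "RESULT:" + query_game_state(field)]
            else:
                transcript.append("Answer with QUERY:... or ACTION:... only.")
        return "idle"  # model never committed to an action

    print(decide_next_action())  # -> send idle workers to gather wood

The interesting signal is in which queries the model chooses to run before committing to an action; weaker models tend to either act blindly or query the same field over and over.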

There's also a bunch of interesting examples along the same lines here: https://github.com/cpldcpu/MisguidedAttention. That said, even top OpenAI models have trouble with much of this stuff.

https://github.com/fairydreaming/farel-bench is another interesting benchmark because it's so simple, and yet look at the score disparity in the last column! It's easy to scale, too: a harder quiz just means a more distant relationship (see the sketch below).
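To show what I mean by easy to scale, here's a toy generator in the same spirit. This is my own rough approximation of the idea, not farel-bench's actual code; the names, prompt wording, and labels are all made up.

    import random

    NAMES = ["Alice", "Bob", "Carol", "Dave", "Erin", "Frank", "Grace", "Heidi"]
    LABELS = ["parent", "grandparent", "great-grandparent",
              "great-great-grandparent"]

    def make_quiz(depth: int) -> tuple[str, str]:
        # Build an ancestor chain of `depth` parent links; deeper = harder.
        people = random.sample(NAMES, depth + 1)
        facts = [f"{a} is the parent of {b}." for a, b in zip(people, people[1:])]
        random.shuffle(facts)  # don't hand the model the chain in order
        question = f"What is {people[0]} to {people[-1]}?"
        return " ".join(facts) + " " + question, LABELS[depth - 1]

    prompt, answer = make_quiz(depth=3)
    print(prompt)  # e.g. "... What is Erin to Bob?"
    print(answer)  # great-grandparent

One knob (the chain depth) controls difficulty, so you can keep cranking it until models separate.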

Unfortunately, we're still at the point in this game where even seemingly trivial and unrelated changes to the prompt (slightly rewording it, or in some cases even changing the capitalization) can have a large effect on output quality. IMO that's a tell-tale sign that the model is operating in "stochastic parrot" mode rather than doing any kind of actual reasoning. Thus benchmarks can be used to screen out the poorly performing models, but they can't reliably predict how well a model will actually do what you need it to do.
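If you want to probe that sensitivity yourself, the check is cheap: feed the model trivially altered variants of one prompt and see whether the answer survives. Again, call_llm is a hypothetical stand-in for your API of choice, and the riddle is just a stock example.

    def call_llm(prompt: str) -> str:
        # Hypothetical stand-in; wire this to your completion API.
        # The canned reply just keeps the script runnable.
        return "9"

    BASE = "A farmer has 17 sheep. All but 9 run away. How many are left?"
    VARIANTS = [
        BASE,
        BASE.lower(),                        # capitalization only
        BASE.replace("run away", "escape"),  # trivial rewording
        "  " + BASE + "  ",                  # stray whitespace
    ]

    answers = [call_llm(v).strip() for v in VARIANTS]
    print("consistent across variants:", len(set(answers)) == 1)

A model that flips its answer on the whitespace or capitalization variants is pattern-matching the surface form, which is exactly the failure mode I'm describing.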



