The major models are not tied in terms of quality. GPT-4 and GPT-o1 still beat e...

mrg3_2013 · on Dec 5, 2024

Exactly. Citing cost has been an AWS play that worked during the early days of cloud, so they are trying to stick to it. It doesn't work in the AI world. No one wants a faster/cheaper model that gives poor results (besides, the cost of frontier models keeps coming down, so these are just dead initiatives IMO).

As for LLMs, my experience with Claude has been much better than with OpenAI models (though my use case is mostly code generation).

xnx · on Dec 4, 2024

> GPT-4 and GPT-o1 still beat everyone else by a significant margin on tasks that require in-depth reasoning

I haven't seen examples of this. Do you know where I could find some?

int_19h · on Dec 4, 2024

Here's a fairly simple test that I throw at any model that claims to be "GPT-4 level": https://news.ycombinator.com/item?id=42262661

For more complicated stuff, I did some experiments using LLMs to drive high-level AI decisions in video games. Basically, it gets a data schema and a question like "what do you do next?", and can query the schema to retrieve the info that it thinks it needs to give the best answer to that. GPT-4 and GPT-o1 especially are consistently the best performers there, both in terms of richness of queries they produce, and how they make use of them.
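
To make that setup concrete, here is a minimal sketch of what such a query loop could look like. It is not the actual experiment code: the `QUERY`/`ANSWER` reply format, the `GAME_STATE` keys, and `stub_model` (a stand-in for a real LLM call) are all assumptions made up for illustration.

```python
from typing import Callable

# Hypothetical game state; the "schema" is just the list of queryable keys.
GAME_STATE = {
    "player.health": 35,
    "player.gold": 120,
    "enemy.distance": 4,
}
SCHEMA = sorted(GAME_STATE)


def run_decision_loop(model: Callable[[str], str], question: str,
                      max_queries: int = 5) -> str:
    """Let the model issue 'QUERY <key>' replies until it sends 'ANSWER <text>'."""
    transcript = f"Schema: {', '.join(SCHEMA)}\nQuestion: {question}\n"
    for _ in range(max_queries):
        reply = model(transcript).strip()
        if reply.startswith("ANSWER "):
            return reply[len("ANSWER "):]
        if reply.startswith("QUERY "):
            key = reply[len("QUERY "):]
            # Append the looked-up value so the model sees it on the next turn.
            transcript += f"{key} = {GAME_STATE.get(key, 'unknown key')}\n"
        else:
            transcript += "Unrecognized reply; use QUERY or ANSWER.\n"
    return "no decision"


def stub_model(transcript: str) -> str:
    """Stand-in for a real LLM (assumption): checks health once, then decides."""
    if "player.health =" not in transcript:
        return "QUERY player.health"
    return "ANSWER retreat and heal"


print(run_decision_loop(stub_model, "what do you do next?"))
```

With a real model in place of `stub_model`, the things to compare across models would be which keys it chooses to query and whether the final answer actually uses the values it retrieved.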

There's also a bunch of interesting examples along the same lines here: https://github.com/cpldcpu/MisguidedAttention. Although I should note that even the top OpenAI models have trouble with much of this stuff.

https://github.com/fairydreaming/farel-bench is another interesting benchmark because it's so simple, and yet look at the number disparity in that last column! It's easy to scale, too.

Unfortunately, we're still at the point in this game where even seemingly trivial and unrelated changes to the prompt (e.g. slightly rewording it, or even changing capitalization in some cases) can have a large effect on the quality of the output, which IMO is a tell-tale sign that the model is really operating in a "stochastic parrot" mode more so than doing any kind of actual reasoning. Thus benchmarks can be used to screen out poorly performing models, but they cannot reliably predict how well a model will actually do what you need it to do.
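
One rough way to check for that kind of brittleness on your own task is to run the same prompt through a few superficial rewrites and see how often the answer changes. The sketch below is only an illustration under assumptions: the particular perturbations, the exact-match comparison, and `brittle_stub` (a fake model standing in for a real API call) are all hypothetical choices.

```python
from typing import Callable, List


def variants(prompt: str) -> List[str]:
    """Superficial rewrites of a prompt: case changes and leading whitespace."""
    candidates = [prompt, prompt.lower(), prompt.upper(),
                  prompt.capitalize(), "  " + prompt]
    # Drop duplicates while keeping the original prompt first.
    return list(dict.fromkeys(candidates))


def consistency(model: Callable[[str], str], prompt: str) -> float:
    """Fraction of variants whose answer matches the answer to the original prompt."""
    answers = [model(v).strip().lower() for v in variants(prompt)]
    baseline = answers[0]
    return sum(a == baseline for a in answers) / len(answers)


def brittle_stub(prompt: str) -> str:
    """Fake model (assumption): gives a different answer to all-caps prompts."""
    return "5" if prompt.isupper() else "4"


print(consistency(brittle_stub, "What is 2 + 2?"))
```

A score well below 1.0 on prompts that matter to you is a hint that the model's answer depends on surface form, which a public benchmark score would not reveal.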