I also hope we get something like this. But sadly, this is not going to work. The reason is this line from the article, which is much harder than it looks:
> and a critic model filters the results for genuinely valuable ideas.
In fact, people have tried this idea. And if you use an LLM or anything similar as the critic, the performance of the model actually degrades in the process: the LLM tries too hard to satisfy the critic, and the critic itself is far from a good reasoner.
So the reason we don't hear much about this idea is not that nobody tried it, but that people tried it, it didn't work, and they are reluctant to publish something that doesn't work.
This affects not only a potential critic model; the entire concept of a "reasoning" model is based on the same flawed idea—that the model can generate intermediate context to improve its final output. If that self-generated context contains hallucinations, baseless assumptions, or doubt, the final output can only be an amalgamation of that. I've seen the "thinking" output arrive at a correct solution in the first few steps, but then talk itself out of it later. Or go into logical loops without actually arriving at anything.
The reason why "reasoning" models tend to perform better is simply due to larger scale and better training data. There's nothing inherently better about them. There's nothing intelligent either, but that's a separate discussion.
Reasoning models are trained from non-reasoning models of the same scale, and the training data is the output of the same model, filtered through a verifier. Generating intermediate context to improve the final output is not an idea that reasoning models are based on, but an outcome of the training process: empirically, the model produces answers that pass the verifier more often if it generates the intermediate steps first.
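The training loop described above is essentially rejection sampling: sample many traces, keep only those whose final answer passes the verifier, and fine-tune on the survivors. A minimal sketch, where `generate_with_steps` and `verifier` are stand-ins for the real model and checker (all names here are illustrative, not any lab's actual API):

```python
import random

def generate_with_steps(problem, rng):
    """Stub for the base model: emit intermediate steps plus a final answer."""
    steps = [f"consider {problem}", "derive intermediate result"]
    # Pretend the model reasons correctly only about half the time.
    answer = problem * 2 if rng.random() < 0.5 else problem * 2 + 1
    return {"steps": steps, "answer": answer}

def verifier(problem, answer):
    """Stub for a hard verifier (unit test, proof checker, exact match)."""
    return answer == problem * 2

def build_training_set(problems, samples_per_problem=8, seed=0):
    """Keep only traces whose final answer passes the verifier."""
    rng = random.Random(seed)
    kept = []
    for p in problems:
        for _ in range(samples_per_problem):
            trace = generate_with_steps(p, rng)
            if verifier(p, trace["answer"]):
                kept.append((p, trace))  # fine-tune on these traces later
    return kept

data = build_training_set([1, 2, 3])
print(f"kept {len(data)} verified traces")
```

The point is that the verifier only ever judges the final answer; the intermediate steps survive or die with it, which is how "reasoning" emerges without anyone designing it in.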
That the model still makes mistakes doesn't mean it's not an improvement: the non-reasoning base model makes even more mistakes when it tries to skip straight to the answer.
Thanks. I trust that you're more familiar with the internals than myself, so I stand corrected.
I'm only speaking from personal usage experience, and don't trust benchmarks since they are often gamed, but if this process produces objectively better results that aren't achieved by scaling up alone, then that's a good thing.
> The reason why "reasoning" models tend to perform better is simply due to larger scale and better training data.
Except that we can try the exact same pre-trained model with reasoning enabled vs. disabled and empirically observe that reasoning produces better, more accurate results.
Research/benchmarks aside, try giving a somewhat hard programming task to Opus 4 with reasoning off vs. on. Similarly, try the same with o3 vs. o3-pro (o3-pro reasons for much longer).
I'm not going to dig through my history for specific examples, but I do these kinds of comparisons occasionally when coding, and it's not unusual to have e.g. a bug that o3 can't figure out, but o3-pro can. I think this is widely accepted by engineers using LLMs to help them code; it's not controversial.
Huh, I wasn't aware that reasoning could be toggled. I use the OpenRouter API, and just saw that this is supported both via their web UI and API. I'm used to Sonnet 3.5 and 4 without reasoning, and their performance is roughly the same IME.
I wouldn't trust comparing two different models, even from the same provider and family, since there could be many reasons for the performance to be different. Their system prompts, training data, context size, or runtime parameters could be different. Even the same model with the same prompt could have varying performance. So it's difficult to get a clear indication that the reasoning steps are the only changing variable.
But toggling it on the same model would be a more reliable way to test this, so I'll try that, thanks.
It depends on the problem domain you have and the way you prompt things. Basically, reasoning is better in cases where using the same model to critique itself over multiple turns would be better.
With code, for example, a single shot without reasoning might hallucinate a package or not conform to the rest of the project's style. Then you ask the LLM to check. Then you ask it to revise itself to fix the issue. If the base model can do that, then turning on reasoning basically allows it to self-check for the self-correctable failures.
When generating content, you can ask it to consider or produce intermediate deliverables like summaries of input documents that it then synthesizes into the whole. With reasoning on, it can do the intermediate steps and then use that.
The main advantage is that the system autonomously figures out a bunch of intermediate steps and works through them. Again, no better than it probably could do with some guidance over multiple interactions, but that itself is a big productivity benefit. The second-gen (or really 1.5-gen) reasoning models also seem to have been trained on enough reasoning traces that they are starting to know about additional factors to consider, so the reasoning loop is tighter.
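The multi-turn draft/check/revise loop described above can be sketched like this, with stub functions standing in for the actual model calls (everything here is hypothetical; the "hallucinated package" check mirrors the example in the comment):

```python
def draft(task):
    """Stub first attempt: imports a package that doesn't exist."""
    return "import imaginary_pkg\nprint('hello')"

def critique(code, allowed_pkgs):
    """Stub check: flag imports not in the project's dependency list."""
    problems = []
    for line in code.splitlines():
        if line.startswith("import "):
            pkg = line.split()[1]
            if pkg not in allowed_pkgs:
                problems.append(f"unknown package: {pkg}")
    return problems

def revise(code, problems):
    """Stub fix: drop the offending imports."""
    bad = {p.split(": ")[1] for p in problems}
    return "\n".join(l for l in code.splitlines()
                     if not (l.startswith("import ") and l.split()[1] in bad))

def solve(task, allowed_pkgs, max_turns=3):
    """Draft, then critique-and-revise until clean or out of turns."""
    code = draft(task)
    for _ in range(max_turns):
        problems = critique(code, allowed_pkgs)
        if not problems:
            break
        code = revise(code, problems)
    return code

print(solve("greet the user", allowed_pkgs={"os", "sys"}))
```

Turning on reasoning effectively folds these explicit turns into a single pass, which only helps when the base model could have done the critique step itself.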
Reasoning cannot actually be toggled. LLM companies serve completely different models based on whether you have reasoning enabled or disabled for "Opus 4".
But what if the critic is just hard reality? If you ask an LLM to write a computer program, instead of criticizing it, you can run it and test it. If you ask an LLM to prove a theorem, let it write the proof in a formal logic language so it can be verified. Etcetera.
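For code, "the critic is hard reality" is concrete: execute each candidate against tests and keep only what actually passes, rather than asking another model for an opinion. A minimal sketch, with `generate_candidates` as a stand-in for sampling an LLM:

```python
def generate_candidates():
    """Stub: a few candidate implementations, only one of them correct."""
    return [
        "def add(a, b): return a - b",
        "def add(a, b): return a + b",
        "def add(a, b): return b",
    ]

def passes_tests(src):
    """Run the candidate and check it against concrete test cases."""
    ns = {}
    try:
        exec(src, ns)  # compile and run: reality, not opinion
        return ns["add"](2, 3) == 5 and ns["add"](-1, 1) == 0
    except Exception:
        return False

survivors = [c for c in generate_candidates() if passes_tests(c)]
print(survivors)  # only the correct implementation remains
```

The filter here can't be gamed the way an LLM critic can: a candidate either computes the right answers or it doesn't. (In production you'd sandbox the `exec`, of course.)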
Generated code only works because the "test" part (compile/validate/analyze, etc.) is completely external and was written before any mass-market LLMs. There is no such external validator for new theorems, books, pictures, text guides, etc. You can't just run hard_reality.exe on a generated poem or a scientific paper to deem it "correct". That is only possible with programming languages, and even then not always.
Your proposed approach to science would cover only an extremely tiny subset of math, probably theorems proven by automation. And it is questionable whether those theorems would even be useful. A good mathematician with CS experience could probably write a generator of new useless theorems, something along the lines of "is every sequential cube plus the square of a number divisible by the root of the seventh smallest prime multiplied by log n of that number plus blabla...". One can generate such theorems and formally prove or disprove them, yes.
On the other hand, any novel science usually requires deep and wide exploratory research, often involving hard or flawed experimentation or observation. One can train an LLM on a PhD curriculum in astrophysics, then give that LLM an API to some new observatory and instruct it to "go prove the cosmological constant". And it will do so, but the result will be generated garbage, because there is no formal way to prove such results. There is no formal way to prove why the pharaohs decided to stop building pyramids, despite there being some decent theories. This is science too, you know. You can't formally prove that some gene sequence is responsible for trait X, etc.
I would say a majority of science is not formally provable.
And lastly, you dismiss books/texts, but those are a huge chunk of the intellectual and creative work of humans. Say you are an engineer and you have a CAD model with a list of parts and parameters for a rocket, for example. Now you need to write a guide for it. An LLM can do that; it can generate guide-looking output. The issue is that there is no way to automatically proof it or find issues in it. And there are lots of items like that.
> You can't formally prove that some gene sequence is responsible for trait X etc.
Maybe not formally in some kind of mathematical sense. But you certainly could have simulation models of protein synthesis, and maybe even higher-order simulations of tissues and organs. You could also let the AI scientist verify experimental hypotheses by giving it access to robotic lab processes. In fact it seems we are going down both fronts right now.
Nobody argues that LLMs aren't useful for bulk processing of a billion datapoints or looking for obscure correlations in unedited data. But the premise of Gwern's article is that to be considered thinking, an LLM must initiate such a search on its own and arrive at a novel conclusion on its own.
Basically if:
A) A scientist has an idea > triggers an LLM program to sift through a ton of data > the LLM prints out correlation results > the scientist reads them and proves/disproves the idea. In this case, while the LLM did the bulk of the work, it did not arrive at a breakthrough on its own.
B) The LLM is idling > then the LLM triggers some API to get a specific set of data > the LLM correlates the results > the LLM prints out a complete hypothesis with proof (or disproves it). In this case we can say the LLM made a breakthrough.
I think the problem here is that you assume the LLM has to operate isolated from the world, i.e. without interaction. If you put a human scientist in isolation, then you cannot have high expectations either.
I don't assume the LLM would be isolated; I assume the LLM would be incapable of interacting in any meaningful way on its own (i.e. not triggered by direct input from a programmer).
IME, on a daily basis, Claude Code (a supposed SoTA agent) constantly disables and bypasses tests and checks on my codebase, despite clear prompting guidelines and all the /woo/ like ultrathink etc.
I think if we had a good enough simulation of reality, and a fast one—something like an accelerable Minecraft with real-world physics—then this idea might actually work.
But the hard reality we can currently generate efficiently and feed into LLMs usually has a narrow scope. It feels like teaching a kid only textbook math for several years and nothing else. The LLM mostly overoptimizes in these very specific fields, but the overall performance might even get worse.
True, and the successful ones usually require an external source of information.
For AlphaGo, it is the simple algorithm that decides who won a game of Go. For GANs, it is the images labeled by humans.
In these scenarios, the critic is the medium that transforms external information into the gradient that optimizes the actor, not the direct source of that information.
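A toy sketch of that distinction: below, the learning signal originates entirely in an external win rule (who won), and the update step merely relays it into weight changes, the way a critic relays it into gradients. The "game" here is invented for illustration—win by playing the larger number:

```python
import random

def win_rule(a, b):
    """External ground truth: not learned, just reality of the game."""
    return a > b

def play_and_update(weights, rounds=2000, lr=0.1, seed=0):
    """Shift the policy toward actions the external rule rewards."""
    rng = random.Random(seed)
    actions = list(weights)
    for _ in range(rounds):
        # Sample our move from the current policy; opponent plays uniformly.
        a = rng.choices(actions, weights=[weights[x] for x in actions])[0]
        b = rng.choice(actions)
        reward = 1.0 if win_rule(a, b) else -1.0  # signal comes from outside
        weights[a] = max(0.01, weights[a] + lr * reward)
    return weights

w = play_and_update({1: 1.0, 2: 1.0, 3: 1.0})
print(max(w, key=w.get))  # the policy converges on the winning action
```

Strip out `win_rule` and replace it with another model's opinion, and the loop has no external information left to transmit—which is the failure mode of the pure LLM-critic setup discussed upthread.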
The LLM doesn't have to know about the critic though. It can just output things and the critic is a second process that filters the output for the end user.