> reinforcement learning, which is what most people mean when they talk about reasoning LLMs
A popularity contest has no place in a technical discussion, and even then it is not clear what evidence you base that statement on.
IMO, a reasoning model is a model trained on lots of reasoning steps, so it is strong at producing those.
RL is used in niches where there is not much training data, so data is generated synthetically, which produces lots of garbage, and the model needs feedback to adjust. Multiplication is not such a niche.
> This shows that o3-mini has >90% accuracy at multiplying numbers up to 8-digits, and it is capable of multiplying numbers much larger than that. Whereas, gpt-4o could only multiply 2-digit numbers reliably.
It could simply be that one model has training data for this and the other doesn't; you can't draw any conclusion without inspecting OpenAI's data.
Also, your examples actually demonstrate that frontier LLMs can't learn and reproduce a trivial algorithm reliably, and the results are in the quality range of a stochastic parrot.
1. This is obviously not about popularity... It is about capability. You cannot use crappy 2-year-old models with chain-of-thought to make inferences about frontier reasoning models that were released less than a year ago.
2. It is literally impossible for the models to have memorised all the results from multiplying 8-digit numbers. There are at least 10^14 possible 8-digit multiplications (a very loose lower bound; the full count of ordered pairs is closer to 8×10^15), which from an information-theory perspective would require on the order of 48 PiB of data to hold as a naive lookup table (see the sketch after this list). They have to be applying algorithms internally to perform this task, even if that algorithm is just decompressing some unbelievably well-compressed form of the results.
3. If you expect 100% reliability, obviously all humans would also fail. Therefore, do humans not reason? The answer is obviously no. We are trying to demonstrate that LLMs can exhibit reasoning here, not whether or not their reasoning has flaws or limitations (which it obviously does).
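For what it's worth, here is a minimal back-of-the-envelope sketch of that storage estimate. The only inputs are the pair count and the bits per product; the script itself is just an illustration of the arithmetic:

```python
import math

# Count of ordered pairs of 8-digit numbers (each in 10^7 .. 10^8 - 1).
n_values = 10**8 - 10**7          # 9 * 10^7 distinct 8-digit numbers
n_pairs = n_values ** 2           # ~8.1 * 10^15 ordered pairs

# Each product is below 10^16, so it fits in ceil(log2((10^8 - 1)^2)) = 54 bits.
bits_per_product = math.ceil(math.log2((10**8 - 1) ** 2))

total_bytes = n_pairs * bits_per_product / 8
print(f"{n_pairs:.2e} pairs, {bits_per_product} bits per product")
print(f"~{total_bytes / 2**50:.1f} PiB for a naive lookup table")
```

The exact constant doesn't matter; the point is that a naive product table is orders of magnitude beyond what could fit in a model's weights.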
> You cannot use crappy 2-year-old models with chain-of-thought to make inferences about frontier reasoning models that were released less than a year ago.
The idea is to train a new specialized model, which could specifically demonstrate whether an LLM can learn multiplication.
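To make that concrete, here is a minimal sketch of how such an experiment might generate its training data; the prompt/target format, the digit ranges, and the idea of holding out digit lengths are my own illustration, not anything from OpenAI:

```python
import random

def make_example(max_digits: int = 8) -> dict:
    """One multiplication problem as a prompt/target pair."""
    d1 = random.randint(1, max_digits)
    d2 = random.randint(1, max_digits)
    a = random.randint(10 ** (d1 - 1), 10 ** d1 - 1)
    b = random.randint(10 ** (d2 - 1), 10 ** d2 - 1)
    return {"prompt": f"{a} * {b} =", "target": str(a * b)}

# A toy corpus; a real experiment would generate millions of examples,
# fine-tune on them, and hold out some digit lengths to test whether the
# model generalizes rather than memorizes.
dataset = [make_example() for _ in range(10)]
for ex in dataset:
    print(ex["prompt"], ex["target"])
```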
> It is literally impossible for the models to have memorised how to multiply 8-digit numbers. There are at least 10^14 8-digit multiplications that are possible
Sure, but they could memorize fragments: rules of the form "if the operands contain this sequence of digits, then the result contains that sequence of digits", which is a much smaller space.
> If you expect 100% reliability, obviously all humans would also fail. Therefore, do humans not reason?
Humans fail here because they are weak at this: they can't reliably do arithmetic and sometimes make mistakes. I also speculate that if you give a human enough time and ask them to triple-check the calculations, the result will be very good.
We also cap how long we let reasoning LLMs think for. OpenAI researchers have already discussed models they let reason for hours that could solve much harder problems.
But regardless, I feel like this conversation is useless. You are clearly motivated not to think LLMs are reasoning by 1) only looking at crappy old models as some sort of evidence about new models, which is nonsense, and 2) coming up with nonsensical arguments about how they could still just be memorising answers. Even if they memorised sequences, they would still have to put them together to get the exact right answer to 8-digit multiplications in >90% of cases. That requires the application of algorithms, aka reasoning.