I think you just made lots of handwaving statements. Here is a result which says LLMs can't do multi-digit multiplication well: https://arxiv.org/pdf/2510.00184
We are talking about reasoning models here, not old non-reasoning models like Llama-90B and GPT-4. Obviously, they cannot multiply numbers. That was never in question.
Maybe at least give a cursory glance at a paper before trying to cite it to support your point?
I find it fun that this paper also points out that, using another training method, ICoT, they can produce models that multiply numbers perfectly. The frontier reasoning models can still make mistakes; they just get very close a lot of the time, even with 10-20 digit numbers. The ICoT models can do it perfectly, but multiplying numbers is all they can do.
They do not apply reinforcement learning, which is what most people mean when they talk about reasoning LLMs. This means it is not comparable to the frontier reasoning models.
This shows that o3-mini has >90% accuracy at multiplying numbers up to 8 digits, and it is capable of multiplying numbers much larger than that, whereas gpt-4o could only multiply 2-digit numbers reliably.
That's not to mention that if you give these models a Python interpreter, they can also do this task perfectly and tackle much more complicated tasks as well. Although, that is rather separate to the models themselves being able to apply the reasoning steps to multiply numbers.
> reinforcement learning, which is what most people mean when they talk about reasoning LLMs
A popularity contest has no place in a technical discussion, and even then it is not clear what evidence you base that statement on.
imo, a reasoning model is a model trained on lots of reasoning steps, so it is strong at producing them.
RL is used in niches where there is not much training data, so the data is synthetically generated, which produces lots of garbage, and the model needs feedback to adjust. Multiplication is not such a niche.
> This shows that o3-mini has >90% accuracy at multiplying numbers up to 8 digits, and it is capable of multiplying numbers much larger than that, whereas gpt-4o could only multiply 2-digit numbers reliably.
It could just be that one model has training data for this and another doesn't; you can't come to any conclusion without inspecting OpenAI's training data.
Also, your examples actually demonstrate that frontier LLMs can't learn and reproduce a trivial algorithm reliably, and the results are actually in the quality range of a stochastic parrot.
1. This is obviously not about popularity... It is about capability. You cannot use crappy 2-year-old models with chain-of-thought to make inferences about frontier reasoning models that were released less than a year ago.
2. It is literally impossible for the models to have memorised all the results from multiplying 8-digit numbers. There are at least 10^14 8-digit multiplications that are possible (a lower bound), which from an information-theoretic perspective would require a minimum of 48 PiB of data to hold. They have to be applying algorithms internally to perform this task, even if that algorithm is just decompressing some unbelievably well-compressed form of the results.
3. If you expect 100% reliability, obviously all humans would also fail. Therefore, do humans not reason? Obviously they do. The point is whether LLMs can exhibit reasoning here, not whether their reasoning has flaws or limitations (which it obviously does).
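The storage estimate in point 2 can be sanity-checked with a few lines of Python. This is only a back-of-envelope sketch under stated assumptions: every ordered pair of 8-digit operands is counted, and each product is stored in about log2 of the largest possible product bits. The function name `lookup_table_size_pib` is made up for this example. Under those assumptions the full table comes out at roughly 48 PiB; counting only the 10^14 lower bound gives roughly 0.6 PiB, which is still hundreds of terabytes.

```python
import math

def lookup_table_size_pib(num_pairs: int, max_product: int) -> float:
    """Rough size in PiB of a table holding one product per operand pair."""
    # Assumption: each stored product costs about log2(max_product) bits.
    bits_per_product = math.log2(max_product)
    total_bytes = num_pairs * bits_per_product / 8
    return total_bytes / 2**50  # 1 PiB = 2**50 bytes

# 8-digit numbers run from 10,000,000 to 99,999,999.
EIGHT_DIGIT_COUNT = 10**8 - 10**7
MAX_PRODUCT = (10**8 - 1) ** 2

# Assumption: count every ordered pair (a, b) of 8-digit operands.
size_all_pairs = lookup_table_size_pib(EIGHT_DIGIT_COUNT**2, MAX_PRODUCT)

# The looser lower bound of 10^14 multiplications quoted in the thread.
size_lower_bound = lookup_table_size_pib(10**14, MAX_PRODUCT)

print(f"all ordered pairs: {size_all_pairs:.1f} PiB")
print(f"10^14 pairs only:  {size_lower_bound:.2f} PiB")
```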
> You cannot use crappy 2-year-old models with chain-of-thought to make inferences about frontier reasoning models that were released less than a year ago.
The idea is to train a new specialized model, which could specifically demonstrate whether an LLM can learn multiplication.
> It is literally impossible for the models to have memorised how to multiply 8-digit numbers. There are at least 10^14 8-digit multiplications that are possible
Sure, but they could memorize fragments: if the input fragment contains this sequence of digits, then the output fragment must contain that sequence of digits, which is a much smaller space.
> If you expect 100% reliability, obviously all humans would also fail. Therefore, do humans not reason?
Humans fail in this case because they can't reliably do arithmetic and sometimes make mistakes. I also speculate that if you give a human enough time and ask them to triple-check the calculations, the result will be very good.
We also cap how long we let reasoning LLMs think for. OpenAI researchers have already discussed models they let reason for hours that could solve much harder problems.
But regardless, I feel like this conversation is useless. You are clearly motivated to believe LLMs are not reasoning: 1) you only look at crappy old models as some sort of evidence about new models, which is nonsense, and 2) you come up with nonsensical arguments about how they could still be memorising answers. Even if they memorised sequences, they still have to put those together to get the exact right answers to 8-digit multiplication in >90% of cases. That requires the application of algorithms, aka reasoning.
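To make that last point concrete, here is a minimal sketch of schoolbook long multiplication that only ever looks up memorised single-digit products and then combines them with shifts and carries. It is purely illustrative and makes no claim about what an LLM does internally; the names `TIMES_TABLE` and `long_multiply` are made up for this example.

```python
# The only "memorised" facts: the 100 single-digit products.
TIMES_TABLE = {(a, b): a * b for a in range(10) for b in range(10)}

def long_multiply(x: str, y: str) -> str:
    """Multiply two non-negative decimal strings digit by digit."""
    xs = [int(d) for d in reversed(x)]  # least significant digit first
    ys = [int(d) for d in reversed(y)]
    result = [0] * (len(xs) + len(ys))  # enough room for every digit

    for i, a in enumerate(xs):
        carry = 0
        for j, b in enumerate(ys):
            # Combine a memorised product with the running digit and carry.
            total = result[i + j] + TIMES_TABLE[(a, b)] + carry
            result[i + j] = total % 10
            carry = total // 10
        # The leftover carry lands one position past this row's last digit.
        result[i + len(ys)] += carry

    digits = "".join(str(d) for d in reversed(result)).lstrip("0")
    return digits or "0"

print(long_multiply("12345678", "87654321"))
```

Even with every fragment memorised, a wrong carry or a misaligned shift gives a wrong final answer, so the combining steps are where the algorithm lives.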