
> So, that makes me come back to this question of what definition of reasoning do people use that reasoning models do not meet?

The models can learn reasoning rules, but they are not able to apply them consistently or to recognize that the rules they have learned are inconsistent. (See also my other comment, which references comments I made earlier.)

And I think they can't without a tradeoff, as I commented at https://news.ycombinator.com/item?id=45717855 ; the consistency requires a certain level of close-mindedness.



Yes, so I think in this case we use different definitions of reasoning. You include reliability as a part of reasoning, whereas I do not.

I would argue that humans are not 100% reliable in their reasoning, and yet we still claim that they can reason. So, even though I would agree that the reasoning of LLMs is much less reliable, careful, and thoughtful than that of smart humans, that does not mean that they are not reasoning. Rather, it means that their reasoning is less reliable and less well-applied than people's. But they are still performing reasoning tasks (even if their application of reasoning can be flawed).

Maybe the problem is that I am holding out a minimum bar for LLMs to clear to count as reasoning (demonstrated application of logical algorithms to solve novel problems in any domain), whereas other people are holding the bar higher (consistent and logical application of rules in all or most domains).


The problem is that if you're not able to apply the reasoning rules consistently, then you will always fail on a large enough problem. And if you have an inconsistent set of reasoning rules, then someone can set up a problem as a trap so that the reasoning fails.

You can argue that a damaged toaster is still a toaster, conceptually. But if it doesn't work, then it's useless. As it stands, models lack the ability to reason, because they can fail to reason and you can't do anything about it. In the case of humans, it's valid to say they can reason, because humans can at least fix themselves; models can't.


The reasoning does not need to be 100% accurate to be useful. Humans are rarely 100% accurate at anything, and yet over time we can build up large models of problems using verification and review. We can do the exact same thing with LLMs.

The best example of this is Sean Heelan, who used o3 to find a real security vulnerability in the Linux kernel: https://sean.heelan.io/2025/05/22/how-i-used-o3-to-find-cve-...

Sean Heelan ran o3 100 times, and it found a known vulnerability in 8% of runs. For a security audit, that is immensely useful, since an expert can spend the time to look at the results from a dozen runs and quickly decide whether there is anything real. Even more remarkably, that same testing exposed a zero-day that he was not even looking for. That is pretty incredible for a system that makes mistakes.
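
To put a rough number on why a dozen runs is already enough (a back-of-the-envelope sketch, assuming the runs are independent and the 8% per-run hit rate holds):

    # Chance that at least one of N independent runs surfaces the bug,
    # given an 8% hit rate per run: 1 - 0.92**N.
    for n in (1, 12, 100):
        p = 1 - (1 - 0.08) ** n
        print(f"{n:3d} runs -> {p:.0%} chance of at least one hit")
    # 1 run -> 8%, 12 runs -> ~63%, 100 runs -> ~100%

So even a very unreliable per-run signal becomes a strong signal once you repeat it and have an expert review the candidates.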

This is why LLM reasoning absolutely does not need to be perfect to be useful. Human reasoning is inherently flawed as well, and yet through systems like peer review and reproduction of results, we can still make tremendous progress over time. It is just about building systems of verification and review so that we don't need to trust any LLM output blindly. That said, greater reliability would make it much easier to get good results from LLMs. But it's not required.
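
A minimal sketch of what such a verification-and-review loop could look like (purely illustrative; run_audit is a hypothetical stand-in for one LLM audit run, not anyone's actual pipeline):

    from collections import Counter

    def triage(run_audit, n_runs=12, min_hits=2):
        """Run the auditor repeatedly; keep only findings reported in multiple runs."""
        counts = Counter()
        for _ in range(n_runs):
            for finding in run_audit():   # run_audit() returns e.g. suspected bug locations
                counts[finding] += 1
        # Whatever survives still goes to a human expert for review;
        # no single model output is trusted blindly.
        return [f for f, hits in counts.items() if hits >= min_hits]

The specific thresholds don't matter; the point is that repetition plus human review turns an unreliable reasoner into a useful one.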



