
This is an acceptable solution only if the government doesn't know which platform you are trying to access either.

Everyone makes mistakes, it's good to admit them.

My team still uses Sonnet 3.5 for pretty much everything we do because it's more than enough and it's much, much faster than newer models. The only reason we're switching is that the models are getting deprecated...

If you're on Mac and can't get .nanorc to work, check out https://stackoverflow.com/a/73373788/3876196

It's also possible that you simply do NOT have nano installed at all, and just have a symlink from nano to pico by default. That was the case for me. In this situation, install nano and it should work.
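
For what it's worth, here's a quick way to check whether your "nano" is really pico in disguise; a rough sketch assuming a Unix-like system, adjust paths as needed:

    import os
    import shutil

    # Find whatever "nano" resolves to on the PATH, then follow any symlinks.
    nano_path = shutil.which("nano")
    if nano_path is None:
        print("nano is not on your PATH at all")
    else:
        real = os.path.realpath(nano_path)
        print(f"nano -> {real}")
        if "pico" in os.path.basename(real).lower():
            print("This is actually pico; install GNU nano to get .nanorc support")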


> [...] recent studies show that transformers and LLMs fail catastrophically once reasoning problems exceed modest complexity. We revisit these findings through the lens of large reasoning models (LRMs) -- LLMs fine-tuned with incentives for step-by-step argumentation and self-verification

This was the obvious outcome of the study (don't get me wrong, obvious outcomes are still worth having research on).

"LRMs" *are* just LLMs. There's no such thing as a reasoning model, it's just having an LLM write a better prompt than the human would and then sending it to the LLM again.

Despite what Amodei and Altman want Wall Street to believe, they did not suddenly unlock reasoning capabilities in LLMs by essentially just running two different prompts in sequence to answer the user's question.

The truly amazing thing is that reasoning models show ANY improvement at all compared to non-reasoning models, when they're the same exact thing.


What do you mean by reasoning?

If you mean solving logic problems, then reasoning LLMs seem to pass that bar, as they do very well in programming and maths competitions. Reasoning LLMs can also complete problems like multiplying large numbers, which requires applying some sort of algorithm, since the results cannot just be memorised. They also do this much better than standard pre-trained LLMs with no RL.

So, that makes me come back to this question of what definition of reasoning do people use that reasoning models do not meet? They're not perfect, obviously, but that is not a requirement of reasoning if you agree that humans can reason. We make mistakes as well, and we also suffer under higher complexity. Perhaps they are less reliable than trained humans at knowing whether they have made mistakes, but I wouldn't personally include reliability in my definition of reasoning (just look at how often humans make mistakes in tests).

I have yet to see any serious, reasoned argument for why the amazing achievements of reasoning LLMs in maths and programming competitions, on novel problems, do not count as "real reasoning". It seems much more that people just don't like the idea of LLMs reasoning, and so reject it without giving an actual reason themselves, which seems somewhat ironic to me.


I guess we mean "useful reasoning" here, as opposed to the idiot savant kind. I mean, it's a fair ask, since these are marketed as _tools_ you can use to implement _industrial processes_ and even replace your human workers.

In that sense, I guess the model does not need to be the most reasonable interpreter of vague and poorly formulated user inputs, but I think it needs to improve at least a bit to become a useful general appliance and not just a test-scoring automaton.

The key differentiator here is that tests generally _are made to be unambiguously scoreable_. Real world problems are often more vague from the point of view of optimal outcome.


^^This is a great view, and it seems generally widely understood by the rank and file techies. I feel pity for the general-public retail investors who are about to be left holding the bag for the VCs, after a certain major <ahem> champion goes to IPO soon.

Thanks. So, people are extending "reasoning" to include making good decisions, rather than just solving logic problems. It makes sense to me that, if people use that definition, LLMs are pretty bad at "reasoning".

Although, I would argue that this is not reasoning at all, but rather "common sense" or the ability to have a broader perspective or think of the future. These are tasks that come with experience. That is why these do not seem like reasoning tasks to me, but rather soft skills that LLMs lack. In my mind these are pretty separate concerns to whether LLMs can logically step through problems or apply algorithms, which is what I would call reasoning.


Ah yes, let me then unchain my LLM on those nasty unsolved math and logic problems I've absolutely not been struggling with over the course of my career.

That's the real deal.

They say LLMs are PhD-level. Despite billions of dollars, PhD-LLMs sure are not contributing much to solving known problems, except of course for a few limited marketing stunts.


IMHO that's the key differentiator.

You can give a human PhD an _unsolved problem_ in a field adjacent to their expertise and expect some reasonable resolution. LLM PhDs solve only known problems.

That said, humans can also be really bad problem solvers.

If you don't care about solving the problem and only want to create paperwork for the bureaucracy, I guess you don't care either way ("My team's on it!"), but companies that don't go out of business generally recognize a lack of outcomes where it matters pretty quickly.



I wish our press were not effectively muted or bought off by the money; none of the journos has the cojones to call out the specific people who were blabbing about PhD levels, AGI, etc. They should be goddamn calling them out every single day, essentially doing their job, but they are too timid for that now.

I've "unchained" my LLM on a lot of problems that I probably could solve, but that would take me time I don't have, and that it has solved in many case faster than I could. It may not be good enough to solve problems that are beyond us for most of us, but it certainly can solve a lot of problems for a lot of us that have gone unsolved for lack of resources.

Unless you can show us concrete metrics and problems solved, I am inclined not to believe your statement (source: my own intensive experience with LLMs).

It can solve problems you already know how to solve, if you micro-manage it, and it'll BS a lot along the way.

If this is the maximum the AGI-PhD-LRM can do, that'll be disappointing compared to the investments. Curious to see what all this becomes in a few years.


Exactly my experience too. Whoever says they're able to solve "very complex" problems with LLMs is clearly not working on objectively complex problems.

I'm not usually micro-managing it, that's the point.

I sometimes do on problems where I have particular insight, but I mostly find it is far more effective to give it test cases and give it instructions on how to approach a task, and then let it iterate with little to no oversight.

I'm letting Claude Code run for longer and longer with --dangerously-skip-permissions, to the point I'm pondering rigging up something to just keep feeding it "continue" and run it in parallel on multiple problems.

Because at least when you have a good way of measuring success, it works.
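
If anyone wants to try the same thing, this is roughly the shape I have in mind; a sketch only, and the exact claude invocation (the -p prompt flag in particular) is an assumption from my own setup, so check the CLI docs before copying:

    import subprocess

    # Hypothetical list of tasks to grind through; each gets a few "continue" nudges.
    tasks = [
        "Fix the failing tests in module A, then rerun the suite until green.",
        "Refactor the parser in module B without changing its public API.",
    ]

    for task in tasks:
        for attempt in range(5):  # cap the number of nudges per task
            prompt = task if attempt == 0 else "continue"
            result = subprocess.run(
                ["claude", "-p", prompt, "--dangerously-skip-permissions"],
                capture_output=True, text=True,
            )
            print(result.stdout[-500:])  # eyeball the tail of each run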


A lot of maths students would also struggle to contribute to frontier math problems, but we would still say they are reasoning. Their skill at reasoning might not be as good as professional mathematicians, but that does not stop us from recognising that they can solve logic problems without memorisation, which is a form of reasoning.

I am just saying that LLMs have demonstrated they can reason, at least a little bit. Whereas it seems other people are saying that LLM reasoning is flawed, which does not negate the fact that they can reason, at least some of the time.

Maybe generalisation is one area where LLMs' reasoning is weakest though. They can show near-elite performance on nicely boxed-up competition math problems, but their performance drops dramatically on real-world problems where things aren't so neat. We see similar problems in programming as well. I'd argue the progress on this has been promising, but other people would probably vehemently disagree with that. Time will tell.


Thank you for picking at this.

A lot of people appear to be - often not consciously or intentionally - setting the bar for "reasoning" at a level many or most people would not meet.

Sometimes that is just a reaction to wanting an LLM that produces results that are good enough for their own level. Sometimes it reveals a view of fellow humans that would be quite elitist if stated outright. Sometimes it's a kneejerk attempt at setting the bar at a point that would justify a claim that LLMs aren't reasoning.

Whatever the reason, it's a massive pet peeve of mine that it is rarely made explicit in these conversations, and it makes a lot of these conversations pointless because people keep talking past each other.

For my part, a lot of these models clearly do reason by my standard, even if poorly. People also often reason poorly, even when they demonstrably attempt to reason step by step, either because they have motivations to skip over uncomfortable steps or because they don't know how to do it right. But we would still rarely claim they are not capable of reasoning.

I wish more evaluations of LLMs would establish a human baseline to test them against, for exactly this reason. It would be illuminating in terms of actually telling us more about how LLMs match up to humans in different areas.


Computers have forever been doing stuff people can't do.

The real question is how useful this tool is and if this is as transformative as investors expect. Understanding its limits is crucial.


> So, that makes me come back to this question of what definition of reasoning do people use that reasoning models do not meet?

The models can learn reasoning rules, but they are not able to apply them consistently or recognize the rules they have learned are inconsistent. (See also my other comment which references comments I made earlier.)

And I think they can't without a tradeoff, as I commented at https://news.ycombinator.com/item?id=45717855 ; consistency requires a certain level of closed-mindedness.


Yes, so I think in this case we use different definitions of reasoning. You include reliability as a part of reasoning, whereas I do not.

I would argue that humans are not 100% reliable in their reasoning, and yet we still claim that they can reason. So, even though I would agree that the reasoning of LLMs is much less reliable, careful, and thoughtful than that of smart humans, that does not mean that they are not reasoning. Rather, it means that their reasoning is more unreliable and less well applied than people's. But they are still performing reasoning tasks (even if their application of reasoning can be flawed).

Maybe the problem is that I am holding out a minimum bar for LLMs to jump to count as reasoning (demonstrated application of logical algorithms to solve novel problems in any domain), whereas other people are holding the bar higher (consistent and logical application of rules in all/most domains).


The problem is that if you're not able to apply the reasoning rules consistently, then you will always fail on a large enough problem. If you have an inconsistent set of reasoning rules, then a problem can be set up as a trap so that the reasoning fails.

You can argue that a damaged toaster is still a toaster, conceptually. But if it doesn't work, then it's useless. As it stands, models lack the ability to reason because they can fail to reason and you can't do anything about it. In the case of humans, it's valid to say they can reason, because humans can at least fix themselves; models can't.


The reasoning does not need to be 100% accurate to be useful. Humans are rarely 100% accurate at anything, and yet over time we can build up large models of problems using verification and review. We can do the exact same thing with LLMs.

The best example of this is Sean Heelan, who used o3 to find a real security vulnerability in the Linux kernel: https://sean.heelan.io/2025/05/22/how-i-used-o3-to-find-cve-...

Sean Heelan ran o3 100 times, and it found a known vulnerability in 8% of runs. For a security audit, that is immensely useful, since an expert can spend the time to look at the results from a dozen runs and quickly decide whether there is anything real. Even more remarkably, this same testing exposed a zero-day that he was not even looking for. That is pretty incredible for a system that makes mistakes.

This is why LLM reasoning absolutely does not need to be perfect to be useful. Human reasoning is inherently flawed as well, and yet through systems like peer review and reproducing results, we can still make tremendous progress over time. It is just about figuring out systems of verification and review so that we don't need to trust any LLM output blindly. That said, greater reliability would be massively beneficial to how easy it is to get good results from LLMs. But it's not required.
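
To make the "run it many times and triage" idea concrete, the harness is roughly this shape; a minimal sketch, assuming the openai Python client, with the model name and prompt as placeholders (a real audit like Heelan's feeds in far more code and context):

    from collections import Counter
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set

    AUDIT_PROMPT = "Audit the following code for memory-safety bugs: ..."

    findings = Counter()
    for _ in range(100):  # same idea as the 100 runs described above
        resp = client.chat.completions.create(
            model="o3",  # placeholder model name for this sketch
            messages=[{"role": "user", "content": AUDIT_PROMPT}],
        )
        # Crude triage: bucket runs by the first line of the answer, so a human
        # expert only has to look closely at buckets claiming a concrete bug.
        first_line = resp.choices[0].message.content.strip().splitlines()[0]
        findings[first_line] += 1

    for claim, count in findings.most_common(10):
        print(count, claim)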


> then reasoning LLMs seem to pass that bar, as they do very well in programming and maths competitions.

It could be that this is just the result of good stochastic parroting and not reasoning. Both of those niches are narrow, with a high amount of training data (e.g. corps buying solutions from LeetCode and training LLMs on them).

On the other hand, we see that LLMs fail in more complex environments: e.g. ask one to build some new feature in a Postgres database.


This is clearly false. LLMs being able to multiply large numbers is the clear example to me that there is more than just memorisation going on. They cannot just have memorised the answers to multiplying huge numbers the way they do.

That's not to mention that these programming competition problems are designed to be novel. They are as novel as the competition designers can get while sticking to the bounds of the competition. This is clearly not stochastic parrot behaviour.

Additionally, them falling over in large codebases is not evidence that they cannot reason over smaller well-defined problems. It is just evidence that their reasoning has limits, which should not be surprising to anyone. Humans also have limits in our reasoning. That does not mean we do not reason.


I think you just made lots of handwaving statements. Here is a result which says LLMs can't do multi-digit multiplication well: https://arxiv.org/pdf/2510.00184

We are talking about reasoning models here, not old non-reasoning models like Llama-90B and GPT-4. Obviously, they cannot multiply numbers. That was never in question.

Maybe at least give a cursory glance at a paper before trying to cite it to support your point?

I find it fun that this paper also points out that, using another training method, IcoT, they can produce models that multiply numbers perfectly. The frontier reasoning models can still make mistakes; they just get very close a lot of the time, even with 10-20 digit numbers. But the IcoT models can do it perfectly; it's just that multiplying numbers is all they can do.


So, give a reference to results which prove that they can reliably multiply arbitrary numbers.

> Maybe at least give a cursory glance at a paper before trying to cite it to support your invalid point?

They use CoT, aka reasoning steps.


They do not apply reinforcement learning, which is what most people mean when they talk about reasoning LLMs. This means it is not comparable to the frontier reasoning models.

Here is the post I remember seeing: https://www.reddit.com/r/singularity/comments/1ip3vpa/multid...

This shows that o3-mini has >90% accuracy at multiplying numbers up to 8-digits, and it is capable of multiplying numbers much larger than that. Whereas, gpt-4o could only multiply 2-digit numbers reliably.

Here's another person I saw experimenting with this: https://sanand0.github.io/llmmath/

That's not to mention that if you give these models a Python interpreter, they can also do this task perfectly and tackle much more complicated tasks as well. Although, that is rather separate to the models themselves being able to apply the reasoning steps to multiply numbers.
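
For anyone who would rather check this than trust a Reddit thread, the experiment is easy to reproduce; a minimal sketch, again assuming the openai client, with the model name as a placeholder you should swap for whatever you have access to:

    import random
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set
    random.seed(0)

    correct, trials = 0, 20
    for _ in range(trials):
        a = random.randint(10**7, 10**8 - 1)  # random 8-digit numbers
        b = random.randint(10**7, 10**8 - 1)
        resp = client.chat.completions.create(
            model="o3-mini",  # placeholder; use whichever reasoning model you have
            messages=[{"role": "user",
                       "content": f"Compute {a} * {b}. Reply with only the number."}],
        )
        answer = resp.choices[0].message.content.strip().replace(",", "")
        correct += (answer == str(a * b))  # Python provides exact ground truth

    print(f"{correct}/{trials} exact answers")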


> reinforcement learning, which is what most people mean when they talk about reasoning LLMs

A popularity contest has no place in a technical discussion, and even then it's not clear what evidence you base such a statement on.

IMO, a reasoning model is a model trained on lots of reasoning steps, so it is strong at producing those.

RL is used in niches where there is not much training data, so data is synthetically generated, which produces lots of garbage, and the model needs feedback to adjust. And multiplication is not such a niche.

> This shows that o3-mini has >90% accuracy at multiplying numbers up to 8-digits, and it is capable of multiplying numbers much larger than that. Whereas, gpt-4o could only multiply 2-digit numbers reliably.

It could just be a matter of one model having training data for this and another not; you can't come to any conclusion without inspecting OpenAI's data.

Also, your examples actually demonstrate that frontier LLMs can't learn and reproduce a trivial algorithm reliably, and the results are actually in the quality range of a stochastic parrot.


1. This is obviously not about popularity... It is about capability. You cannot use crappy 2-year-old models with chain-of-thought to make inferences about frontier reasoning models that were released less than a year ago.

2. It is literally impossible for the models to have memorised all the results from multiplying 8-digit numbers. There are at least 10^14 8-digit multiplications possible (roughly 8×10^15 ordered pairs, in fact), which from an information-theory perspective would require on the order of 48 PiB of data to hold all the products (see the rough arithmetic sketched after this list). They have to be applying algorithms internally to perform this task, even if that algorithm is just uncompressing some unbelievably well-compressed form of the results.

3. If you expect 100% reliability, obviously all humans would also fail. Therefore, do humans not reason? The answer is obviously not. We are trying to demonstrate that LLMs can exhibit reasoning here, not whether or not their reasoning has flaws or limitations (which it obviously does).
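
(Rough arithmetic behind point 2, for anyone who wants to check it:)

    # Back-of-the-envelope: could the products of all 8-digit multiplications be memorised?
    pairs = (9 * 10**7) ** 2        # ordered pairs of 8-digit numbers, ~8.1e15
    bits_per_product = 54           # each product is < 10^16 < 2^54
    total_bytes = pairs * bits_per_product / 8
    print(f"{total_bytes / 2**50:.1f} PiB")  # ~48.6 PiB just to store the answers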


> You cannot use crappy 2-year-old models with chain-of-thought to make inferences about frontier reasoning models that were released less than a year ago.

The idea is to train a new specialized model, which could specifically demonstrate whether an LLM can learn multiplication.

> It is literally impossible for the models to have memorised how to multiply 8-digit numbers. There are at least 10^14 8-digit multiplications that are possible

Sure, they could memorize fragments: if the input contains this sequence of digits, then the output must contain that sequence of digits, which is a much smaller space.

> If you expect 100% reliability, obviously all humans would also fail. Therefore, do humans not reason?

Humans fail in this case because they can't reliably do arithmetic and sometimes make mistakes. I also speculate that if you give a human enough time and ask them to triple-check the calculations, the result will be very good.


We also cap how long we let reasoning LLMs think for. OpenAI researchers have already discussed models they let reason for hours that could solve much harder problems.

But regardless, I feel like this conversation is useless. You are clearly motivated not to think LLMs are reasoning, by 1) only looking at crappy old models as some sort of evidence about new models, which is nonsense, and 2) coming up with arguments that make no sense about how they could still just be memorising answers. Even if they memorised sequences, they would still have to put them together to get the exact right answer to 8-digit multiplications in >90% of cases. That requires the application of algorithms, aka reasoning.


> only looking at crappy old model

Let me repeat this: it was a newly trained specialized model.

other rants are ignored.


They did not use modern techniques. Therefore it is meaningless.

That’s not to mention that modern frontier LLMs can also be demonstrated to do this task, which is an existence proof in and of itself.


I am not interested in this discussion anymore. Bye.

What a shame

> The truly amazing thing is that reasoning models show ANY improvement at all compared to non-reasoning models, when they're the same exact thing.

It's because they do more compute. The more tokens "spent" the better the accuracy. Same reason they spit out a paragraph of text instead of just giving a straight answer in non-reasoning mode.


I can't remember which paper it's from, but isn't the variance in performance explained by # of tokens generated? i.e. more tokens generated tends towards better performance.

Which isn't particularly amazing, as # of tokens generated is basically a synonym in this case for computation.

We spend more computation, we tend towards better answers.


Don't they have a significant RL component? The "we'll just make it bigger" idea that was peddled a lot after GPT3.5 was nonsense, but that's not the only thing they're doing right now.

"We'll just make it bigger" works. RLVR just gives better performance gains and spends less inference compute - as long as you have a solid way of verifying the tasks.

A simplified way of thinking about it is: pretraining gives LLMs useful features, SFT arranges them into useful configurations, RLVR glues them together and makes them work together well, especially in long reasoning traces. Makes sense to combine it all in practice.

How much pretraining gives an LLM depends on the scale of that LLM, among other things. But raw scale is bounded by the hardware capabilities and the economics - of training and especially of inference.

Scale is still quite desirable - GPT-4.5 scale models are going to become the norm for high end LLMs quite soon.


I'm not against "we'll make it bigger" (although it's as yet unknown if it hits diminishing returns; 4.5 isn't exactly remembered as a great release), I'm against "we'll just (i.e. 'only') make it bigger".

I'm doubtful you'd have useful LLMs today if labs hadn't scaled in post-training.


> The truly amazing thing is that reasoning models show ANY improvement at all compared to non-reasoning models, when they're the same exact thing.

Why is that amazing? It seems expected. Use a tool differently, get different results.


This is a completely meaningless article if they don't provide information about their technical stack, which AWS services they used to use, what TPS they are hitting, what storage size they're using, etc.

The story will be different for every business because every business has different needs.

Given the answer to "How much did migration and ongoing ops really cost?", it seems like they had an incredibly simple infrastructure on AWS and it was really easy to move out. If you use a wider range of services, the cost savings are much more likely to cancel out.


TFA begins with a link to the original article with those details.

If you called "We used EKS" details, then yeah they provide those details.

Assuming this is indeed all they used, this was admittedly nonsense: they were essentially using cloud-based bare metal.


GPT-1 or GPT-2, which would match his reasoning abilities, are unfortunately deprecated, so we can't.

> 1. When aws deploys changes they run through a pipeline which pushes change to regions one at a time.

This is true.

> Most services start with us-east-1 first.

This is absolutely false. Almost every service will FINISH with the largest and most impactful regions.


Agreed. Most services start deployments on a small number of hosts in single AZs in small less-known regions, ramping up from there. In all my years there I don’t recall “us-east-1 first”.


Each AWS service may choose different pipeline ordering based on the risks specific to their architecture.

In general:

You don't deploy to the largest region first because of the large blast radius.

You may not want to deploy to the largest region last because then if there's an issue that only shows up at that scale you may need to roll every single region back (divergent code across regions is generally avoided as much as possible).

A middle ground is to deploy to the largest region second or third.


Yeah, even without taking that into account, the bias is obvious and extreme.


They bought a cool URL, gotta give 'em that.

looking up stateof.lol


Except in reality it's ALL marketing terms for two things: additional prompt sections and APIs.


I more or less agree, but it’s surprising what naming a concept does for the average user.

You see a text file and understand that it can be anything, but end users can’t/won’t make the jump. They need to see the words Note, Reminder, Email, etc.

