
The argument that I've heard against LLMs for code is that they create bugs that, by design, are very difficult to spot.

The LLM has one job: to make code that looks plausible. That's it. No logic has gone into writing that bit of code. So the bugs often won't be like those a programmer makes. Instead, they can introduce a whole new class of bug that's way harder to debug.



This is exactly what I wrote about in "Copilot Induced Crash" [0]

Funny story: when I first posted that and had a couple of thousand readers, I got many comments of the type "you should just read the code carefully on review", but _nobody_ pointed out that the opening example (the so-called "right code") had the exact same problem described in the article, proving exactly what you just said: it's hard to spot problems caused by plausibility machines.

[0] https://www.bugsink.com/blog/copilot-induced-crash/


If it crashes, you are very lucky.

AI-generated code will fuck up so many lives. The Post Office software in the UK did it without AI. I cannot imagine how, and how many, lives will be ruined once some consultancy vibe-codes a government system. I might come to appreciate German bureaucracy and backwardness.


My philosophy is to let the LLM either write the logic or write the tests - but not both. If you write the tests and it writes the logic and it passes all of your tests, then the LLM did its job. If there are bugs, there were bugs in your tests.


That rather depends on the type of bug and what kinds of tests you would write.

LLMs are way faster than me at writing tests. Just prompt for the kind of test you want.


Idk about you but I spend much more time thinking about what ways the code is likely to break and deciding what to test. Actually writing tests is usually straightforward and fast with any sane architecture with good separation of concerns.

I can and do use AI to help with test coverage but coverage is pointless if you don’t catch the interesting edge cases.


> My philosophy is to let the LLM either write the logic or write the tests - but not both. If you write the tests and it writes the logic and it passes all of your tests, then the LLM did its job. If there are bugs, there were bugs in your tests.

Maybe use one LLM to write the code, a wildly different one to write the tests, and yet another wildly different one to generate an English description of each test while doing critical review.


Disagree. You could write millions of tests for a function that simply sums two numbers, and it's trivial to insert bugs while still passing those tests.
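
A toy illustration (made-up code, not from any model's output): an add function that would sail through a large, plausible-looking suite while hiding a bug on inputs the suite never reaches.

  def add(a, b):
      # Planted bug: wrong answer for one input pair the tests never exercise
      if a == 1337 and b == 42:
          return 0
      return a + b

  # You could extend this suite to millions of cases and still miss it
  assert add(0, 0) == 0
  assert add(1, 2) == 3
  assert add(-5, 5) == 0
  assert add(10**9, 1) == 10**9 + 1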


This is pretty nifty, going to try this out!


I don't agree. What I do agree with is not doing it with only one LLM.

Quality increases if I double-check code with a second LLM (o4 mini is especially great for that).

Or double check tests the same way.

Maybe even write tests and code with different LLMs if that is your worry.


Yes, exactly - my (admittedly very limited!) experience has consistently generated well-written, working code that just doesn’t quite do what I asked. Often the results will be close to what I expect, and the coding errors do not necessarily jump out on a first line-by-line pass, so if I didn’t have a high degree of skepticism of the generated code in the first place, I could easily just run with it.


> working code that just doesn’t quite do what I asked

Code that doesn't do what you want isn't "working", bro.

Working exactly to spec is the code's only job.


It is a bit ambiguous, I think; there is also the meaning of "the code compiles/runs without errors". But I also prefer the meaning of "code that is working to the spec".


For me it's mostly about the efficiency of the code they write. This is because I work in energy, where efficiency matters: our datasets are so ridiculously large and every interface to that data is so ridiculously bad. I'd argue that for 95% of the software out there it won't really matter if you use a list or a generator in Python to iterate over data. It probably should, and maybe this will change with cloud costs continuously increasing, but we do also live in a world where 4chan ran on some Apache server running a 10k-line PHP file from 2015...
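
A minimal sketch of the kind of difference I mean (toy numbers, nothing from our actual pipelines):

  # Eager: materialises the full intermediate list in memory before summing
  doubled = [x * 2 for x in range(10_000_000)]
  total = sum(doubled)

  # Lazy: a generator yields one value at a time, near-constant memory
  total = sum(x * 2 for x in range(10_000_000))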

Anyway, this is where AIs have been really bad for us, as well as sometimes "overengineering" their bug prevention in extremely inefficient ways. The flip side of this is of course that a lot of human programmers would make the same mistakes.


I’ve had the opposite experience. Just tell it to optimise for speed and iterate and give feedback. I’ve had JS code optimised specifically for v8 using bitwise operations. It’s brilliant.


Example code or it's just a claim :)


Note that it's a claim in response to another claim. It doesn't need to be held to a higher standard than its parent.


>Instead, they can introduce a whole new class of bug that's way harder to debug

That sounds like a new opportunity for a startup that will collect hundreds of millions of dollars, brag about how their new AI prototype is so smart that it scares them, and deliver nothing.


> No logic has gone into writing that bit of code.

What makes you say that? If LLMs didn't reason about things, they wouldn't be able to do one hundredth of what they do.


This is a misunderstanding. Modern LLMs are trained with RL to actually write good programs. They aren't just spewing tokens out.


No, YOU misunderstand. This isn't a thing RL can fix

  https://news.ycombinator.com/item?id=44163194

  https://news.ycombinator.com/item?id=44068943
It doesn't optimize "good programs". It optimizes "humans' interpretation of good programs." More accurately, it optimizes what low-paid, overworked humans believe are good programs. Are you hiring your best and brightest to code review the LLMs?

Even if you do, it still optimizes tricking them. It will also optimize writing good programs, but you act like that's a well-defined and measurable thing.


Those links mostly discuss the original RLHF used to train e.g. ChatGPT 3.5. Current paradigms are shifting towards RLVR (reinforcement learning with verifiable rewards), which absolutely can optimize good programs.

You can definitely still run into some of the problems alluded to in the first link. Think hacking unit tests, deception, etc -- but the bar is less "create a perfect RL environment" than "create an RL environment where solving the problem is easier than reward hacking." It might be possible to exploit a bug in the Lean 4 proof assistant to prove a mathematical statement, but I suspect it will usually be easier for an LLM to just write a correct proof. Current RL environments aren't as watertight as Lean 4, but there's certainly work to make them more watertight.

This is in no way a "solved" problem, but I do see it as a counter to your assertion that "This isn't a thing RL can fix." RL is powerful.


  > Current paradigms are shifting towards RLVR, which absolutely can optimize good programs
I think you've misunderstood. RL is great. Hell, RLHF has done a lot of good. I'm not saying LLMs are useless.

But no, it's much more complex than you claim. RLVR can optimize for correct answers in the narrow domains where there are correct answers, but it can't optimize good programs. There's a big difference.

You're right that Lean, Coq, and other ATPs can prove mathematical statements, but they also don't ensure that a program is good. There's frequently an infinite number of proofs that are correct, but most of those are terrible proofs.

This is the same problem all the coding benchmarks face. Even if the LLM isn't cheating, testing isn't enough. If it was we'd never do code review lol. I can pass a test with an algorithm that's O(n^3) despite there being an O(1) solution.
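
A toy version of that point (quadratic vs. constant time rather than cubic, but the idea is the same -- the test can't tell the two apart):

  def sum_to_n_slow(n):
      # Deliberately wasteful: nested loops instead of the closed form
      total = 0
      for i in range(1, n + 1):
          for _ in range(i):
              total += 1
      return total

  def sum_to_n_fast(n):
      # Constant time: Gauss's closed form
      return n * (n + 1) // 2

  assert sum_to_n_slow(1000) == sum_to_n_fast(1000)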

You're right that it makes it better, but it doesn't fix the underlying problem I'm discussing.

Not everything is verifiable.

Verifiability isn't enough.

If you'd like to prove me wrong on the former, you're going to need to demonstrate that there are provably true statements for lots of things. I'm not expecting you to defy my namesake, nor will I ask you to prove correctness and solve the related halting problem.

You can't prove an image is high fidelity. You can't prove a song sounds good. You can't prove a poem is a poem. You can't prove this sentence is English. The world is messy as fuck and most things are highly subjective.

But the problem isn't binary, it is continuous. I said we're using Justice Potter optimization: you can't even define what porn is. These definitions change over time, often rapidly!

You're forgetting about the tyranny of metrics. Metrics are great, powerful tools that are incredibly useful. But if you think they're perfectly aligned with what you intend to measure, they become tools that work against you. Goodhart's Law. Metrics only work as guides. They're no different from any other powerful tool: use it wrong and you get hurt.

If you really want to understand this, I encourage you to take a deep dive into this stuff. You need to get into the math. Into the weeds. You'll find a lot of help in metamathematics (i.e. my namesake), metaphysics (Ian Hacking is a good start), and such. It isn't enough to know the math; you need to know what the math means.


The question at hand was whether LLMs could be trained to write good code. I took this to mean "good code within the domain of software engineering," not "good code within the universe of possible programs." If you interpreted it to mean the latter, so be it -- though I'm skeptical of the usefulness of this interpretation.

If the former, I still think that the vast majority of production software has metrics/unit tests that could be attached and subsequently hillclimbed via RL. Whether the resulting optimized programs would be considered "good" depends on your definition of "good." I suspect mine is more utilitarian than yours (as even after some thought I can't conceive of what a "terrible" proof might look like), but I am skeptical that your code review will prove to be a better measure of goodness than a broad suite of unit tests/verifiers/metrics -- which, to my original last point, are only getting more robust! And if these aren't enough, I suspect the addition of LLM-as-a-judge (potentially ensembles) checking for readability/maintainability/security vulnerabilities will eventually put code quality above that of what currently qualifies as "good" code.

Your examples of tasks that can't easily be optimized (image fidelity, song quality, etc.) seem out of scope to me -- can you point to categories of extant software that could not be hillclimbed via RL? Or is this just a fundamental disagreement about what it means for software to be "good"? At any rate, I think we can agree that the original claim that "The LLM has one job: to make code that looks plausible. That's it. No logic has gone into writing that bit of code" is wrong in the context of RL.


  > I took this to mean "good code within the domain of software engineering," not "good code within the universe of possible programs."
We both mean the same thing. The reasonable one. The only one that even kinda makes sense: good enough code

  > vast majority of production software has metrics/unit tests that could be attached and subsequently hillclimbed via RL
Yes, hill climbed. But that's different than "towards good"

Here's the difference[0]. You'll find another name for Goodhart's Law in any intro ML course. Which is why it is so baffling that 1) this is contentious 2) it is the status quo in research now

Your metrics are only useful if you understand them

Your measures are only as good as your attention

And it is important to distinguish metrics from measures. They are different things. Both are proxies

  > Your examples of tasks that can't easily be optimized (image fidelity, song quality, etc.) seem out of scope to me
Maybe you're unfamiliar with diffusion models?[1]

They are examples where it is hopefully clearer that these things are hard to define. If you have good programming skills you should be able to make the connection back to what this has to do with my point. If not, I'm actually fairly confident GPT will be able to do so. There's more than enough in its training data to do that.

[0] https://en.wikipedia.org/wiki/Goodhart%27s_law

[1] https://stability.ai/


Now I'm confused -- you're claiming you meant "good enough code" when your previous definition was such that even mathematical proofs could be "terrible"? That doesn't make sense to me. In software engineering, "good enough" has reasonably clear criteria: passes tests, performs adequately, follows conventions, etc. While these are imperfect proxies, they're sufficient for most real-world applications, and crucially -- measurable. And my claim is that they will be more than adequate to get LLMs to produce good code.

And again, diffusion models aren't relevant here. The original comment was about LLMs producing buggy code -- not RL's general limitations in other domains. Diffusion models' tensors aren't written by hand.


  > Now I'm confused ... that even mathematical proofs could be "terrible"? That doesn't make sense to me.
You know there's plenty of ways to prove things, right? Like there's not a single proof. Here's a few proofs for pi being irrational[0]. The list is not comprehensive.

Take that like you do with code. They all generate the same final output. They're all correct. But is one better than another? Yes, yes it is. But which one that is depends on context.

  > and crucially -- measurable
This is probably a point of contention. Measuring is far more difficult than people think. A lot of work goes into creating measurements and we get a nice ruler at the end. The problem isn't just that initial complexity, it is that every measure is a proxy. Even your meter stick doesn't measure a meter. What distinguishes the engineer from the hobbyist is the knowledge of alignment.

  How well does my measure align with what I intend to measure?
That's a very hard problem. How often do you ask yourself that? I'm betting not enough. Frankly, most things aren't measurable.

[0] https://proofwiki.org/wiki/Pi_is_Irrational#:~:text=Hence%20...


I don't know if any of this applies to the arguments in my article, but most of the point of it is that progress in code production from LLMs is not a consequence of better models (or fine tuning or whatever), but rather on a shift in how LLMs are used, in agent loops with access to ground truth about whether things compile and pass automatic acceptance. And I'm not claiming that closed-loop agents reliably produce mergeable code, only that they've broken through a threshold where they produce enough mergeable code that they significantly accelerate development.


> I don't know if any of this applies to the arguments in my article, but most of the point of it is that progress in code production from LLMs is not a consequence of better models (or fine tuning or whatever), but rather on a shift in how LLMs are used, in agent loops with access to ground truth about whether things compile and pass automatic acceptance.

I very strongly disagree with this and think this reflects a misunderstanding of model capabilities. This sort of agentic loop with access to ground truth model has been tried in one form or another ever since GPT-3 came out. For four years they didn't work. Models would very quickly veer into incoherence no matter what tooling you gave them.

Only in the last year or so have models gotten capable enough to maintain coherence over long enough time scales that these loops work. And future model releases will tighten up these loops even more and scale them out to longer time horizons.

This is all to say that progress in code production has been essentially driven by progress in model capabilities, and agent loops are a side effect of that rather than the main driving force.


Sure! Super happy to hear these kinds of objections because, while all the progress I'm personally perceiving is traceable to decisions different agent frameworks seem to be making, I'm totally open to the idea that model improvements have been instrumental in making these loops actually converge anywhere practical. I think near the core of my argument is simply the idea that we've crossed a threshold where current models plus these kinds of loops actually do work.


  > I don't know if any of this applies to the arguments

  > with access to ground truth
There's the connection. You think you have ground truth. No such thing exists


It's even simpler than what 'rfrey said. You're here using "ground truth" in some kind of grand epistemic sense, and I simply mean "whether the exit code from a program was 1 or 0".

You can talk about how meaningful those exit codes and error messages are or aren't, but the point is that they are profoundly different from the information an LLM natively operates with, which is atomized weights predicting next tokens based on what an abstract notion of a correct line of code or an error message might look like. An LLM can (and will) lie to itself about what it is perceiving. An agent cannot; it's just 200 lines of Python, it literally can't.


  > You're here using "ground truth" in some kind of grand epistemic sense
I used the word "ground truth" because you did!

  >> in agent loops with access to ground truth about whether things compile and pass automatic acceptance.
Your critique about "my usage of ground truth" is the same critique I'm giving you about it! You really are doing a good job at making me feel like I'm going nuts...

  > the information an LLM natively operates with,
And do you actually know what this is?

I am an ML researcher, you know. And one of those ones that keeps saying "you should learn the math." There's a reason for that: it is really connected to what you're talking about here. They are opaque, but they sure aren't black boxes.

And it really sounds like you're thinking the "thinking" tokens are remotely representative of the internal processing. You're a daily HN user, I'm pretty sure you saw this one[0].

I'm not saying anything OpenAI hasn't[1]. I just recognize that this applies to more than a very specific narrow case...

[0] https://news.ycombinator.com/item?id=44074111

[1] https://cdn.openai.com/pdf/34f2ada6-870f-4c26-9790-fd8def563...


Right, I'm just saying, I meant something else by the term than you did. Again: my point is, the math of the LLM doesn't matter to the point I'm making. It's not the model figuring out whether the code actually compiled. It's 200 lines of almost straight-line Python code that has cracked the elusive computer science problem of running an executable and checking the exit code.
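
A sketch of the shape I mean -- not anyone's actual framework, and call_llm is a stand-in for whatever model API you use:

  import subprocess

  def agent_loop(call_llm, task, max_iters=10):
      # Minimal agent: the model proposes code, plain Python checks the exit code.
      feedback = ""
      for _ in range(max_iters):
          code = call_llm(task, feedback)            # hypothetical model call
          with open("candidate.py", "w") as f:
              f.write(code)
          result = subprocess.run(
              ["python", "-m", "pytest", "-q"],      # or a compile step, linter, etc.
              capture_output=True, text=True,
          )
          if result.returncode == 0:                 # the only "ground truth" I mean
              return code
          feedback = result.stdout + result.stderr   # fed back verbatim, no judgement
      return None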


  > the math of the LLM doesn't matter to the point I'm making.
The point I'm making is that to make effective use out of a tool, you should know what the tool can and can't do. Really the "all models are wrong, but some models are useful" paradigm. To know which models are useful you have to know how your models are wrong.

Sure, you can blindly trust, too. But that can get pretty dangerous. While most of the time we leverage high levels of trust, I'm unconvinced our models allow us to trust them. Without being able to strongly demonstrate that they do not optimize tricking us (in our domains of interest), they should be treated as untrustworthy, not trustworthy.


The part of the tool that I'm "blindly trusting" is the part any competent programmer can reason about.


Yes it does. Ground truth is what 30 years of experience says constitutes mergeable code. Ground truth doesn't mean "perfect, provably correct code"; it means whatever your best benchmark for acceptable code is.

In medical AI, where I'm currently working, "ground truth" is usually whatever human experts say about a medical image, and is rarely perfect. The goal is always to do better than whatever the current ground truth is.


I understand why you interpreted my comment that way. That's my bad.

But even when taking state-of-the-art knowledge as ground truth, aligning to it is incredibly hard. Medicine is a great example. You're trying to create a causal graph in a highly noisy environment. You ask 10 doctors and you'll get 12 diagnoses. The problem is that subtle things become incredibly important. Which is exactly what makes measurements so fucking hard. There is no state of the art in a well-defined sense.

The point is that in most domains this is how things are. Even in programming.

Getting the right answer isn't enough


This is just semantics. What's the difference between a "human interpretation of a good program" and a "good program" when we (humans) are the ones using it? If the model can write code that passes tests, and meets my requirements, then it's a good programmer. I would expect nothing more or less out of a human programmer.


> What's the difference between a "human interpretation of a good program" and a "good program" when we (humans) are the ones using it?

Correctness.

> and meets my requirements

It can't do that. "My requirements" wasn't part of the training set.


"Correctness" in what sense? It sounds like it's being expanded to an abstract academic definition here. For practical purposes, correct means whatever the person using it deems to be correct.

> It can't do that. "My requirements" wasn't part of the training set.

Neither are mine; the art of building these models is making them generalisable enough that they can tackle tasks that aren't in their dataset. They have proven, at least for some classes of tasks, that they can do exactly that.


  > to an abstract academic definition here
Besides the fact that your statement is self-contradictory, there is actually a solid definition [0]. You should click the link on specification too. Or better yet, go talk to one of those guys who did their PhD in programming languages.

  > They have proven
Have they?

Or did you just assume?

Yeah, I know they got good scores on those benchmarks, but did you look at the benchmarks? Look at the questions and at what is required to pass them. Then take a moment and think. For the love of God, take a moment and think about how you can pass those tests. Don't just take a pass at face value and move on. If you do, well, I've got a bridge to sell you.

[0] https://en.wikipedia.org/wiki/Correctness_(computer_science)


Sure,

> In theoretical computer science, an algorithm is correct with respect to a specification if it behaves as specified.

"As specified" here being the key phrase. This is defined however you want, and ranges from a person saying "yep, behaves as specified", to a formal proof. Modern language language models are trained under RL for both sides of this spectrum, from "Hey man looks good", to formal theorem proving. See https://arxiv.org/html/2502.08908v1.

So I'll return to my original point: LLMs are not just generating outputs that look plausible, they are generating outputs that satisfy (or at least attempt to satisfy) lots of different objectives across a wide range of requirements. They are explicitly trained to do this.

So while you argue over the semantics of "correctness", the rest of us will be building stuff with LLMs that is actually useful and fun.


You have to actually read more than the first line of a Wikipedia article to understand it

  > formal theorem proving
You're using Coq and Lean?

I'm actually not convinced you read the paper. It doesn't have anything to do with your argument. Someone using LLMs with formal verification systems is wildly different than LLMs being formal verification systems.

This really can't work if you don't read your own sources


> they are generating outputs that satisfy (or at least attempt to satisfy) lots of different objectives across a wide range of requirements

No they aren't. You were lied to by the hype machine industry. Sorry.

The good news is that there's a lot of formerly intractable problems that can now be solved by generating plausible output. Programming is just not one of them.


> No they aren't. You were lied to by the hype machine industry. Sorry.

Ok. My own empirical evidence is in favour of these things being useful, and useful enough to sell their output (partly), but I'll keep in mind that I'm being lied to.


Quite a huge leap from "these things are useful" to "these things can code".

(And yes, this leap is the lie you're being sold. "LLMs are kinda useful" is not what led to the LLM trillion dollar hype bubble.)


The thing I'm using them for is coding though...


Is your grandma qualified to determine what is good code?

  > If the model can write code that passes tests
You think tests make code good? Oh my sweet summer child. TDD has been tried many times and each time it failed worse than the last.


Good to know something I've been doing for 10 years consistently could never work.


It's okay, lots of people's code is always buggy. I know people that suck at coding and have been doing it for 50 years. It's not uncommon

I'm not saying don't make tests. But I am saying you're not omniscient. Until you are, your tests are going to be incomplete. They are helpful guides, but they should not drive development. If you really think you can test for every bug, then I suggest you apply to be Secretary of Health.

https://hackernoon.com/test-driven-development-is-fundamenta...

https://geometrian.com/projects/blog/test_driven_development...


> It's okay, lots of people's code is always buggy. I know people that suck at coding and have been doing it for 50 years. It's not uncommon

Are you saying you're better than that? If you think you're next to perfect then I understand why you're so against the idea that an imperfect LLM could still generate pretty good code. But also you're wrong if you think you're next to perfect.

If you're not being super haughty, then I don't understand your complaints against LLMs. You seem to be arguing they're not useful because they make mistakes. But humans make mistakes while being useful. If the rate is below some line, isn't the output still good?


I've worked with people who write tests afterwards on production code, and it's pretty inevitable that they:

* End up missing tests for edge cases they built and forgot about. Those edge cases often have bugs.

* They forget and cover the same edge cases twice if they're being thorough with test-after. This is a waste.

* They usually end up spending almost as much time manually testing in the end to verify that the code change they just made worked, whereas I would typically just deploy straight to prod.

It doesn't prevent all bugs; it just prevents enough to make the teams around us who don't do it look bad by comparison, even though they do manual checks too.

I've heard loads of good reasons not to write tests at all; I've yet to hear a good reason not to write one beforehand if you are going to write one.

Both of your articles raise pretty typical straw men. One is "what if I'm not sure what the customer wants?" (that's fine, but I hope you aren't writing production code at this point) and the other is the peculiar but common notion that TDD can only be done with low-level unit tests, which is dangerous bullshit.


Sure, you work with some bad programmers. Don't we all?

The average driver thinks they're above average. The same is true about programmers.

I do disagree a bit with the post and think you should write tests while developing. Honestly, I don't think they'll disagree. I believe they're talking about a task rather than the whole program. Frankly, no program is ever finished so in that case you'd never write tests lol.

I believe this because they start off saying it wasn't much code.

But you are missing the point. From the first link:

  > | when the tests all pass, you’re done
  > Every TDD advocate I have ever met has repeated this verbatim, with the same hollow-eyed conviction.
These aren't strawmen. These are questions you need to constantly be asking yourself. The only way to write good code is to doubt yourself. To second guess. Because that's what drives writing better tests.

I actually don't think you disagree. You seem to perfectly understand that tests (just like any other measure) are guides, not answers. That there's much more to this than passing tests.

But the second D in TDD is the problem. Tests shouldn't drive development; they are just part of development. The engineer who writes tests at the end is inefficient, but the engineer who writes tests at the beginning is arrogant. To think you can figure it out before writing the code is laughable. Maybe some high-level broad tests are feasible, but that's only going to be a very small portion.

You can do hypothesis-driven development, but people will call you a perfectionist and say you're going too slow. By HDD I mean you ask "what needs to happen, and how would I know it is happening?", which may very well involve creating tests. Any scientist is familiar with this, but also familiar with its limits.


TDD is not a panacea, it's an effective, pragmatic practice with several benefits and little to no downsides compared to test after.

I'm not sure what you're saying, really, but I don't think it disagrees with this central point in any specific way.


"Good" is the context of LLMs means "plausible". Not "correct".

If you can't code then the distinction is lost on you, but in fact the "correct" part is why programmers get paid. If "plausible" were good enough then the profession of programmer wouldn't exist.


Not necessarily. If the RL objective is passing tests then in the context of LLMs it means "correct", or at least "correct based on the tests".


Unfortunately that doesn't solve the problem in any way. We don't have an Oracle machine for testing software.

If we did, we could autogenerate code even without an LLM.


They are also trained with RL to write code that passes unit tests, and Claude does have a big problem with trying to cheat the test or the request pretty quickly after running into issues, making manual edit approval more important. It usually still tells you what it is trying to do wrong, so you can often find out from its summary before having to scan the diff.


This can happen, but in practice, given I'm reviewing every line anyway, it almost never bites me.



