
The LLMs are generating training data at a faster rate than SO. All the prompts and the responses will eventually be 99.99% of the training data.


Surely you are joking.

You want us to rely on models that are overfit to hallucinated LLM interactions.


Just open enough issues on the parent libraries that they give up and conform to the hallucinations.


I’ve been doing this in my private codebase. When Copilot hallucinates a function, I just go and write the thing. It’s usually a good idea, and Copilot will re-hallucinate the same function independently in another file.


The only way this is useful in the context of code is if:

* The LLMs have a sufficient "understanding" of the request and of how to write code that fulfills it

* They have a way to validate the suggestion by actually executing the code (at least during training) and inspecting the output (a rough sketch of this follows below)

From what I've seen we are still far away from that. Copilot and GPT-4 seem heavily reliant on very well-commented code and on sources like Stack Overflow.
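To make that second point concrete, here is a minimal sketch of execution-based validation, assuming Python and a simple assert-style test convention: run the candidate snippet together with its tests in a subprocess and keep it only if the tests pass. The function name and the file layout are my own choices for illustration, not anything Copilot or GPT-4 actually does.

    import subprocess
    import sys
    import tempfile

    def passes_tests(generated_code: str, test_code: str, timeout: int = 10) -> bool:
        """Write the candidate code plus its tests to a temp file and execute it."""
        source = generated_code + "\n\n" + test_code
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(source)
            path = f.name
        try:
            result = subprocess.run([sys.executable, path],
                                    capture_output=True, timeout=timeout)
            return result.returncode == 0  # keep the sample only if its tests pass
        except subprocess.TimeoutExpired:
            return False

    # Example: an assert-based "test" for a generated add() function
    # ok = passes_tests("def add(a, b):\n    return a + b",
    #                   "assert add(2, 3) == 5")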


Does this not create a feedback loop, if you're training on data based on things the LLM said?


They're probably generating based on GitHub code.

If I were training a code model, I'd take a snippet of code and have an existing LLM explain it, then use the explanation and the snippet together as training data.
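A rough sketch of that pipeline in Python, assuming the openai>=1.0 client and an illustrative model name; the prompt wording and the JSONL record format are assumptions for illustration, not a known recipe:

    import json
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def make_pair(snippet: str) -> dict:
        """Ask the model to explain a snippet; pair the explanation with the code."""
        resp = client.chat.completions.create(
            model="gpt-4",  # illustrative model name
            messages=[{"role": "user",
                       "content": "Explain what this code does:\n\n" + snippet}],
        )
        explanation = resp.choices[0].message.content
        # The explanation becomes the prompt; the original human-written snippet is the target.
        return {"instruction": explanation, "output": snippet}

    # pairs = [make_pair(s) for s in snippets]   # snippets: code pulled from GitHub
    # with open("pairs.jsonl", "w") as f:
    #     f.writelines(json.dumps(p) + "\n" for p in pairs)

The key design point is that the human-written code stays on the output side, so the model learns to produce real code from generated descriptions rather than the other way around.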



