I’ve been doing this in my private codebase. When Copilot hallucinates a function, I just go and write the thing. It’s usually a good idea, and Copilot will re-hallucinate the same function independently in another file.
The only way this is useful in the context of code is if:
* The LLMs have a sufficient "understanding" of the request and of how to write code that fulfills it
* They have a way to validate the suggestion by actually executing the code (at least during training) and inspecting the output (see the sketch after this list)
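Something like this toy check is what I mean by the second point: run the suggested snippet and compare what it prints. The function name `passes_check` and the expected-output convention are just made up for illustration, assuming the suggestion is a self-contained Python snippet.

```python
import subprocess

def passes_check(snippet: str, expected_output: str, timeout: float = 5.0) -> bool:
    """Run the snippet in a subprocess and check what it prints."""
    try:
        result = subprocess.run(
            ["python", "-c", snippet],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0 and result.stdout.strip() == expected_output.strip()

# e.g. a suggestion for "print the first three squares"
print(passes_check("print([n * n for n in range(1, 4)])", "[1, 4, 9]"))  # True
```

In a training loop this pass/fail signal could act as a reward or filter, rather than trusting the model's output blindly.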
From what I've seen, we are still far away from that; Copilot and GPT-4 seem heavily reliant on very well-commented code and on sources like Stack Overflow.
If I were training a code model, I'd take a snippet of code and have an existing LLM explain it, then use the explanation and the snippet as the test data.
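A rough sketch of that pipeline: explain existing snippets with a current LLM and save the (explanation, snippet) pairs. The prompt wording, the model name, and the use of the OpenAI client here are just one way to do it, not a specific recipe.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def explain(snippet: str) -> str:
    """Ask an existing model to describe what the snippet does."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": f"Explain concisely what this code does:\n\n{snippet}",
        }],
    )
    return resp.choices[0].message.content

def build_pairs(snippets: list[str], out_path: str = "pairs.jsonl") -> None:
    """Write one {explanation, snippet} record per line."""
    with open(out_path, "w") as f:
        for snippet in snippets:
            f.write(json.dumps({"explanation": explain(snippet),
                                "snippet": snippet}) + "\n")
```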