
The future of training seems to be, at least partly, in synthetic data. I can imagine systems where a "data synthesizer" LLM is trained on open data and probably some licensed data. The synthesizer then generates data "to spec" to train larger models. MoE-type models will likely take different approaches, insofar as something like a mathematics expert can get a long way with training data from out-of-copyright works by Newton, Euler, et al. A rough sketch of what "to spec" generation could look like is below.
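(In Python; complete() here is a made-up placeholder for whatever API the synthesizer model would expose, and the spec prompt is invented for illustration.)

    # Sketch of "to spec" generation; nothing here is a real API.
    SPEC = ("Write a self-contained calculus problem in the style of an "
            "18th-century text, followed by a rigorous step-by-step solution.")

    def complete(prompt: str) -> str:
        """Placeholder for a call into the synthesizer LLM."""
        return "<generated problem and solution>"

    def synthesize(n: int) -> list[dict]:
        # Each record pairs the spec with generated text; the batch then
        # feeds the training set of the larger downstream model.
        return [{"spec": SPEC, "text": complete(SPEC)} for _ in range(n)]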


It's already how we fine-tune open-source LLMs. All of them live off data exfiltrated from GPT-4, and it seems to help close the gap fast. Microsoft has a whole family of papers on this idea: TinyStories, Phi-1, Phi-1.5, Phi-2...

Synthetic data has many advantages. For one, it is free of copyright issues: downstream models can't possibly violate copyright if they never saw the copyrighted works to begin with.

It is also more diverse, and we can ensure higher average quality and less bias. It can merge information across multiple sources. Sometimes we can filter it using feedback from code execution, simulations, preference models, or humans. If you can "execute" the LLM output and get a score, you're onto a self-improving loop: LLMs can act as agents, collecting their own experiences and feedback. The execution-feedback version looks roughly like the sketch below.
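(A minimal sketch in Python, assuming a hypothetical generate() that asks an LLM for (problem, solution, tests) triples; only the execution scoring is concrete.)

    import os, subprocess, sys, tempfile

    def passes_tests(solution_code: str, test_code: str, timeout: int = 5) -> bool:
        """Score an LLM output by actually running it against its tests."""
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(solution_code + "\n\n" + test_code)
            path = f.name
        try:
            result = subprocess.run([sys.executable, path],
                                    capture_output=True, timeout=timeout)
            return result.returncode == 0
        except subprocess.TimeoutExpired:
            return False
        finally:
            os.unlink(path)

    def generate(n: int):
        """Placeholder for the LLM call; yields (problem, solution, tests)."""
        yield ("add two ints",
               "def add(a, b):\n    return a + b",
               "assert add(2, 3) == 5")

    # Keep only samples whose code runs clean; survivors become the next
    # round's training data, closing the self-improvement loop.
    dataset = [(p, s) for p, s, t in generate(10_000) if passes_tests(s, t)]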

I think GPTs are a ploy by OpenAI to collect synthetic data, with humans in the loop and tools, to improve their datasets. The data would be in-domain for real users, and it would contain LLM errors together with the feedback that corrects them. Very good data, on-policy. My estimate for 100M users at 10K tokens per month per user is 1T synthetic tokens per month; in a year they double the size of the GPT-4 training set. And we're paying and working for it.
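(The arithmetic, for anyone checking; both inputs are the assumptions above, not official figures.)

    users = 100_000_000             # 100M users (assumed)
    tokens_per_user_month = 10_000  # 10K tokens per user per month (assumed)

    per_month = users * tokens_per_user_month
    per_year = per_month * 12
    print(f"{per_month:,} tokens/month")  # 1,000,000,000,000 -> 1T/month
    print(f"{per_year:,} tokens/year")    # 12,000,000,000,000 -> 12T/year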

But fortunately, 12 months after they release GPT-5, we will recover 90% of its abilities in open-source models.


> Synthetic data has many advantages. For one, it is free of copyright issues: downstream models can't possibly violate copyright if they never saw the copyrighted works to begin with.

I feel like we don't know if this is true or not. If we decide models trained on copyrighted data aren't fair game, it's possible we'll decide "laundered" data also isn't.

I mean, maybe that's not feasible. And I hope we don't decide training on copyrighted material is off-limits anyway. But I don't think we know yet.

But also - you can totally violate the copyright of something you never saw.


It's a matter of ensuring the synthetic content is different enough from the reference content. We can filter for that; one crude version follows.
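(A sketch in Python: reject synthetic text that shares too many word 8-grams with any reference document. Both the n and the 5% threshold are illustrative, not tuned values.)

    def ngrams(text: str, n: int = 8) -> set:
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    def too_similar(candidate: str, references: list[str],
                    max_overlap: float = 0.05) -> bool:
        cand = ngrams(candidate)
        if not cand:
            return False
        # Fraction of the candidate's 8-grams appearing verbatim in a reference.
        return any(len(cand & ngrams(ref)) / len(cand) > max_overlap
                   for ref in references)

    # usage: kept = [s for s in synthetic_batch if not too_similar(s, refs)]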


Sure, but what matters for copyright is output, not input. For now.

If we make the (poor, imo) decision to prevent training on copyrighted data, that's a restriction on the training process, not on its result.

And in the world where we're making bad decisions to put legal restrictions on the training process, "can't train on data obtained by models that were trained without these restrictions" seems on the table.



