Why? A great part of the material that matters for knowledge and serious intellectual exercise is not "open" (it may be "accessible", yet it is under copyright).
It’s difficult to argue that a model is truly “open” if the creator won’t even tell you what they trained it on. Even as companies like Meta argue that training on copyrighted material is OK, they still don’t want to openly admit that that’s what they did - and providing their training data, which would likely be a giant list of books from LibGen, would give the game away.
This is a good place to interject a point about training on copyrighted material that people overlook, conflate, or simply don't realize.
There are two types of "training on copyrighted material":
1.) Training on material that is copyrighted and behind a paywall, where you circumvent the paywall. This is unambiguously illegal, as the material is pay-to-view.
2.) Training on material that is copyrighted, but free for anyone to consume. This is legally ambiguous, but right now the courts seem to be leaning in favor of AI training - as long as the models don't reproduce the material verbatim.
There is also a further point about copyrighted material being ad-supported, since the AI obviously doesn't view or care about ads. There is a decent case to be made that this is illegal, but then is ad blocking theft as well?
Yeah, I'm not fundamentally opposed to them training on copyrighted material, though I expect at some point to be thoroughly frustrated while trying to access such a dataset.
I do take personal affront when they dump models without specifying their datasets. They're just polluting the information space at that point.
Right, the original LLaMA paper. LLaMA 2 and 3 are significantly more capable models trained on orders of magnitude more data, and those papers notably do not say where the data comes from. The LLaMA 3 paper helpfully mentions that "Much of the data we utilize is obtained from the web," so I guess that's better than nothing!
It's important to note that AMD is an IP company, while Meta is a data company. AMD would be shooting itself in the foot if it normalized flagrantly violating the IP of others. Meta doesn't care about IP; they just want to sell ads and data.
You must have misunderstood the post: the OP wrote «great to see that they even trained it on open datasets», to which I replied, in essence, "why should that be «great»? Machines with the sought-after "intellectual" abilities must be trained on a larger corpus than just open material".
The point is not about assigning any «fault» to AMD. It is about asking "why would it be great to have LLMs trained on limited (open) data?".