Why? A great part of the material that matters for knowledge and serious intellectual exercise is not "open" (it may be "accessible", yet it is under copyright).
It’s difficult to argue that a model is truly “open” if the creator won’t even tell you what they trained it on. Even as companies like Meta argue that training on copyrighted material is OK, they still don’t want to openly admit that that’s what they did - and providing their training data, which would likely be a giant list of books from LibGen, would give the game away.
This is a good place to interject a point about training on copyrighted material that people overlook, conflate, or simply don't realize.
There are two types of "training on copyrighted material":
1.) Training on material that is copyrighted and behind a paywall, where you circumvent the paywall. This is unambiguously illegal, as the material is pay-to-view.
2.) Training on material that is copyrighted, but free for anyone to consume. This is legally ambiguous, but right now the courts seem to be leaning in favor of AI training - as long as the models don't reproduce the material verbatim.
There is also a further point about copyrighted material being ad-supported, since the AI obviously doesn't view or care about ads. There is a decent case to be made that this is illegal, but then is ad blocking theft as well?
Yeah, I'm not fundamentally opposed to them training on copyrighted material, though I expect at some point to be thoroughly frustrated while trying to access such a dataset.
I do take personal affront when they dump models without specifying their datasets. They're just polluting the information space at that point.
Right, the original LLaMA paper. LLaMA 2 and 3 are significantly more capable models trained on orders of magnitude more data, and those papers notably do not say where the data comes from. The LLaMA 3 paper helpfully mentions that "Much of the data we utilize is obtained from the web," so I guess that's better than nothing!
It's important to note that AMD is an IP company, while Meta is a data company. AMD would be shooting itself in the foot if it normalized flagrantly violating the IP of others. Meta doesn't care about IP; they just want to sell ads and data.
You must have misunderstood the post: the OP wrote «great to see that they even trained it on open datasets», to which I replied, in essence, "why should that be «great»? Machines with the sought-after "intellectual" abilities must be trained on a larger corpus than just open material".
The point is not about assigning any «fault» to AMD. It is about asking "why would it be great to have LLMs trained on limited (open) data?".