Most of the code isn't specific to a model. It happens that LLaMA is roughly the best LLM currently available for the public to run on their own hardware, so that's what people are using. But as soon as anyone publishes a better one, people will switch to that, using largely the same code, and there is no reason it couldn't be open source.
I'm also curious what the copyright status of these models even is, given the "algorithmic output isn't copyrightable" thing and that the models themselves are essentially the algorithmic output of a machine learning algorithm on third party data. What right does Meta have to impose restrictions on the use of that data against people who downloaded it from The Pirate Bay? Wouldn't it be the same model if someone just ran the same algorithm on the same public data?
(Not that the uncertainty isn't an impediment to people who don't want to risk the legal expense of setting a precedent, which models explicitly placed in the public domain would resolve.)
> I'm also curious what the copyright status of these models even is
That's my question as well. The models are clearly derivative works based on other people's copyrighted texts.
Only a twisted court system would allow Google/OpenAI/Facebook to build models on other people's work and then forbid other people to build new models based on those companies' models.
> That's my question as well. The models are clearly derivative works based on other people's copyrighted texts.
That's not that clear either. (Sometimes it's more clear. If you ask the model to write fan fiction, and it does, and you want to claim that isn't a derivative work, good luck with that.)
But the model itself is essentially a collection of data. "In Harry Potter and the Philosopher's Stone, Harry Potter is a wizard" is a fact about a work of fiction, not a work of fiction in itself. Facts generally aren't copyrightable. If you collect enough facts about something you could in principle reconstruct it, but that's not really something we've seen before and it's not obvious how to deal with it.
That's going to create a practical problem if the models get good enough to e.g. emit the full text of the book on request, but the alternative is that it's illegal to make a model that knows everything there is to know about popular culture. Interesting times.
> "...Harry Potter is a wizard" is a fact about a work of fiction, not a work of fiction in itself
But LLMs aren't trained to learn facts like "Harry is a wizard", they're trained to reproduce specific expressions like "You're a wizard, Harry".
That is, they're trained by prompting them with a selection from a (probably copyrighted) text, and their weights are adjusted to make it more likely they'll output the next word of that text.
They're not a collection of general facts, they're a collection of estimates about which word follows which other words, and the order of words is the essence of copyright in text.
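To make the distinction concrete, here's a minimal sketch (a toy bigram counter, nothing like a real LLM, and the training text is a made-up stand-in for a copyrighted source) of what "estimates about which word follows which other words" means:

```python
from collections import Counter, defaultdict

# Hypothetical training text standing in for a copyrighted source.
text = "you're a wizard harry said hagrid you're a wizard"
words = text.split()

# "Train" by counting which word follows which.
follows = defaultdict(Counter)
for prev, nxt in zip(words, words[1:]):
    follows[prev][nxt] += 1

# The resulting "model" is just per-word counts of likely next words:
# statistics derived from the word order of the training text.
def most_likely_next(word):
    return follows[word].most_common(1)[0][0]

print(most_likely_next("a"))       # -> wizard
print(most_likely_next("you're"))  # -> a
```

Even this trivial version shows the tension: the counts are facts *about* the text, but with enough of them you can start reproducing the text itself.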
A probability distribution isn't the order of words, it's a fact about the order of words.
Pedants have been complaining about this kind of thing for years. If you generate random data, no one has a copyright on that. But if you XOR it with a copyrighted work, the result is indistinguishable from random data. No one could tell you which was generated randomly and which was derived from the copyrighted work. But XOR them back together again and you get the copyrighted work.
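The XOR trick is easy to demonstrate. A rough sketch (the "copyrighted work" here is just a placeholder string):

```python
import secrets

work = b"All rights reserved."           # stands in for a copyrighted work
pad = secrets.token_bytes(len(work))     # genuinely random bits

# XOR the work with the random pad. The result is statistically
# indistinguishable from random data on its own.
masked = bytes(a ^ b for a, b in zip(work, pad))

# Neither `pad` nor `masked` alone reveals anything, but XOR them
# back together and the original work reappears.
recovered = bytes(a ^ b for a, b in zip(masked, pad))
print(recovered.decode())
```

Given only `pad` and `masked`, there's no mathematical basis for calling one of them "the infringing one"; either could be the random half.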
Things like that get solved pragmatically, not mathematically. There is no basis for saying that one set of random bits is infringing and the other isn't, but if you're distributing them for the sole purpose of allowing people to reconstitute the copyrighted work, you're going to be in trouble.
Now we have something with different practicalities. The purpose of training the model on existing works is so that it can e.g. answer questions about Harry Potter, which the majority wants to be possible and is the same class of thing that search engines need to be able to do. But the same model can then produce fan fiction as an emergent property, so what now?