Most of the code isn't specific to a model. It happens that LLaMA is roughly the best LLM currently available for the public to run on their own hardware, so that's what people are using. But as soon as anyone publishes a better one, people will switch to that, using largely the same code, and there is no reason it couldn't be open source.
I'm also curious what the copyright status of these models even is, given the "algorithmic output isn't copyrightable" thing and that the models themselves are essentially the algorithmic output of a machine learning algorithm on third party data. What right does Meta have to impose restrictions on the use of that data against people who downloaded it from The Pirate Bay? Wouldn't it be the same model if someone just ran the same algorithm on the same public data?
(Not that the uncertainty isn't an impediment to people who don't want to risk the legal expense of setting a precedent, which models explicitly placed in the public domain would resolve.)
> I'm also curious what the copyright status of these models even is
That's my question as well. The models are clearly derivative works based on other people's copyrighted texts.
Only a twisted court system would allow Google/OpenAI/Facebook to build models on other people's work and then forbid other people to build new models based on those companies' models.
> That's my question as well. The models are clearly derivative works based on other people's copyrighted texts.
That's not that clear either. (Sometimes it's more clear. If you ask the model to write fan fiction, and it does, and you want to claim that isn't a derivative work, good luck with that.)
But the model itself is essentially a collection of data. "In Harry Potter and the Philosopher's Stone, Harry Potter is a wizard" is a fact about a work of fiction, not a work of fiction in itself. Facts generally aren't copyrightable. If you collect enough facts about something you could in principle reconstruct it, but that's not really something we've seen before and it's not obvious how to deal with it.
That's going to create a practical problem if the models get good enough to e.g. emit the full text of the book on request, but the alternative is that it's illegal to make a model that knows everything there is to know about popular culture. Interesting times.
> "...Harry Potter is a wizard" is a fact about a work of fiction, not a work of fiction in itself
But LLMs aren't trained to learn facts like "Harry is a wizard", they're trained to reproduce specific expressions like "You're a wizard, Harry".
That is, they're trained by prompting them with a selection from a (probably copyrighted) text, and their weights are adjusted to make it more likely they'll output the next word of that text.
They're not a collection of general facts, they're a collection of estimates about which word follows which other words, and the order of words is the essence of copyright in text.
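To make the distinction concrete, here's a minimal sketch (a toy bigram counter, nothing like a real LLM, and the training text is a made-up stand-in for a copyrighted source) of what "estimates about which word follows which other words" means:

```python
from collections import Counter, defaultdict

# Hypothetical training text standing in for a copyrighted source.
text = "you're a wizard harry said hagrid you're a wizard"
words = text.split()

# "Train" by counting which word follows which.
follows = defaultdict(Counter)
for prev, nxt in zip(words, words[1:]):
    follows[prev][nxt] += 1

# The resulting "model" is just per-word counts of likely next words:
# statistics derived from the word order of the training text.
def most_likely_next(word):
    return follows[word].most_common(1)[0][0]

print(most_likely_next("a"))       # -> wizard
print(most_likely_next("you're"))  # -> a
```

Even this trivial version shows the tension: the counts are facts *about* the text, but with enough of them you can start reproducing the text itself.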
A probability distribution isn't the order of words, it's a fact about the order of words.
Pedants have been complaining about this kind of thing for years. If you generate random data, no one has a copyright on that. But if you XOR it with a copyrighted work, the result is indistinguishable from random data. No one could tell you which was generated randomly and which was derived from the copyrighted work. But XOR them back together again and you get the copyrighted work.
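The XOR trick is easy to demonstrate. A rough sketch (the "copyrighted work" here is just a placeholder string):

```python
import secrets

work = b"All rights reserved."           # stands in for a copyrighted work
pad = secrets.token_bytes(len(work))     # genuinely random bits

# XOR the work with the random pad. The result is statistically
# indistinguishable from random data on its own.
masked = bytes(a ^ b for a, b in zip(work, pad))

# Neither `pad` nor `masked` alone reveals anything, but XOR them
# back together and the original work reappears.
recovered = bytes(a ^ b for a, b in zip(masked, pad))
print(recovered.decode())
```

Given only `pad` and `masked`, there's no mathematical basis for calling one of them "the infringing one"; either could be the random half.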
Things like that get solved pragmatically, not mathematically. There is no basis for saying that one set of random bits is infringing and the other isn't, but if you're distributing them for the sole purpose of allowing people to reconstitute the copyrighted work, you're going to be in trouble.
Now we have something with different practicalities. The purpose of training the model on existing works is so that it can e.g. answer questions about Harry Potter, which the majority wants to be possible and is the same class of thing that search engines need to be able to do. But the same model can then produce fan fiction as an emergent property, so what now?