> So the models are legitimately not viable without massive copyright infringement.
Copyright is not about acquisition, it is about publication and/or distribution. If I get a copy of Harry Potter from a dumpster, I can read it. If a company gets a copy of *all* books from a torrent, they can use it to train their AI. The torrent providers may be in violation of copyright, and if the AI can be used to reproduce substantive portions of the original text, the AI companies then may be in violation of copyright, but simply training a model on illegally distributed text should not be copyright infringement.
> simply training a model on illegally distributed text should not be copyright infringement
You can train a model on copyrighted text, you just can't distribute the output in any way without violating copyright. (edit: depending on the other fair use factors).
One of the big problems is that training is a mechanical process, so there is a direct line between the copyrighted works and the model's output, regardless of the form of the output. Just on those terms it is very likely to be a copyright violation. Even if they don't reproduce substantive portions, what they do reproduce is a derived work.
If that mechanical process is not reversible, then it's not a copyright violation. For instance, I can compute the SHA256 hashes for every book in existence and distribute the resulting table of (ISBN, SHA256) and that is not a copyright violation.
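To make the hash-table example concrete, here is a minimal sketch of building such an (ISBN, SHA256) table in Python. The ISBNs and book texts are made-up placeholders, and `sha256_hex` and `build_hash_table` are hypothetical helper names used only for illustration.

```python
import hashlib


def sha256_hex(text: str) -> str:
    """Return the SHA-256 digest of the text as a 64-character hex string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def build_hash_table(books: dict[str, str]) -> list[tuple[str, str]]:
    """Map each ISBN to the digest of its full text, sorted by ISBN."""
    return sorted((isbn, sha256_hex(text)) for isbn, text in books.items())


# Placeholder data: two "books" that differ by a single character.
books = {
    "000-0-00-000000-1": "It was a dark and stormy night.",
    "000-0-00-000000-2": "It was a dark and stormy night!",
}

table = build_hash_table(books)
for isbn, digest in table:
    print(isbn, digest)
```

Every digest has the same fixed length no matter how long the book is, and the two near-identical texts produce unrelated-looking digests, so the table alone gives no practical way to recover any of the text.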
That's actually within the other fair use factors. So your hash table is fair use because it's transformative and doesn't substitute for the original work.
It's actually even less than fair use, it's non-copyright use: one-way hashes are intentionally designed to eliminate the creative element and output random-looking data.
>One of the big problems is that training is a mechanical process, so there is a direct line between the copyrighted works and the model's output, regardless of the form of the output. Just on those terms it is very likely to be a copyright violation. Even if they don't reproduce substantive portions, what they do reproduce is a derived work.
Google making thumbnails or scanning books are both arguably "mechanical". Both have been ruled as fair use.
What’s a “mechanical process”? If I read The Lord of the Rings and it teaches me to write Star Wars, is that a mechanical process? My brain is governed by the laws of physics, right?
What if I’m a simulated brain running on a chip? What if I’m just a super-smart human and instead of reading and writing in the conventional way, I work out the LLM math in my head to generate the output?
That's an interesting take, but false in a lot of jurisdictions. Even if we ignore the question of whether the model can distribute work, in many places even downloading content is illegal. Otherwise the person torrenting a movie would be totally in the clear. Or think about what MS would say if a company "just" downloaded copies of Windows to use on their computers without ever distributing them.
>Otherwise the person torrenting a movie would be totally in the clear
Any examples of people being sued for merely downloading? "Torrenting" basically always involves uploading, even if you stop immediately after completion. A better test would be if someone was sued for using an illegal streaming site, which to my knowledge has never happened.
I mean, you're right in the abstract. If you train an LLM in a void and never do anything with the model, sure.
But that's not what anyone is doing. People train models so that someone can actually use them. So I'm not sure how your comment is helpful other than to point out that distinction (which doesn't make much difference in this case specifically or in how copyright applies to LLMs in general).
If you buy a machine that prints copies of copyrighted books (built into the machine), and you use that machine and then distribute the resulting copies, and the machine didn't come with a license allowing you to do so, I'm pretty sure that you are liable as well.
At least some current AI providers, however, come with terms of service that promise that they will cover any such legal disputes for you.
You might not be immediately liable, but that doesn't mean you're allowed to continue. I'd assume it's your duty to cease and desist immediately once it's pointed out that you're in violation.
Well, I think that will be the final judgement: we'll treat training data more as distribution than as consumption. Things always get more complicated when you put stuff up for sale. I also can't necessarily get away with making "Garry Botter", who got accepted into an Enchanter school and goes on adventures with Jon and Germione. Unless it's parody, you can only cut so close before you're just infringing anyway, despite making it legally distinct.
> Copyright is not about acquisition, it is about publication and/or distribution. If I get a copy of Harry Potter from a dumpster, I can read it. If a company gets a copy of *all* books from a torrent, they can use it to train their AI.
"a person reading" and "computer processing of data" (training) are not the same thing
MDY Industries, LLC v. Blizzard Entertainment, Inc. rendered the verdict that loading unlicensed copyrighted material from disk was "copying", and hence copyright infringement