Sony v Universal established a very important legal doctrine with regard to "commercially significant non-infringing use". You can take my word for it, you can go and do your own research, you can confirm with an IP lawyer, or you can wait for the court's opinion.
Or I guess you can give me a little bit of time to go and help you do some of your own research, which I will do right now, so just hold on a bit!
I'm not trying to trick people or win some hypothetical argument, I'm trying to help people see how the courts will consider these issues and how and why we should agree with their rulings that these tools are fair use of copyright protected works!
I love copyright, probably more than most people on these forums, but the liability needs to be on the people using these tools. Supabase Clippy is fantastic and they should not bear any costs, even implicitly through OpenAI paying royalties. Someone who releases a "Sarah Anderson Cartoon Maker" tool using Stable Diffusion should still be found to have violated Sarah Anderson's common law right of publicity, just as someone who releases a "Vacation Photo Background Cleaner-Upper" tool using Stable Diffusion should not bear any costs, even implicitly through StabilityAI paying royalties to Sarah Anderson.
These must be considered in their capacities as tools regardless of how they were made, and for reasons foundational to the law itself: How can you prove that I used a given tool such as Stable Diffusion or the "Vacation Photo Background Cleaner-Upper" without concrete evidence, such as the presence of the software on my laptop? Can law enforcement get a warrant to search based on no visible evidence at all that Sarah Anderson's works were somehow used in the training process of a tool that I have on my private property?
Edit:
---
In the Texas Law Review in March 2021, Mark Lemley, a Stanford law professor, and Bryan Casey, then a lecturer in law at Stanford, posed a question: "Will copyright law allow robots to learn?" They argue that, at least in the United States, it should.
"[Machine learning] systems should generally be able to use databases for training, whether or not the contents of that database are copyrighted," they wrote, adding that copyright law isn't the right tool to regulate abuses.
But when it comes to the output of these models – the code suggestions automatically made by the likes of Copilot – the potential for the copyright claim proposed by Butterick looks stronger.
"I actually think there's a decent chance there is a good copyright claim," said Tyler Ochoa, a professor in the law department at Santa Clara University in California, in a phone interview with The Register.
In terms of the ingestion of publicly accessible code, Ochoa said, there may be software license violations but that's probably protected by fair use. While there hasn't been a lot of litigation about that, a number of scholars have taken that position and he said he's inclined to agree.
Both Lemley and Ochoa state that the models themselves are probably protected by fair use. That means it is perfectly fine for OpenAI to train its models on publicly accessible copyright-protected works without asking for permission, paying any royalties, or adhering to any of the terms of the license.
They are also free to distribute this tool and to charge people to use it.
What Ochoa means by a good chance of a copyright claim is that the tool doesn't absolve its users of copyright violation. The liability is on the person using the tool, regardless of which tool is used. With copyright it's not intent that matters; it's that you ended up publishing something that looks enough like someone else's picture that twelve jurors would consider it not too different from a simple photocopy.
Now, this can still be a problem for Copilot because an engineer's company might not want to be injecting a lot of copyright protected code into their products, but for the most part the outputs from Copilot have been non-infringing. That it sometimes produces infringing code does not matter to anyone other than the person using Copilot.
Ochoa goes on in detail about what is and isn't covered by copyright with regards to code, which is one of the things that gives me confidence to use Copilot and know that I'm not putting myself at risk:
But in terms of where Copilot may be vulnerable to a copyright claim, Ochoa believes LLMs that output source code – more so than models that generate images – are likely to echo training data. That may be problematic for GitHub.
"When you're trying to output code, source code, I think you have a very high likelihood that the code that you output is going to look like one or more of the inputs, because the whole point of code is to achieve something functional," he said. "Once something works well, lots of other people are going to repeat it."
Ochoa argues the output is likely to be the same as the training data for one of two reasons: "One is there's only one good way to do it. And the other is [you're] copying basically an open source solution.
"If there's only one good way to do it, OK, then that's probably not eligible for copyright. But chances are that there's just a lot of code in [the training data] that has used the same open source solution, and that the output is going to look very similar to that. And that's just copying."
In other words, the model may suggest code to solve a problem for which there's only really one practical solution, or it's copying from someone's open source that does the same thing. In either case, that's probably because a lot of people have used the same code, and that shows up a lot in the training data, leading to the assistant regurgitating it.
So in practice, it is pretty easy to tell that Copilot is spitting out purely functional suggestions basically all of the time, as there isn't really any other way to wire up a unit test or call a specific API.
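As a concrete illustration (my own hypothetical snippet, not actual Copilot output): there is essentially one idiomatic way to express a trivial unit test in Python, so any two developers, or a model trained on their code, will converge on near-identical text. Expression this constrained by function is generally the kind courts decline to protect.

```python
# Hypothetical example of "purely functional" code: the standard idiom for a
# trivial unit test. The functional goal dictates the form almost completely,
# leaving little room for the creative expression copyright protects.

def add(a, b):
    return a + b

def test_add_returns_sum():
    # Call the function, assert the expected value. There is essentially
    # no other way to wire this up, so independent authors converge on it.
    assert add(2, 3) == 5

test_add_returns_sum()
```

Compare that to, say, the architecture of a whole module: there the author makes many discretionary choices, which is where expressive (and protectable) content starts to appear.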
Ironically, if Copilot gets better at "software architecture" then it starts to cross over into the expressive parts of software that are indeed covered by copyright, meaning these issues of liability become harder for the end user to discern and enough of a problem that GitHub would want to figure out attribution or somehow "clear" the suggestions for the user.
Edit: after the text you added, I believe we are essentially in agreement. IF it is accepted that the way GPT3 was created is a fair use of the works in the training data, THEN I fully agree that (1) OpenAI has every right to sell it even if (2) some uses of it would still constitute copyright infringement, since (3) only specific users would be liable for copyright infringement in their uses.
Where we differ is in how certain we are that the IF is true. I for one believe there is a good chance that the training of an LLM on copyrighted works does infringe on the copyright of those works (if no other exceptions apply, such as the LLM being trained only for academic research purposes, of course).
My original response:
Where Sony v Universal definitely applies though is when evaluating whether OpenAI's selling of GPT3 to others who then use it to create copyright-infringing works would make OpenAI liable for contributory infringement. Here, the similarities are crystal clear, and the conclusion is simple: since there clearly exist non-infringing uses of GPT3 (such as Supabase Clippy), OpenAI is fully in the clear to sell GPT3, just as much as Sony was for selling the VCR.
However, this assumes that OpenAI has the rights to the IP of GPT3 itself in the first place, which is a prerequisite to them being allowed to sell it at all. Sony certainly had the rights to the IP of the VCR - Universal never claimed that the VCR was a derivative work of their movies.
Essentially, in Sony v Universal, Universal was claiming (1) that Sony was liable for contributory infringement, since (2) all customers who used the VCR to record and then play back a Universal show were guilty of copyright infringement. The court established that (2) was in fact fair use, and from there (1) automatically became false, since now there was an established legal way of using the Sony product.
But, in a hypothetical OpenAI v Universal, Universal could plausibly claim that (1) OpenAI is liable for copyright infringement directly, since they are distributing GPT3, (2) which is a derivative work of Universal's IP used in the training set of GPT3.
I honestly think you're more certain because you want the courts to rule in a certain manner.
I'm more certain because I'm thinking about this separately from my opinions about how the courts will rule. Yes, I will just so happen to agree with that ruling, because I agree with the logic embodied in our legal process. I agree that the existing legal doctrines already capture the spirit of what we are asking the courts to judge. I agree that the statutory law, case law, and doctrine that inform their judgment will successfully balance the limited rights of copyright holders against the natural right of the public to unburdened access to the arts, knowledge, and information. And I agree with their process of weighing the potential impact on existing commercial practice against the potential impact on new forms of commercially significant non-infringing practice.
Some things in copyright might just seem unfair, like the case of Baker v Selden:
In 1859, Charles Selden obtained copyright in a book he wrote called Selden's Condensed Ledger, or Book-keeping Simplified, which described an improved system of book-keeping. The book contained about twenty pages, primarily book-keeping forms, and only about 650 words, plus examples and an introduction. In the following years Selden made several other books improving on the initial system. In total, Selden wrote six books, though evidence suggests that they were really six editions of the same book.
Selden, however, was unsuccessful in selling his books. He originally believed he could sell his system to several counties and the United States Department of the Treasury. Those sales never happened. Selden was forced to assign his interest—an interest that apparently was returned to his wife after his death in 1871.
In 1867, W.C.M. Baker produced a book describing a very similar system. Unlike Selden, Baker was more successful at selling his book, selling it to some 40 counties within five years.
Selden's widow, Elizabeth Selden, hired an attorney, Samuel S. Fisher, a former Commissioner of Patents. In 1872, Fisher filed suit against Baker for copyright infringement.
The poor old widow lost. Boohoo. But this was a just ruling!
I'm not sure that the people who think that ChatGPT is guilty of copyright infringement are thinking about the issue in a balanced manner. Luckily our courts probably will!
One strategy that the defense could use to lower their risk profile is to allow open access to their models and allow an entire ecosystem of commercially significant non-infringing uses to blossom because they are aware of how the courts will be influenced based on existing statutory and legal doctrine...
https://www.theregister.com/2022/10/19/github_copilot_copyri...
https://texaslawreview.org/fair-learning/