Microsoft want court to toss lawsuit accusing them of abusing open-source code (reuters.com)
126 points by ThaDood on Feb 10, 2023 | 63 comments


Is there anything out of the ordinary here? Doesn't basically every lawsuit have the defendant file a motion to dismiss, based on any halfway plausible reason?


It was a successful strategy for VMware when approached by a German developer about improper licensing for his open source code. VMware managed to get the original case tossed on a technicality, as well as the appeal, which bought them enough time to drop the Linux code entirely and avoid discovery, where they would most certainly have been found in violation.

https://www.zdnet.com/article/linux-developer-abandons-vmwar...

https://www.zdnet.com/article/vmware-sued-for-failure-to-com...

https://en.wikipedia.org/wiki/Vmlinux


Not only is the answer "no, the defendant always files a motion to dismiss," it's a good strategy because it forces the plaintiff to say something on the record.


There is not. This is standard operating procedure. Getting a case thrown out saves so much money that it is entirely worth having your legal team try to make it happen before the real work starts.


To those interested in watching the particulars of this case, this is not a surprising development. But the play-by-play is interesting. Sports announcers manage to talk continuously during a game, and don't sit there silently and say "team A won with 20 points to team B's 5 points, what a game" at the very end. Personally, I don't care for the sportsguy blathering about a game nor the end results, and prefer to read about legal shenanigans.


Perhaps a meta discussion is needed here about whether a lawsuit should be dismissable in a scenario like this, where everyone understands that there is a legal problem in need of future guidance.


I have a feeling that no matter what the outcome is here, it's not going to be satisfying.

One extreme is that AI is allowed to spit out copyrighted code verbatim as long as it technically goes through an AI first. Of course, that defeats all open-source licenses by adding a backdoor around them.

The other extreme is that AI is not allowed to spit out a single line of copyrighted code, in which case we'll have endless lawsuits to figure out if CodeGPT used a GPL-licensed fast inverse square root or if it used the public-domain fast inverse square root.

I think we'll land somewhere in the middle: If an AI regurgitates a "substantial" number of lines of code, then its creators can be held liable (a.k.a. the "we'll know it when we see it" standard).


"It violates the licenses that open-source programmers chose and monetizes their code despite GitHub's pledge never to do so."

Microsoft never changes. Always looking for a dishonest buck. Does 'Embrace, Extend, and Extinguish' ring a bell for younger players? Thought not.


It's not just Microsoft, it's the developer freeloading culture. Once we start paying instead of freeloading, things will change.


Why does the post title omit OpenAI, so it no longer matches the article’s title?

> OpenAI, Microsoft want court to toss lawsuit accusing them of abusing open-source code


Character limit.


https://en.m.wikipedia.org/wiki/Licence_laundering

Seems pretty obvious to me but we'll see how it goes in the court.


Huh, I always had this concept in my mind but never knew it actually had a phrase with some legal precedent.


If open source devs aren’t allowed to use the source code of Windows to improve react, why the fuck should Microsoft be able to copy and paste other people’s code for profit?


Open source devs are allowed to learn from proprietary source code, including Windows' decompiled binaries, if they so choose. The fact that ReactOS and Wine have chosen to essentially self-sabotage by adopting a "clean-room" black-box policy does not mean other projects must do so. Those policies are self-inflicted wounds, not mandated by any legal cases or standards.


How would an open source windows improve react? The tooling?


I think they mean ReactOS.


Is it still Fair Use if you make money with it?


Making money from the work is one of the factors for Fair Use but it is not an automatic-fail kind of situation. A judge/jury would need to hear the facts and consider all the factors.

Here's a good link explaining the Fair Use test: https://copyright.columbia.edu/basics/fair-use.html


But it's pretty clear from the start that Copilot is neither criticism, comment, news reporting, teaching, scholarship, nor research, which are the purposes covered by fair use.


The law on fair use says "such as" before the categories you list; it is not read to mean that you MUST be in one of those categories to enjoy fair use. In fact, it goes on in the next sentence to say you need to consider the 4-factor test in every case.

> The fair use of a copyrighted work ... for purposes such as criticism, comment, news reporting, teaching (including multiple copies for classroom use), scholarship, or research, is not an infringement of copyright. In determining whether the use made of a work in any particular case is a fair use the factors to be considered shall include ...

https://www.law.cornell.edu/uscode/text/17/107


Sounds even worse for Copilot

(1) the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes; (2) the nature of the copyrighted work; (3) the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and (4) the effect of the use upon the potential market for or value of the copyrighted work.

It's commercial, they use all of the code to build the model, and the original code loses its value because you can get it through Copilot. That's 3 of 4; depending on the original license, it's 4 of 4 against fair use.


> they use all of the code to build the model

That's not what "portion used" means. If you summarize a book then the portion used is <1%, not the entire book.

> the original code loses its value because you can get it through Copilot

That's not even remotely true. You might get a fragment or two but you have to rebuild a program from scratch to replace it.

And as far as "nature of the copyrighted work" it's a completely different beast. It's a programming tool instead of whatever code was fed into it.

Only commerciality is a clear mark against it, and that factor is far from decisive by itself.


Writing a summary of a book that incorporates 0.1% of the original book is likely fine under fair use.

If you ask an AI to write 1000 unique summaries of the same book, each including a unique 0.1% copy of the book, what you have is a convoluted copying protocol. Asking the AI to write 1000, or 10000, or an arbitrarily large number of fragments won't change the fundamentals of what is being done.

In the end you have to ask what a judge and jury will say. In the BitTorrent protocol you split a file into tiny fragments of between 32 kB and 16 MB, and in the beginning people did make the claim that such small fragments could not possibly be copyrightable. A 32 kB portion of a 10 GB movie is so small that it has no significant relationship with the original work. Courts disagreed and people went to jail.
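
The arithmetic in that argument can be sketched in a few lines of Python. This is only an illustration: the sample "book", the helper name `split_into_fragments`, and the 1000-way split are made-up assumptions, and nothing here models how Copilot or BitTorrent actually work. It just shows that fragments which are each about 0.1% of a work still reassemble into the whole work, and how small a 32 kB piece of a 10 GB file is.

```python
def split_into_fragments(text: str, count: int) -> list[str]:
    """Split text into at most `count` contiguous fragments of near-equal size.

    Assumes non-empty text and count > 0 (illustrative helper only).
    """
    size = -(-len(text) // count)  # ceiling division
    return [text[i:i + size] for i in range(0, len(text), size)]


# Hypothetical stand-in for a book: 1000 short sentences.
book = "".join(f"sentence {i}. " for i in range(1000))

fragments = split_into_fragments(book, 1000)

# Each fragment is a tiny share of the whole...
largest_share = max(len(f) for f in fragments) / len(book)

# ...but together the fragments are the whole work.
reassembled = "".join(fragments)

# Share of a 10 GB file covered by one 32 kB piece (binary units assumed).
piece_share = (32 * 1024) / (10 * 1024**3)

print(f"largest fragment share: {largest_share:.4%}")
print(f"fragments reassemble the whole: {reassembled == book}")
print(f"32 kB piece of a 10 GB file: {piece_share:.6%}")
```

Whether a court treats each fragment separately or looks at the reassembled whole is the legal question; the code only shows why the per-fragment percentage is not the whole story.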


You'd have to keep feeding the book into the AI for it to do that. It hasn't memorized the entire book, and you can't get the whole book back out of it just by asking questions.

Copilot is basically a single interactive summary of all of GitHub.

I would not currently worry about the threat of someone making a thousand wildly different copilots that deliberately memorize different fragments. Especially because I expect the real copilot to be tuned over time to reduce the number of fragments it picks up. But if such a person emerges, it's clear that they are the problematic actor.


If you ask the AI to do that, is the AI doing the infringement - or are you the human doing the infringement?

Was the creator of the BitTorrent protocol liable? or the person transmitting the file?

Is Xerox liable for the copy made? or the person using the copier?


Those questions were part of the original discussions that the pirate movement had around 2005. Copyright was viewed as being about a single person copying a work. The person who uploaded shouldn't be allowed to be charged with the same crime, and the site that connected the uploader and downloader was just a meeting place.

Lobby organizations for rights owners presented their own theories; one was the concept of "making available" a copyrighted work. They argued that the uploader was equally if not more guilty of infringement than the downloader. They also accused the website owners of facilitating and enabling.

Then came The Pirate Bay case, and a glaring issue struck the pirate movement. If technology can make creative solutions using code to bypass copyright, courts can in turn make creative solutions around law. The law that was used to charge the founders of The Pirate Bay was originally intended to combat biker bars when those places were used as headquarters for illegal gangs, a far step away from a website hosting files which enable two people on the internet to transfer files.

So we can go around and blame the AI for doing the infringement, or even the researcher who invented the math that created AI, but as with any creative technical solution around copyright we have to ask what creative solutions the lawyers and judges will make.


Was the creator of Napster liable or only the person using it?

If Xerox allowed people to copy money, they would be liable.


Yeah I think it'll be an uphill battle for copilot, but I'm not 100% convinced yet. One factor can weigh so much more heavily than the others that they still have a chance with 3/4 against them.

The original license of the code used to train on doesn't really matter to the fair use question, that only matters once the fair use defense fails and the court has to decide a remedy.


You're forgetting transformative use. Think about search engines.


Should a search engine be allowed to get around the copyright on a novel by letting me search for and retrieve every sentence individually in small chunks?

Also, there is a difference here. A fair use quote from another work is a reference. It's not the thing, it's referring to the thing.

When copilot takes a chunk of code from another work to put into yours, it's using the thing directly, or rather, you are by using it.

It lacks citation which a quote would have, and instead of being a quote to discuss the other work "Dr Foo once said <remarkable genius insight>" copilot is like you writing a novel and simply copying Dr Foo's remarkable genius insight.

The size of the snippet doesn't matter, it's the usage and the lack of citation.

Even a cheap-ass, totally doable collective citation would be good enough: get all the contributors to agree to have their works included, keep a big list of all contributors somewhere, and then each user just needs to say "includes code from copilot collective". They don't even have that.


Right, but a search engine takes in web pages and outputs search results. Copilot takes in code and outputs code.


In the USA, yes. For example, the unanimous ruling in Campbell v. Acuff-Rose Music, Inc. determined that parody is fair use, even if the parody is of a commercial nature:

> Held: 2 Live Crew's commercial parody may be a fair use within the meaning of § 107. Pp. 574-594.

The ruling states explicitly that commercial usage can be a determining factor in determining whether usage is fair or not, but that it does not in and of itself make the use "unfair".

https://supreme.justia.com/cases/federal/us/510/569/


Not necessarily. For example, you can make money by writing a review of a book that includes quotes from the book; that is considered fair use. But if you make money by publishing a book that consists solely of quotes from other books that others have copyrighted, on the grounds that this assembly of quotes from other books might be useful to future authors, that would not be fair use.

To me, the latter scenario is much closer to what Github is doing with Copilot, which is one of the things the plaintiffs are alleging violates open source licenses.


I don't think that question is particularly relevant to the case. A newspaper can publish, for profit, a book review which quotes excerpts. As far as I understand it, the case hinges on the distribution of major portions of copyrighted works and derivatives thereof, in violation of their licenses. Likewise, see Aaron Swartz, sci-hub, etc -- distribution of copyrighted works need not be for profit to be a violation.


Yes. Perfect 10 vs Google https://cyber.harvard.edu/people/tfisher/IP/2007%20Perfect%2...

> Additionally, the district court determined that the commercial nature of Google's use weighed against its transformative nature. Although Kelly held that the commercial use of the photographer's images by Arriba's search engine was less exploitative than typical commercial use, and thus weighed only slightly against a finding of fair use, the district court here distinguished Kelly on the ground that some website owners in the AdSense program had infringing Perfect 10 images on their websites. The district court held that because Google's thumbnails "lead users to sites that directly benefit Google's bottom line," the AdSense program increased the commercial nature of Google's use of Perfect 10's images.

> In conducting our case-specific analysis of fair use in light of the purposes of copyright, we must weigh Google's superseding and commercial uses of thumbnail images against Google's significant transformative use, as well as the extent to which Google's search engine promotes the purposes of copyright and serves the interests of the public. Although the district court acknowledged the "truism that search engines such as Google Image Search provide great value to the public," the district court did not expressly consider whether this value outweighed the significance of Google's superseding use or the commercial nature of Google's use. The Supreme Court, however, has directed us to be mindful of the extent to which a use promotes the purposes of copyright and serves the interests of the public.

---

I will also draw attention to:

> The fact that Google incorporates the entire Perfect 10 image into the search engine results does not diminish the transformative nature of Google's use. As the district court correctly noted, we determined in Kelly that even making an exact copy of a work may be transformative so long as the copy serves a different function than the original work.


Is there an OSS license that specifically precludes its use in LLMs or effectively does so?


The thing about fair use is that there’s nothing a license can do to prevent it. After all, that’s the whole point of fair use: to say that there’s valid reasons to use pieces of IP without regards to their licenses.

So, if the courts find in Microsoft and OpenAI’s favor (which remains to be seen despite the many armchair lawyers here), your license would mean jack squat.


They don't aim to. The problem is really just accreditation. If you copy a chunk of code yourself, chances are the original author was perfectly happy for you to do that, and you put their name somewhere in your credits. Copilot copies the same code, but scrubs the original author. It may also be copying code that was not OK to copy, but that's a separate, even worse issue.


Breaking news: litigant wants to win lawsuit.

They probably didn't rigorously track the licensing issue, but I'm pretty sure training an LLM is a completely acceptable use of source under freely licensed code. It would be somewhat amusing, though, if Copilot is forced to spit out the license for every piece of code used to develop the derivative work, along with copyright notices and whatever else the licenses may require.


That's the point though: if you recreate the code, you need to follow its license, which typically involves some kind of attribution. Copilot should be forced to spit out a list of all licenses it referenced. That would actually be pretty valuable.


Furthermore, the language model itself is clearly a for-profit derivative work, and so would be subject to the wants of the original copyright owners. It is clearly a derivative work because, without the copyrighted code in its training inputs, it would be different and likely less effective.

There's a more interesting question about the copyright status of the code it outputs, since the language model is sort of like a compiler, but also not like a compiler since the output is based on other people's copyrighted code.

I feel a lot of people get caught up on the output code and completely ignore the fact that copilot itself is likely a massive copyright violation.


To add on to this discussion, the scale matters too, and this is something many people tend not to factor in.

Copilot breaks the assumptions about the lossy nature of human memorization, so a lawsuit challenging the merits of the activity is at least warranted.


It is absolutely not clear that an ML model is a derivative work. It might be for-profit but there's good arguments that it is incredibly transformative, and that each individual work the model is trained on is minimally important to the model (if you trained the model on every other document in the training set except the one being sued over, the model would perform very similarly). These are factors which will weigh against the copyright holder.


Hold on, there is a difference between “recreate” and “copy”. Copyright only applies to creative expressions. If the code is trivially “recreated”, it’s not particularly creative.

Copyrighted content can be used without the holder’s permission under “Fair Use”.

Don’t assume all code can be copyrighted. Purely functional expressions are not copyrightable. Code is math.

There’s a lot here to unpack.


By the Curry-Howard correspondence no code should be able to be copyrighted since every program is a formal mathematical proof. However judges aren't usually mathematicians with a background in Computing Science so it's of little consequence.


No algorithm should be copyright-able but your expression of that algorithm should. Programming language choice, variable names, comments, code-style, etc are all creative expressions which are relatively independent of the underlying math.


Are they though? They are minor variations in technical language to describe the same mathematical reality.


Wait, are you saying that there is legal precedent for training an LLM with open source code to generate proprietary code?


That depends. Odds are good some GPL code slipped in somewhere, so using the GPL for the whole thing is an option in that case. And sure you can derive proprietary code from GPL code, so long as you don't publish binaries.


I would point to the Oracle vs. Google Supreme Court decision.

https://www.cnn.com/2021/04/05/tech/google-oracle-supreme-co...

> Writing for the Court, Breyer said that while it is difficult to apply traditional copyright concepts in the context of software programming, Google copied “only what was needed to allow users to put their accrued talents to work in a new and transformative program.”

> A world where Oracle was allowed to enforce a copyright claim, Breyer added, “would risk harm to the public” because it would establish Oracle as a new gatekeeper for software code others wanted to use.

The fair use tests that were used in the SCOTUS case, I believe, would fall on the side of "developers using GPT or Copilot to generate code do not generate substantial parts of the code and are below the amount of work needed to show sufficient creativity in writing it."

The example is https://horstmann.com/unblog/2010-11-15/NodePolicyImpl.html

If that is not a copyright violation and is considered to be fair use, then the code generated by GPT or Copilot likely also falls in the same bucket.

I don't necessarily agree with that, but that's my reading of the tea leaves.


I'm not so sure whether it's completely acceptable to train an LLM on GPL code, for example. To bring the point home, reverse engineering efforts follow the clean-room design technique. This is done in an effort to not infringe copyrights.

Would love to see this being done on decompiled proprietary code. Training done on it. And released into the wild.

But the amount of data necessary, and computing power to do it might not be available for the common person.


As you describe, a perfectly acceptable outcome is that licenses are respected with respect to attribution and where necessary propagation to derived works.


Reading a source then writing your own code with the same ideas in mind isn't a derived work and shouldn't need attribution. It will be a crying shame if Copilot output is mired in unwarranted legal trouble with how much of a productivity booster it is.


All intellectual property law is a tradeoff between efficiency and monetizability. It would be a huge productivity booster to copy/paste source code wholesale as well. Or to implement existing patents instead of having to work around them. But then the market for software would look very different than it does today.

If that's the future people want, that's fine - but everyone should play by the same rules.


When a computer is fed human-developed code and then reproduces it verbatim elsewhere (which copilot has been shown to do) it is called “copying”.


Well that is indeed the legal question to be answered. The post I replied to asserted the derivative nature.

The ability to copyright-launder via an API will lead to some interesting consequences for sure: I wouldn’t want to be Elasticsearch or MongoDB relying on source-available licensing if it comes about.


I think a more appropriate IP model for code is closer to how patents work. The code must be filed and becomes available to/usable by the public after the protection expires.


Maybe that is more appropriate but it is not the reality of today.


I thought everyone knew the person that reads the code can only describe it to someone else who writes the new code. Having the same party do it would make it a copy, if the purpose and construction is the same.

At least that's how Compaq beat IBM and started all this monkey business.

Granted I could be full of it, I wasn't alive yet.


> I thought everyone knew the person that reads the code can only describe it to someone else who writes the new code. Having the same party do it would make it a copy, if the purpose and construction is the same.

The same party can do it and not have it count as a copy, but it could be a copy if the party was not careful. So a company that wants to avoid potentially being sued will not allow one party to do both. So jimmaswell is correct to my understanding but a company may want extra legal armor/padding.


[flagged]


And the people who decided to go ahead with implementing it probably need help filing a motion in court. Are you just railing against specialization?


Yeah, because thinking it's ridiculous that people who don't know how to open email decide the future of software cases is railing against specialization :/



