> Thomson Reuters prevailed on two of the four factors, but Bibas described the fourth as the most important, and ruled that Ross “meant to compete with Westlaw by developing a market substitute.”
Yep. That's what people have been saying all along. If the intent is to substitute for the original, then copying is not fair use.
But the problem is that the current method for training requires this volume of data. So the models are legitimately not viable without massive copyright infringement.
It'll be interesting to see how a defendant with a larger wallet will fare. But this doesn't look good.
Though big-picture, it seems to me that the moneyed interests will ensure that even if the current legal landscape doesn't allow LLMs to exist, they will lobby HARD until it is allowed. This is inevitable now that it's at least partially framed in national security terms.
But I'd hope that this means there is a chance that if models have to train on all of human content, the weights will be available for free to all humans. If it requires massive copyright infringement on our content, we should all have an ownership stake in the resulting models.
>But the problem is that the current method for training requires this volume of data. So the models are legitimately not viable without massive copyright infringement.
Sure it is. It just requires what every other copyrighted work needs: permission and stipulations from the copyright holder. These aren't small-time bloggers on the internet, these are large-scale businesses.
>Though big-picture, it seems to me that the moneyed interests will ensure that even if the current legal landscape doesn't allow LLMs to exist, they will lobby HARD until it is allowed.
The only solace I take is that these conglomerates are paying a lot to take down the rules they made 30 years ago when they weren't the ones profiting from stealing. But yes, I'm still frustrated by the hypocrisy.
> Sure it is. It just requires what every other copyrighted work needs: permission and stipulations from the copyright holder.
Most other scenarios don't use millions/billions of works - that's the part which puts viability in question.
> these are large-scale businesses.
I'd like training models to also remain accessible to open-source developers, academic researchers, and smaller businesses. Large-scale pretraining is common even for models that are not cutting-edge LLMs.
> The only solace I take is that these conglomerates are paying a lot to take down the rules they made 30 years ago when they weren't the ones profiting from stealing
As far as I'm aware, most of the lobbying in favor of stricter copyright has been done by Disney, Universal, Time Warner, RIAA, etc.
Not to say that tech companies have a consistent moral stance beyond whatever's currently in their financial self-interest, but I think that self-interest has put them in a position of supporting fair use and copyright safe harbors, opposing link taxes, etc. more often than the other way around, with cases like Authors Guild v. Google being a significant win for fair use.
>Most other scenarios don't use millions/billions of works - that's the part which puts viability in question.
Yes, they do. We have acquisitions in the billions these days and exclusivity deals in the hundreds of millions. Let's not pretend these companies can't do this through normal channels. They just wanna steal because they think they can get away with it.
>I'd like training models to also remain accessible to open-source developers, academic researchers, and smaller businesses.
Same. But such models still need to be ethically sourced. Maybe there's not enough royalty-free content to compete with OpenAI, but it's pretty clear from DeepSeek that you don't need 82 TB of data to be effective. If we need that much data, there are clearly optimizations to be made.
>I think that self-interest has put them in a position of supporting fair use and copyright safe harbors,
Yet they will sue anytime their data is scraped or otherwise isn't making them money. Maybe they didn't put trillions into lobbying like others, but they definitely have their fair share of using copyright. Microsoft won a lawsuit against web scraping via LinkedIn less than a year before OpenAI fell into legal troubles over scraping the entire internet.
To clarify: veggieroll said training models wouldn't be viable, you said it'd just require licensing like everyone else already manages, I said most other cases don't use millions/billions of works, you're saying that yes they do?
I feel like there must be a misunderstanding here, because that doesn't make much sense to me. Even for making a movie, which I think would be the most onerous of traditional cases, the number of works you'd license would likely be in the dozens (couple of pop songs, some stock images, etc.) - not billions.
> Let's not pretend these companies can't do this through normal channels
I'm not sure that there really has been a normal channel for licensing at the scale of "almost everything on the public Internet". A compulsory licensing scheme, like the US has for cover songs, could make it feasible to pay into a pot - but again I'd really hope for model training to remain accessible to smaller players, as opposed to just "meh, OpenAI has billions".
> but it's pretty clear from DeepSeek that you don't need 82 TB of data to be effective.
As far as I'm aware, DeepSeek is not a low-data model. In fact, given China's more lax approach to copyright, I would not be surprised if the ability to freely pass around shadow libraries and large archives of torrented data without lawsuits was one of the factors contributing to their fast success relative to western counterparts.
> If we need that much data, there are clearly optimizations to be made.
I don't think this is necessarily a given - humans evolved on ~4 billion years worth of data, after all.
> Yet they will sue anytime their data is scraped or otherwise isn't making them money. Maybe they didn't put trillions into lobbying like others, but they definitely have their fair share of using copyright.
I believe lawsuits launched or fuss kicked up by model developers will typically be on a contract basis (i.e. "you agreed to our ToS, then broke it") rather than a copyright basis. Again, not to say these tech companies are acting in any way except their own self-interest, just that they've generally been more pro-fair-use than pro-strict-copyright on average, to my knowledge.
> I said most other cases don't use millions/billions of works, you're saying that yes they do?
I assumed we were talking about logistics, not tech. I'm sure it will be technically possible to use less training data over time (DeepSeek is more or less demonstrating that in real time. Maybe there's copyrighted data in there, but I'd be surprised if it used anything close to 80 TB like competitors).
I know hindsight is 20/20, but I always felt the earlier approaches were absurdly brute-forced.
>I'm not sure that there really has been a normal channel for licensing at the scale of "almost everything on the public Internet"
There isn't. So they'd need to do it the old-fashioned way, with agreements. Or create some incentive model where media companies submit their works with the understanding they'll be used for training. Or any number of marketing ideas.
I don't exactly pity their herculean effort. Those same companies spent decades suing individuals for much pettier uses (some arguably covered under fair use) and building up that precedent.
>and large archives of torrented data without lawsuits was one of the factors contributing to their fast success relative to western counterparts.
And now they're being slowed down, if not litigated out of the market. Public trust in AI is falling. The lack of oversight into hallucinations may have even cost a few lives. Content creators now need to take extra precautions against being stolen from, because the scrapers don't even bother trying to respect robots.txt. Even a few posts here on HN note how the scraping is so rampant that it can spike websites' hosting costs (so now we need more captchas, and I hate myself for uttering such a sentence).
Was all that velocity worth it? Who benefitted from this outside of a few billionaires? We can't even say we beat China on this.
>I don't think this is necessarily a given - humans evolved on ~4 billion years worth of data, after all
Humans inherit their data and slowly build structure around it. Maybe if AI models collaborated with each other as humanity did, I would sympathize more with this argument.
We both know it's instead a rat race, and the goal isn't survival and passing on knowledge (and genes) to the next generation. AI could evolve organically, but it instead devolved into a thieves' den.
I take an approach more like Bell's spaceship paradox. If they had started gathering data ethically, by the time they'd gathered a decent chunk they probably would have already optimized a model that needs less data. It'd be slower, but not actually much slower in the long run. But they aren't exactly trying to go for quality here.
>I believe lawsuits launched or fuss kicked up by model developers will typically be on a contract basis (i.e. "you agreed to our ToS, then broke it") rather than a copyright basis.
I suppose we'll see. Too early to tell. This lawsuit will definitely set precedent for other ongoing cases, but others may shift to a copyright infringement case anyway. Unlike other LLMs, there was some human tailoring going on here, so it's not fully comparable to something like the NYT case.
> I assumed we were talking about logistics, not tech
Still uncertain what you mean - the logistics of creating something? Logistics as in transporting goods? Either way I think veggieroll's point on viability still stands.
> DeepSeek is more or less demonstrating that in real time. Maybe there's copyrighted data in there, but I'd be surprised if it used anything close to 80 TB like competitors
* GPT-4 is reported to have been trained on 13 trillion tokens total - which is counting two passes over a dataset of 6 trillion tokens[0]
* DeepSeek-V3, the previous model that DeepSeek-R1 was fine-tuned from, is reported to have been pre-trained on a dataset of 14.8 trillion tokens[1]
Can't find any licensing deals DeepSeek have made, so the vast majority of that will almost certainly be unlicensed data - possibly from CommonCrawl and shadow libraries.
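As a rough sanity check (assuming ~4 bytes of text per token - only a rule of thumb, and the real ratio varies by tokenizer and language), those reported token counts put both datasets in the same tens-of-terabytes ballpark as the figure you mentioned:

    # Back-of-envelope only: BYTES_PER_TOKEN is an assumption, not a measured value.
    BYTES_PER_TOKEN = 4

    for name, tokens in [("GPT-4 (reported)", 13e12),
                         ("DeepSeek-V3 (reported)", 14.8e12)]:
        terabytes = tokens * BYTES_PER_TOKEN / 1e12
        print(f"{name}: ~{terabytes:.0f} TB of raw text")

    # GPT-4 (reported): ~52 TB of raw text
    # DeepSeek-V3 (reported): ~59 TB of raw text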
> > > Let's not pretend these companies can't do this through normal channels.
> > I'm not sure that there really has been a normal channel [...]
> There isn't.
Then, surely it's not just pretending?
A while back, as a side project, I'd had a go at making a tool to describe photos for visually impaired users. I contacted Getty to see if I could license images for model training, and was told directly that they don't license images for machine learning. Particularly given that I'm not a massive company, I just don't think there really are any viable paths at the moment except for using web-scraped datasets.
> So they'd need to do it the old-fashioned way, with agreements.
I'm sceptical of whether even the largest companies would be able to get sufficient data for pre-training models like LLMs from only explicit licensing agreements.
> I don't exactly pity their herculean effort. Those same companies spent decades suing individuals for much pettier uses (some arguably covered under fair use) and building up that precedent.
I feel you're conflating two groups: model developers that have previously been (on average) supportive of fair-use, and media companies (such as the ones currently launching lawsuits against model training) that lobbied for stronger copyright law. Both are acting in self-interest, but I'd disagree with the idea that there was any significant switching of sides on the topic of copyright.
> Content creators now need to take extra precautions against being stolen from, because the scrapers don't even bother trying to respect robots.txt.
The major US players claim to respect robots.txt[2][3][4], as does CommonCrawl[5] which is what the smaller players are likely to use.
You can verify that CommonCrawl respects robots.txt by downloading it yourself and checking.
If OpenAI/etc. are lying, it should be possible for essentially anyone hosting a website to prove it by showing access from one of the IPs they use for scraping[6]. (I say IPs rather than user-agent string because anyone can set their user-agent string to anything they want, and it's common for malicious or poorly-behaved actors to pretend to be a browser or a more common bot.)
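If you want to check what a site's robots.txt permits, here's a minimal sketch using only Python's standard library (example.com is a placeholder; GPTBot and CCBot are the publicly documented crawler names for OpenAI and CommonCrawl):

    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")  # substitute the site you host
    rp.read()  # fetches and parses the file

    for agent in ("GPTBot", "CCBot", "*"):  # "*" is the catch-all rule
        ok = rp.can_fetch(agent, "https://example.com/some/page")
        print(f"{agent}: {'allowed' if ok else 'disallowed'}")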
> Was all that velocity worth it? Who benefitted from this outside of a few billipnaires? We can't even say we beat China on this.
There's been a large range of beneficial uses for machine learning: language translation, video transcription, material/product defect detection, weather forecasting/early warning systems, OCR, spam filtering, protein folding, tumor segmentation, drug discovery and interaction prediction, etc.
I think this mainly comes back to my point that large-scale pretraining is not just for LLM chatbots. If you want to see the full impact, you can't just have tunnel-vision on the most currently-hyped product of the largest companies.
> Humans inherit their data and slowly build structure around it. Maybe if AI models collaborated with each other as humanity did, I would sympathize more with this argument.
Machine learning in general (not "OpenAI") is a fairly open and collaborative field. Source code for training/testing is commonly available to use and improve; papers documenting algorithms, benchmarks, and experiments are freely available; arXiv (Cornell University's open-access preprint repository) is the place for AI papers, opposed to paywalled journals; and it's very common to fine-tune someone's existing pretrained model to perform a new task (transfer learning) opposed to training from scratch.
I'd attribute a lot of the field's success to building off each others' work in this way. In other industries, new concepts like transformers or low-rank-adaptation might still be languishing under a patent instead of having been integrated and improved on by countless other groups.
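As a concrete illustration of that kind of reuse, here's a minimal transfer-learning sketch (torchvision's pretrained ResNet-18 is purely an example backbone, and the 10-class head stands in for a hypothetical new task):

    import torch
    import torch.nn as nn
    from torchvision import models

    # Start from a pretrained backbone instead of random weights.
    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

    # Freeze the pretrained weights; only the new head gets trained.
    for param in model.parameters():
        param.requires_grad = False

    # Replace the classifier head for a hypothetical 10-class task.
    model.fc = nn.Linear(model.fc.in_features, 10)
    optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)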
> AI could evolve organically, but it instead devolved into a thieves' den.
Unclear what you mean by organically - evolution still needs data.
> the current method for training requires this volume of data
This is one of those things that signal how dumb this technology still is - or maybe how smart humans are when compared to machines. A human brain doesn't need anywhere close to this volume of data, in order to be able to produce good output.
I remember talking with friends 30 years ago about how it was inevitable that the brain would eventually be fully implemented as machine, once calculation power gets big enough; but it looks like we're still very far from that.
> A human brain doesn't need anywhere close to this volume of data, in order to be able to produce good output.
Maybe not directly, but consider that our brains are the product of millions of years of evolution and aren't a blank slate when we're born. Even though babies can't speak a language at birth, they already have all the neural connections in place to acquire and manipulate language, and require just a few years of "supervised fine tuning" to learn the actual language.
LLMs, on the other hand, start with their weights at random values and need to catch up with those millions of years of evolution first.
Add to this, the brain is constantly processing raw sensory data from the moment it became viable, even when the body is "sleeping". It's using orders of magnitude more data than any model in existence every moment, but isn't generally deemed "intelligent" enough until it's around 18 years old.
It’s unlikely that sensory data contributes to cognitive ability in humans. People with sensory impairments, such as blind people, are not less cognitively capable than people without sensory impairments. Think of Helen Keller, who, despite taking in far less sensory information than the average person, was still more intelligent than average.
Without sensory data there cannot be actual cognitive ability, though there may be potential for it. The data doesn't have to be visual; bear in mind we have 5 senses. When vision is impaired, hearing becomes far more sensitive to compensate. And theoretically, if someone were to only have use of a single sense, they may still be able to use the data from it to actualize their cognition, but it would take a lot more effort and there would be large gaps in capability. Just as, technically, preprocessed vision* is the primary "sense" of LLMs.
* Preprocessed since the data is actually of 1D streams of characters, and not 2D colour points (as with vision models).
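To make the footnote concrete (a toy sketch only - a character-level encoding, not a real BPE tokenizer):

    # An LLM's "view" of text is a 1D sequence of integer IDs.
    text = "Hello"
    token_ids = [ord(c) for c in text]  # [72, 101, 108, 108, 111]

    # A vision model instead sees a 2D grid of colour values (a 2x2 image here).
    image = [[(255, 0, 0), (0, 255, 0)],
             [(0, 0, 255), (255, 255, 255)]]
    print(token_ids)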
Sadly, those weights will not be inherited the way a baby inherits ours. They'll be cooped up until the company dies, and that data probably dies with it. No wonder LLMs have allegedly hit some stalls already.
> A human brain doesn't need anywhere close to this volume of data, in order to be able to produce good output.
> I remember talking with friends 30 years ago
I'd say you're pretty old. How many years of training did it take for you to start producing good output?
The lesson here is we're kind of meta-trained: our minds are primed to pick up new things quickly by abstracting them and relating them to things we already know. We work in concepts and mental models rather than text. LLMs are incredibly weak by comparison. They only understand token sequences.
That's the point I think. It should be possible to require orders of magnitude less data to create an intelligence, and we are far from achieving that (including achieving AGI in the first place even with those huge amounts of data).
My point is it took a very large amount of data for a human to be able to "produce good output". Once it had, though, its performance was of a different stratum.
We are unbelievably far from that. Everyone who tells you that we're within 20 years of emulating brains and says stuff like "the human brain only runs at 100 hertz!" has either been conned by a futurist or is in denial of their own mortality.
Absolutely! But the question is whether the next step-change in intelligence is just around the corner (in which case, this legal speedbump might spur innovation). Or, will the next revolution take a while.
There's enough money in the market to fund a lot of research into totally novel underlying methods. But if it takes too long, investors and lawmakers will just move to make what already works legal, because it is useful.
> I remember talking with friends 30 years ago about how it was inevitable that the brain would eventually be fully implemented as machine, once calculation power gets big enough; but it looks like we're still very far from that.
Why would it be?
"It's inevitable that the Burj Khalifa gets built, once steel production gets high enough."
"It's inevitable that Pegasuses will be bred from horses, as soon as somebody collects enough oats."
Reducing intelligence to the bulk aggregate of brute "calculation power" is... Ironically missing the point of intelligence.
> So the models are legitimately not viable without massive copyright infringement.
Copyright is not about acquisition, it is about publication and/or distribution. If I get a copy of Harry Potter from a dumpster, I can read it. If a company gets a copy of all books from a torrent, they can use it to train their AI. The torrent providers may be in violation of copyright, and if the AI can be used to reproduce substantive portions of the original text, the AI companies may then be in violation of copyright, but simply training a model on illegally distributed text should not be copyright infringement.
> simply training a model on illegally distributed text should not be copyright infringement
You can train a model on copyrighted text, you just can't distribute the output in any way without violating copyright. (edit: depending on the other fair use factors).
One of the big problems is that training is a mechanical process, so there is a direct line between the copyrighted works and the model's output, regardless of the form of the output. Just on those terms it is very likely to be a copyright violation. Even if they don't reproduce substantive portions, what they do reproduce is a derived work.
If that mechanical process is not reversible, then it's not a copyright violation. For instance, I can compute the SHA256 hashes for every book in existence and distribute the resulting table of (ISBN, SHA256) and that is not a copyright violation.
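Concretely, something like this (a minimal sketch; the ISBN and file path are hypothetical placeholders):

    import hashlib

    def sha256_of_file(path: str) -> str:
        h = hashlib.sha256()
        with open(path, "rb") as f:
            # Read in 1 MiB chunks so large books needn't fit in memory.
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    # The resulting table reveals nothing about the books' contents.
    books = {"978-0-00-000000-0": "some_book.txt"}  # hypothetical placeholders
    table = {isbn: sha256_of_file(path) for isbn, path in books.items()}
    print(table)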
That's actually within the other fair use factors. So your hash table is fair use because it's transformative and doesn't substitute for the original work.
It's actually even less than fair use, it's non-copyright use: one-way hashes are intentionally designed to eliminate the creative element and output random looking data.
>One of the big problems is that training is a mechanical process, so there is a direct line between the copyrighted works and the model's output, regardless of the form of the output. Just on those terms it is very likely to be a copyright violation. Even if they don't reproduce substantive portions, what they do reproduce is a derived work.
Google making thumbnails or scanning books are both arguably "mechanical". Both have been ruled as fair use.
What’s a “mechanical process”? If I read The Lord of the Rings and it teaches me to write Star Wars, is that a mechanical process? My brain is governed by the laws of physics, right?
What if I’m a simulated brain running on a chip? What if I’m just a super-smart human and instead of reading and writing in the conventional way, I work out the LLM math in my head to generate the output?
That's an interesting take, but false in a lot of jurisdictions. Even if we ignore the question of whether the model can distribute the work, in many places even downloading content is illegal. Otherwise the person torrenting a movie would be totally in the clear, or think about what MS would say if a company "just" downloads copies of Windows to use on their computers without ever distributing them.
>Otherwise the person torrenting a movie would be totally in the clear
Any examples of people being sued for merely downloading? "Torrenting" basically always involves uploading, even if you stop immediately after completion. A better test would be if someone was sued for using an illegal streaming site, which to my knowledge has never happened.
I mean, you're right in the abstract. If you train an LLM in a void and never do anything with the model, sure.
But that's not what anyone is doing. People train models so that someone can actually use them. So I'm not sure how your comment is helpful other than to point out that distinction (which doesn't make much difference in this case specifically, or in how copyright applies to LLMs in general).
If you buy a machine that prints copies of copyrighted books (built into the machine), and you use that machine and then distribute the resulting copies, and the machine didn't come with a license allowing you to do so, I'm pretty sure that you are liable as well.
At least some current AI providers, however, come with terms of service that promise that they will cover any such legal disputes for you.
You might not be immediately liable, but that doesn't mean you're allowed to continue. I'd assume it's your duty to cease and desist immediately once it's pointed out that you're in violation.
Well, I think that will be the final judgement. We'll treat training data more as distribution than as consumption. Things always get more complicated when you put stuff up for sale. I also can't necessarily get away with making "Garry Botter", who got accepted into an Enchanter school and goes on adventures with Jon and Germione. Unless it's parody, you can only cut so close before you're just infringing anyway, despite making it legally distinct.
> Copyright is not about acquisition, it is about publication and/or distribution. If I get a copy of Harry Potter from a dumpster, I can read it. If a company gets a copy of all books from a torrent, they can use it to train their AI.
"a person reading" and "computer processing of data" (training) are not the same thing
MDY Industries, LLC v. Blizzard Entertainment, Inc. held that loading unlicensed copyrighted material from disk into memory was "copying", and hence copyright infringement.