Hacker News

Why can't we train a model only on public-domain materials and, for anything under copyright, only on materials where the rights-holders have granted permission?


Because copyright lasts 75+ years, so there is relatively little material in the public domain.

Why should training be subject to copyright, and at what stages of the process? People learn most of what they know from copyrighted media; giving copyright owners more control over that might be a bad idea.

AI companies have recently been paying content owners for access (partially to reduce legal risk, partially to get access to material that isn't publicly available), but giving deep-pocketed incumbents a monopoly might be a bad idea.


Because training intrinsically involves making a copy.

Perhaps requiring that deep-pocketed companies actually compensate copyright holders would be a starting point for a fairer system?


Mostly I see two outcomes. Either the holders "lose" and it basically follows "human" rules, meaning you can train a model on any (legally obtained) material but there are restrictions on use and regurgitation. I say "human" rules because that's basically how people work: any artist or writer worth their salt has been trained on gobs of copyrighted material, and you can already hire people to use their knowledge to violate copyright.

The other option is that the holders "win" and these models must only be trained on owned material, in which case the market will collapse into a handful of models controlled by companies that already own huge swaths of intellectual property: think DisneyDiffusion or RandomHouse-LLM. Nobody is getting paid more, but it's all above board since it's been trained on data they have rights to. You might see some holders benefit if they have a particularly large and useful dataset, like Reddit or the Wall Street Journal.


Both no?

People with power and money can get paid. Artists who have no reach or recognition get exploited, especially those from countries outside North America and Europe.


Should we extend that model to what university professors can teach?

Should we extend that model to text books? If I learn about a topic from a book, can I never write a book of my own on that topic?

Should we extend that model to the web? If I learned CSS and JavaScript reading StackOverflow, am I banned from writing a book, giving classes, or indeed even answering questions on those topics?

I ask this in seriousness. I get that LLM training is new, and it's causing concerns. But those concerns have existed forever - the dissemination of information has been going on a long time.

I'm sure the same moral panic existed the first time someone started making marks in clay to describe what is the best time of year to plant the crops.


No, because none of those things involve creating a copy on a computer which will then be regurgitated without acknowledgement of what went before, and without any sort of compensation to the previous rights holder.

Time was if a person read multiple books to write a new text, they either purchased them, or borrowed them from a library which had purchased them, and then acknowledged them in a footnote or reference section.

At least one author noted that there was a concern that writing would lead to a diminishment of human memory/loss of oral tradition (Louis L'Amour in _The Walking Drum_).


> At least one author noted that there was a concern that writing would lead to a diminishment of human memory/loss of oral tradition (Louis L'Amour in _The Walking Drum_).

Really? I can't tell if you're joking, so I'll take it at face value.

See, I associate the earliest famous (I thought) expression of that concern with Plato, and before today I couldn't remember any other associated details enough to articulate them with confidence. ChatGPT tells me, using the above quote without the citation as a prompt, that it was in Plato in his dialogue Phaedrus, and offers additional succinct contextual information and a better quote from that work. I probably first learned to associate that complaint about writing with Plato in college, and probably got it from C.D.C. Reeve, who was a philosophy professor and expert on Plato at the college I attended. But I feel no need to cite any of Reeve's works when dropping that vague reference. If I were to use any of Reeve's original thoughts related to analysis of Plato, then a reference would be merited.

It seems to me that there are different layers of abstraction of knowledge and memory, and LLMs mostly capture and very effectively synthesize knowledge at layers of abstraction that are above that of grammar checkers and below that of plagiarism in most cases. It's true that it is the nature of many of today's biggest transformers that they do in some cases produce output that qualifies as plagiarism by conventional standards. Every instance of that plagiarism is problematic, and should be a primary focus of innovation going forward. But in this conversation no one seems to acknowledge that the bar has been moved. The machine looked upon the library, and produced some output, therefore we should assume it is all theft? I am not persuaded.



