Stretching copyright law to cover more and more things like software - and now AI models - which is essentially the status quo, makes little sense.
What is needed instead (I doubt politicians read HN, but someone go and tell them) is a new law that regulates the training of these models, if we want them to exist and be used in a legally safe way. This is needed, for example, because copyright law differs from one jurisdiction to another, while software travels globally.
It would make sense to make all books available for non-commercial, perhaps even commercial, R&D in AI, if society decided that was beneficial, in the same way that publishers must deposit one copy of each new work in a legal deposit library (the Library of Congress in the US; the Oxford and Cambridge university libraries and the British Library in the UK; the Nationalbibliothek sites in Frankfurt and Leipzig for Germany; etc.). Just add a provision that publishers also send a plain-text copy to the Linguistic Data Consortium (LDC), which already manages datasets for NLP. As with fair use, there could be compensation schemes that run automatically in the background (in some countries the price of a photocopier includes a levy that gets passed on to copyright holders).
Otherwise you'll end up with an LLM that is legal in one country but illegal in another because more than 15% of one book was in the training data, and other messy situations.