I am pretty sure they do; this data is just too valuable. At least Meta admitted using a dataset called "books3", which contains ~200k pirated ebooks, for Llama 1 and 2 [1].
Anna's Archive provides datasets for LLM training, but who knows who they are working with.
I also wonder if Google is using their own dataset from books.google.com.
It took me an embarrassingly long time to realize that it was a joke when people said "It's always the last place you look." Like well into my teens. But ever since I figured it out, I always look at least one more place after finding something.
I think that there are many people who don't recognize it as a joke and pass it on as great wisdom. I was also pretty old when I realized the tautology.
If you're disorganized, you'll search in random places until you find it. Joke applies.
But if you are organized, you'll start with the most likely place and progress to increasingly less likely places. When you find it, there's no surprise, and no one gets much of a chuckle over your efforts.
If you don't understand the problem space, then saying "That's the last place I would have looked!" is an expression of exasperation about your lack of knowledge.
A tautology can still provide an insight by recognizing that two things are really just the same. In that sense I didn’t perceive it as a joke, even though there’s of course some humor attached to it.
Australian parody news site The Chaser just announced [1] that they are going to put up a paywall to avoid having their content scraped for AI training. They feel like ChatGPT already "is a more competent writer of satire than most of the people we’ve worked with". I guess they are too late.
You can probably build a kind of acoustic signature of the whole at various different points in the boundary and check each build is sufficiently close to that signature.
I haven't seen anything like it in Europe either. Helsinki's library is incredible. It's also architecturally one of the most prominent buildings and right next to the parliament. It doesn't have a lot of books, though (100k).
[1] https://torrentfreak.com/meta-admits-use-of-pirated-book-dat...