Train an LLM on textbooks and other legal books; you do not need to train it on pop culture to make it intelligent.
For face generation you might need to be more creative, but you should not need millions of images stolen from social media to train your model.
But it makes sense that tech giants do not want to share their datasets or be transparent about this stuff.
Without licenses to the books, they are just as illegal as web content (and maybe even more so).
There are books that are out of copyright, and also free books.
Copyright sucks.