I got excited by reading the article about releasing the training data, went to ...

andy99 · 2025-11-21T20:15:28 1763756128

Isn’t this before any curation has happened? I looked at it, I can see why it looks bad, but if they’re really being open about the whole pipeline, they have to include everything. Giving them a hard time for it only promotes keeping models closed.

That said, I like to think that if it was my dataset I would have shuffled that part down in the list so it didn’t show up in the HF preview.

Oras · 2025-11-21T20:20:54 1763756454

Hard time? What value do adult video descriptions, views, and comments add to small (7B, 32B) models?

andy99 · 2025-11-21T20:26:16 1763756776

It says it’s Common Crawl. I interpret that to mean this is a generic web scrape dataset; presumably they filter out the stuff they don’t want before pretraining. You’d have to do some ablation testing to know what value it adds.

khimaros · 2025-11-21T22:09:07 1763762947

what if that's where they learned how to utilize the double entendre? hard times indeed.

logicchains · 2025-11-21T18:14:29 1763748869

Erotic fiction is one of the main use cases of such models.