I remember reading that llm’s have consumed the internet text data, I seem to re...

Scrounger · 2025-08-08T03:35:20 1754624120

> I remember reading that llm’s have consumed the internet text data

Not just the internet text data, but most major LLM models have been trained on millions of pirated books via Libgen:

https://techcrunch.com/2025/01/09/mark-zuckerberg-gave-metas...

sharemywin · 2025-08-07T20:13:12 1754597592

the big step was having it reason through math problems that weren't in the training data. even now with web search it doesn't need every article in the training data to do useful things with it.

Ferrus91 · 2025-08-08T10:22:43 1754648563

This is using think time compute and reinforcement learning. I think this is going to plateau even faster than the initial LLM scaling though.