I remember reading that LLMs have consumed the internet's text data; I seem to remember there's an open dataset for that too. Other potential sources of data would be images (probably already consumed) and videos — YouTube must have an enormous amount of data to consume — and perhaps Facebook or Instagram private content.
But even with all that, it doesn't feel like AGI. This seems like the "fusion reactors are 20 years away" argument, except here the claim is that it's coming in 2 years — and they haven't even got the core technology for how to build AGI.
The big step was having it reason through math problems that weren't in the training data. Even now, with web search, it doesn't need every article in the training data to do useful things with it.