Creating better datasets would also help to improve the performance of the model...

Creating better datasets would also help to improve the performance of the models, I would assume. Unfortunately, the costs to produce high-quality datasets of a sufficient size seem prohibitive today.

I'm hopeful this will be possible in the future though, maybe using a mix of 1) using existing LLMs to help humans filter the existing internet-scale datasets, and/or 2) finding some new breakthroughs to make model training more data efficient.