
I'm not sure training on a vast amount of content is really necessary, in the sense that linguistic competence and knowledge can probably be separated to some extent. That is, the "ChatGPT" paradigm leads to systems that just confabulate and "make shit up"; making something radically more accurate means moving to something retrieval-based or knowledge-graph-based.
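Roughly what I have in mind, as a toy sketch: the model only phrases the answer, while the facts come from an external store. TF-IDF stands in for a real index, and small_lm is a hypothetical placeholder for a small, linguistically competent model:

    # Minimal retrieval-grounded answering sketch. The knowledge store
    # and the `small_lm` call are placeholders, not a real API.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # Toy knowledge store; in practice a document index or knowledge graph.
    FACTS = [
        "The Eiffel Tower is 330 metres tall.",
        "Water boils at 100 degrees Celsius at sea level.",
        "Python was first released in 1991.",
    ]

    vectorizer = TfidfVectorizer()
    fact_matrix = vectorizer.fit_transform(FACTS)

    def retrieve(query: str, k: int = 1) -> list[str]:
        """Return the k facts most similar to the query (TF-IDF cosine)."""
        q = vectorizer.transform([query])
        scores = cosine_similarity(q, fact_matrix)[0]
        top = scores.argsort()[::-1][:k]
        return [FACTS[i] for i in top]

    def answer(query: str) -> str:
        """Ground the generation step in retrieved facts, instead of
        letting the model confabulate from its weights."""
        context = " ".join(retrieve(query))
        # In the real version a small model rephrases the context:
        # return small_lm(f"Context: {context}\nQuestion: {query}")
        return f"Based on: {context}"

    print(answer("How tall is the Eiffel Tower?"))

The point is that the accuracy ceiling is set by the store, not by how much text the model memorized.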

In that case you might be able to get linguistic competence from a much smaller model, trained on a smaller, cleaner, and probably partially synthetic data set.
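For the synthetic part, even something as dumb as templating structured facts into varied phrasings would buy the small model linguistic variety without scraping the web. The triples and templates here are made up for illustration:

    # Toy synthetic-corpus generator: linguistic variety comes from the
    # templates, factual content from the (hypothetical) triples.
    import random

    TRIPLES = [
        ("Paris", "capital_of", "France"),
        ("Tokyo", "capital_of", "Japan"),
    ]

    TEMPLATES = [
        "{s} is the capital of {o}.",
        "The capital of {o} is {s}.",
        "Did you know that {s} is {o}'s capital?",
    ]

    def synthesize(n: int = 5) -> list[str]:
        """Sample n templated sentences from the fact triples."""
        out = []
        for _ in range(n):
            s, _, o = random.choice(TRIPLES)
            out.append(random.choice(TEMPLATES).format(s=s, o=o))
        return out

    for line in synthesize():
        print(line)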


