I think the weird thing about this is that it's completely true right now, but in X months it may be totally outdated advice.
For example, efforts like OpenMoE https://github.com/XueFuzhao/OpenMoE or similar will probably eventually lead to very competitive performance and cost-effectiveness for open-source models, at least in terms of competing with GPT-3.5 for many applications.
I also believe that within say 1-3 years there will be a different type of training approach that does not require such large datasets or manual human feedback.
> I also believe that within say 1-3 years there will be a different type of training approach that does not require such large datasets or manual human feedback
This makes a lot of sense. A small model that "knows" enough English and a couple of programming languages should be enough to replace something like Copilot, use plug-ins, or do RAG over a substantially larger dataset (roughly sketched below).
The issue right now is that, to get a model that can do those things, current training algorithms still need massive amounts of data, far more than the end user actually needs.
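A minimal sketch of that small-model-plus-retrieval pattern, in case it helps. The model names, documents, and helper functions here are all illustrative placeholders, not a recommendation of any specific stack:

    # Minimal retrieval-augmented generation sketch: a small local model answers
    # from retrieved context instead of having to memorize the whole corpus.
    # Model names and documents below are placeholders for illustration only.
    import numpy as np
    from sentence_transformers import SentenceTransformer
    from transformers import pipeline

    documents = [
        "The foo() helper retries failed requests up to three times.",
        "Configuration lives in config.yaml under the 'service' key.",
    ]

    embedder = SentenceTransformer("all-MiniLM-L6-v2")  # small embedding model
    doc_vecs = embedder.encode(documents, normalize_embeddings=True)

    def retrieve(query, k=1):
        # Cosine similarity via dot product of normalized vectors.
        q = embedder.encode([query], normalize_embeddings=True)[0]
        idx = np.argsort(doc_vecs @ q)[::-1][:k]
        return [documents[i] for i in idx]

    generator = pipeline("text-generation", model="gpt2")  # stand-in for any small local LM

    def answer(query):
        context = "\n".join(retrieve(query))
        prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
        return generator(prompt, max_new_tokens=50)[0]["generated_text"]

    print(answer("Where does configuration live?"))

The point is that the generator only has to be fluent enough to read the retrieved context, so it can be far smaller than a model that has to carry all the knowledge in its weights.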
> I also believe that within say 1-3 years there will be a different type of training approach that does not require such large datasets or manual human feedback.
I guess if we ignore pretraining, doesn't sample-efficient fine-tuning on carefully curated instruction datasets sort of achieve this? LIMA and OpenOrca show some really promising results to date.
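For anyone curious what that looks like mechanically, here is a rough sketch of LIMA-style supervised fine-tuning on a tiny curated instruction set. The base model, prompt format, data, and hyperparameters are placeholders, not the actual setups from those papers:

    # Sketch of sample-efficient supervised fine-tuning on a small curated
    # instruction dataset (LIMA-style). Everything here is a placeholder.
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              Trainer, TrainingArguments)

    model_name = "gpt2"  # stand-in for whatever base model is being tuned
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(model_name)

    # A curated dataset is tiny: on the order of ~1k high-quality pairs.
    pairs = [
        ("Explain what a mixture-of-experts layer is.",
         "It routes each token to a small subset of expert sub-networks..."),
    ]

    def to_features(prompt, response):
        text = f"### Instruction:\n{prompt}\n\n### Response:\n{response}"
        enc = tokenizer(text, truncation=True, max_length=512,
                        padding="max_length", return_tensors="pt")
        labels = enc.input_ids[0].clone()
        labels[enc.attention_mask[0] == 0] = -100  # ignore padding in the loss
        return {"input_ids": enc.input_ids[0],
                "attention_mask": enc.attention_mask[0],
                "labels": labels}

    train_data = [to_features(p, r) for p, r in pairs]

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="sft-out", num_train_epochs=3,
                               per_device_train_batch_size=1),
        train_dataset=train_data,
    )
    trainer.train()

It doesn't remove the need for pretraining, but the fine-tuning stage itself needs only a small, carefully chosen dataset rather than manual feedback at scale.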
DistilBERT was trained from BERT. There might be an angle in using another model to train your model, especially if you're trying to get something to run locally.
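For reference, the core of that idea is knowledge distillation: train the small student to match the large teacher's softened output distribution. A bare-bones sketch with placeholder models and data (DistilBERT's full recipe also adds masked-LM and cosine-embedding losses on top of this):

    # Bare-bones knowledge-distillation step: the student learns to match the
    # teacher's softened output distribution. Models and data are placeholders.
    import torch
    import torch.nn.functional as F

    teacher = torch.nn.Linear(128, 1000)   # stand-in for a large frozen model
    student = torch.nn.Linear(128, 1000)   # stand-in for a much smaller model
    optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)
    T = 2.0  # temperature softens the teacher distribution

    def distill_step(batch):
        with torch.no_grad():
            teacher_logits = teacher(batch)
        student_logits = student(batch)
        # KL divergence between softened distributions, scaled by T^2 as in Hinton et al.
        loss = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                        F.softmax(teacher_logits / T, dim=-1),
                        reduction="batchmean") * T * T
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        return loss.item()

    print(distill_step(torch.randn(8, 128)))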
> For example, efforts like OpenMoE https://github.com/XueFuzhao/OpenMoE or similar will probably eventually lead to very competitive performance and cost-effectiveness for open-source models, at least in terms of competing with GPT-3.5 for many applications.
Also see https://laion.ai/