You are right, but it's a little misleading (since it implies that's where the usefulness of this work lies nowadays). Comparing the language-modelling prowess of BERT/T5 against the default, non-instruct GPT-3 or OPT isn't really that useful if done by size, because in practice we don't use 1.3B generative models, and more importantly because focusing on default decoding generation without an instruct/PPO step is not how these models are actually used. The instruct models blow this performance out of the water, and instruct tuning plus better performance-at-size for GPT models shows, in my opinion, the dominance of decoder-only architectures for now.
I think you have to consider that in 2020/2021 many PhDs and professors attempted to shift grant-funded research with BERT and T5 to explore how those models could compete with GPT-3, or to show other properties where they supposedly outdid GPT-3. Very few (besides sentence transformers) succeeded. It's not like this is an unexplored niche. A lot of people in denial kept up BERT research for a while despite the fact that their work had essentially been made obsolete by GPT-3.
(And notably Table 1 and Figure 4 cherry-pick the smallest size with the largest gaps in task performance, the 1.3B parameter mark, a size at which we know decoder-only generation is not performant. The conclusions the authors come to (wow, BERT is trained on less data but does better!) obviously can't be made at larger sizes, because the actual GPT models become much larger.)