It is quite likely that GPT-4 stacks one or even two sparsity approaches on top of each other (namely, coarse-grained Switch Transformer-style expert routing and fine-grained intra-tensor block sparsity), judging by the openly available research CVs of its contributors.
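For concreteness, here is a minimal PyTorch sketch of the coarse-grained idea: a Switch-style feed-forward layer where a learned router sends each token to exactly one expert, so per-token compute stays roughly constant while total parameter count scales with the number of experts. All names and sizes here are illustrative assumptions, not anything known about GPT-4.

    # Minimal sketch of Switch-style top-1 routing (coarse-grained sparsity).
    # Sizes and names are illustrative, not from any released model details.
    import torch
    import torch.nn as nn

    class SwitchFFN(nn.Module):
        def __init__(self, d_model=512, d_ff=2048, n_experts=8):
            super().__init__()
            self.router = nn.Linear(d_model, n_experts)  # per-token expert logits
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                              nn.Linear(d_ff, d_model))
                for _ in range(n_experts)
            )

        def forward(self, x):  # x: (n_tokens, d_model)
            gate, idx = torch.softmax(self.router(x), dim=-1).max(dim=-1)
            out = torch.zeros_like(x)
            for e, expert in enumerate(self.experts):
                sel = idx == e               # tokens routed to expert e
                if sel.any():
                    out[sel] = gate[sel, None] * expert(x[sel])
            return out  # each token paid for only one expert's FLOPs

    print(SwitchFFN()(torch.randn(16, 512)).shape)  # torch.Size([16, 512])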
Google, in collaboration with OpenAI, published an impressive tour de force in which they thoroughly developed and validated at scale a sparse transformer architecture applied to the general language modeling task: "Sparse is Enough in Scaling Transformers", https://arxiv.org/abs/2111.12763
This happened in November of 2021, and there is a public implementation of the architecture on Google's public GitHub.
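To make the fine-grained side concrete too, here is a toy sketch of intra-tensor block sparsity: zeroing whole tiles of a weight matrix so a kernel only has to touch a fraction of each tensor. This is a deliberately simplified, static magnitude-based illustration; the paper's actual mechanism (learned controllers choosing which parameter blocks are active per token) is more involved.

    # Toy intra-tensor block sparsity: keep only the highest-magnitude
    # (block x block) tiles of a weight matrix. Block size and keep ratio
    # are arbitrary choices for illustration.
    import torch

    def block_sparsify(w, block=32, keep=0.25):
        rows, cols = w.shape  # assumes both are divisible by `block`
        tiles = w.reshape(rows // block, block, cols // block, block)
        scores = tiles.abs().mean(dim=(1, 3))          # one score per tile
        k = max(1, int(keep * scores.numel()))
        thresh = scores.flatten().topk(k).values[-1]   # k-th largest score
        mask = (scores >= thresh)[:, None, :, None].to(w.dtype)
        return (tiles * mask).reshape(rows, cols)

    w = block_sparsify(torch.randn(256, 1024))
    print((w != 0).float().mean())  # ~0.25: only a quarter of the entries survive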
Surprisingly, for whatever reason, other up-and-coming players are still not releasing models trained with this approach, even though it promises a multiplicative payoff in inference economics.
One boring explanation is conservatism around NN training at scale, where a single training run costs O(yearly salary).
Let's hope the open-source side of things catches up.