There's definitely scientific insight and analysis.
E.g. "In-context Learning and Induction Heads" is an excellent paper.
Another paper ("ROME") https://arxiv.org/abs/2202.05262 formulates hypothesis over how these models store information, and provide experimental evidence.
The thing is, a 3-layer MLP is basically an associative memory + a bit of compute. People understand that if you stack enough of them, you can compute or memorize pretty much anything.
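To make that concrete, here's a minimal numpy sketch of the key-value reading of an MLP block (the view the ROME line of work builds on): rows of the first weight matrix act as "keys" matched against the input, and rows of the second hold the "values" written back for keys that fire. All names and dimensions are illustrative, not anyone's actual model.

    import numpy as np

    def mlp_block(x, W_in, W_out):
        # Rows of W_in act as "keys": scores measure how well the
        # input matches each key.
        scores = x @ W_in.T
        # ReLU gating: only positively-matched keys fire.
        gates = np.maximum(scores, 0)
        # Rows of W_out act as "values": output is a weighted sum
        # of the values for the keys that fired.
        return gates @ W_out

    # Toy shapes: model dim 4, hidden dim 8 (purely illustrative).
    rng = np.random.default_rng(0)
    x = rng.normal(size=(1, 4))
    W_in = rng.normal(size=(8, 4))   # 8 keys
    W_out = rng.normal(size=(8, 4))  # 8 values
    print(mlp_block(x, W_in, W_out).shape)  # (1, 4)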
Attention provides information routing. Again, that is pretty well-understood.
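Concretely, "routing" here is just scaled dot-product attention: each position computes a softmax distribution over positions and uses it to mix their value vectors. A sketch of the standard formulation in numpy (shapes are illustrative):

    import numpy as np

    def attention(Q, K, V):
        # Query-key similarities, scaled by sqrt(head dim) to keep
        # the softmax well-behaved.
        scores = Q @ K.T / np.sqrt(Q.shape[-1])
        # Row-wise softmax: one routing distribution per query position.
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        # Mix value vectors according to the routing weights.
        return w @ V

    # Toy usage: 3 positions, head dim 4.
    rng = np.random.default_rng(1)
    Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))
    print(attention(Q, K, V).shape)  # (3, 4)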
The rest is basically finding an optimal trade-off. These trade-offs are based on insights from experimental data.
So this architecture is not so much accidental as it is general.
Specific representations used by MLPs are poorly understood, but there's definitely progress on understanding them from first principles by building specialized models.
E.g. "In-context Learning and Induction Heads" is an excellent paper.
Another paper ("ROME") https://arxiv.org/abs/2202.05262 formulates hypothesis over how these models store information, and provide experimental evidence.
The thing is, a 3-layer MLP is basically an associative memory + a bit of compute. People understand that if you stack enough of them you can compute or memorize pretty much anything.
Attention provides information routing. Again, that is pretty well-understood.
The rest is basically finding an optimal trade-off. These trade-off are based on insights based on experimental data.
So this architecture is not so much accidental as it is general.
Specific representations used by MLPs are poorly understood, but there's definitely a progress on understanding them from first principles by building specialized models.