"Formal equivalence" means very little for engineering, to be frank - the implementation is what matters. If I wanted to be snarky, I'd say that neural networks are "formally equivalent" to Fourier analysis, which is 200 years old. I see that the paper also proposes an implementation of linearized attention. Many others have done the same, but none of those implementations seems to have caught on (although FlashAttention at least makes attention O(n) in memory, if not in computation).
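
For anyone unfamiliar with what "linearized attention" usually means, here is a rough sketch of the generic trick. It is not the paper's implementation. The `elu(x) + 1` feature map, the non-causal setup, and all function names are my own assumptions for illustration. The idea is to replace softmax with a positive feature map so the matrix product can be reassociated and the n x n matrix is never built.

```python
import numpy as np

def feature_map(x):
    # elu(x) + 1: a positive feature map. This choice is an assumption
    # for illustration, not something taken from the paper under discussion.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def quadratic_attention(q, k, v):
    # Builds the full (n, n) similarity matrix: O(n^2) time and memory.
    scores = feature_map(q) @ feature_map(k).T
    return (scores @ v) / scores.sum(axis=1, keepdims=True)

def linear_attention(q, k, v):
    # Same quantity, reassociated as phi(Q) @ (phi(K).T @ V).
    # The intermediates are (d, d_v) and (d,), so cost is O(n) in sequence length.
    fq, fk = feature_map(q), feature_map(k)
    kv = fk.T @ v              # (d, d_v) summary of keys and values
    z = fk.sum(axis=0)         # (d,) normalizer summary
    return (fq @ kv) / (fq @ z)[:, None]

rng = np.random.default_rng(0)
n, d = 128, 16
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
print(np.allclose(quadratic_attention(q, k, v), linear_attention(q, k, v)))
```

The two functions agree only because softmax has been swapped for a kernel, and that swap is exactly where these methods tend to lose quality in practice.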