
In my screed, N is the attention width (how many tokens it looks at at a time). The number of parameters is O(K×N×N×L), where K is the vector size of your tokens and L is the number of layers. There are other parameters floating around, like in the encoder and decoder matrices, but the N×N matrix dominates.
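A minimal sketch of the N×N term the comment is pointing at: every token attends to every other token, so the attention-score matrix is N×N, and the work per layer grows like K·N·N. The variable names and toy sizes below are illustrative assumptions, not any real model's configuration.

```python
import numpy as np

# Toy sizes (assumptions for illustration only):
# N = attention width (tokens attended to at once),
# K = token vector size, L = number of layers.
N, K, L = 8, 4, 2

rng = np.random.default_rng(0)
tokens = rng.standard_normal((N, K))  # one layer's input: N tokens of dim K

# Self-attention scores: each of the N tokens is compared against all N
# tokens, giving an N x N matrix -- the quadratic term in the comment.
scores = tokens @ tokens.T            # shape (N, N)
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
assert weights.shape == (N, N)

# Across L layers the attention work scales like O(K * N * N * L).
total_flops_estimate = K * N * N * L
print(total_flops_estimate)  # -> 512 for these toy sizes (4 * 8 * 8 * 2)
```

Doubling the attention width N quadruples the N×N term while only doubling the per-token work, which is why that matrix dominates as context grows.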


This is an awesome explanation. You guys are the real heroes



