Yeah, there are concepts in programming and math that are mostly self-teachable from first principles, and then there's material that reads like gibberish because it's too new to have been distilled into something tractable yet. I'd say arrays and matrices are straightforward to understand, while tensors are not, so I'm disappointed that so much of the literature currently revolves around tensors. Same for saying "embedding" instead of just vector representation, etc.
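For what it's worth, the scary words mostly bottom out in things I already knew: in deep-learning code a tensor is just an n-dimensional array, and an embedding is just a row pulled out of a matrix. A toy NumPy sketch (the names and sizes here are mine, purely for illustration):

    import numpy as np

    # "Tensor" is just the generic word for an n-dimensional array.
    vector = np.array([1.0, 2.0, 3.0])        # rank 1: a plain array
    matrix = np.eye(3)                        # rank 2: a matrix
    tensor = np.zeros((2, 3, 4))              # rank 3: what the literature calls a tensor

    # An "embedding" is just a vector representation: a row lookup in a matrix.
    vocab_size, dim = 5, 3                    # toy sizes
    table = np.random.randn(vocab_size, dim)  # one learned vector per token
    embedding = table[2]                      # the "embedding" of token 2 is a plain vector
    print(tensor.shape, embedding.shape)      # (2, 3, 4) (3,)

The objects are familiar; only the vocabulary is new.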
It helps me to think in terms of levels of abstraction rather than complexity. My education stopped at a 4-year degree, but AI is still mostly taught at the postgraduate level, so I have to translate it into terms I already know because I haven't internalized the lingo.
Here's the most approachable teaching of neural nets (NNs) and large language models (LLMs) that I've seen so far (a slice of its table of contents, with links at the end):
II A strange land 105
7 Convolutional layers 107
..
7.1.3 Translational equivariant layers 112
..
9 Scaling up the models 143
..
9.3 Dropout and normalization 151
9.3.1 Regularization via dropout 152
9.3.2 Batch (and layer) normalization 156
III Down the rabbit-hole 167
10 Transformer models 169
10.1 Introduction 169
10.1.1 Handling long-range and sparse dependencies 170
10.1.2 The attention layer 172
10.1.3 Multi-head attention 174
10.2 Positional embeddings 177
10.2.1 Permutation equivariance of the MHA layer 177
10.2.2 Absolute positional embeddings 179
10.2.3 Relative positional embeddings 182
10.3 Building the transformer model 182
10.3.1 The transformer block and model 182
10.3.2 Class tokens and register tokens 184
11 Transformers in practice 187
11.1 Encoder-decoder transformers 187
11.1.1 Causal multi-head attention 188
11.1.2 Cross-attention 189
11.1.3 The complete encoder-decoder transformer 190
11.2 Computational considerations 191
11.2.1 Time complexity and linear-time transformers 191
11.2.2 Memory complexity and the online softmax 192
11.2.3 The KV cache 194
11.2.4 Transformers for images and audio 194
11.3 Variants of the transformer block 197
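To give a sense of how far that material boils down, the attention layer in chapter 10 fits in about a dozen lines of NumPy. This is my own minimal sketch of scaled dot-product attention, not code from the book:

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)   # shift for numerical stability
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def attention(Q, K, V):
        # Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V
        d = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d)             # how strongly each query matches each key
        weights = softmax(scores, axis=-1)        # one probability distribution per query
        return weights @ V                        # weighted average of the value vectors

    # Toy shapes: 4 tokens, dimension 8 (arbitrary, for illustration only).
    rng = np.random.default_rng(0)
    Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
    print(attention(Q, K, V).shape)               # (4, 8)

Multi-head attention (10.1.3) is the same operation run several times on different learned projections of Q, K, and V, with the results concatenated.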
https://news.ycombinator.com/item?id=40213292 (Alice’s Adventures in a differentiable wonderland)
https://arxiv.org/pdf/2404.17625 (pdf)
https://news.ycombinator.com/item?id=40215592 (tensor and NN layer breadcrumbs)