Yeah, there are concepts in programming and math that are mostly self-teachable from first principles, and then there's material that reads like gibberish because it's too new to have been distilled into something tractable yet. I'd say arrays and matrices are straightforward to understand, while tensors are not, so I'm disappointed that so much of the literature currently revolves around tensors. Same for saying "embedding" instead of just "vector representation", etc.
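
To make that concrete, here's a minimal sketch in NumPy (my own illustration, not taken from any of the material linked below) of how the jargon maps onto things most programmers already know: a "tensor" in deep-learning code is just an n-dimensional array, and an "embedding" is just a vector representation looked up by index.

  import numpy as np

  scalar = np.float32(3.0)        # 0-d: a plain number
  vector = np.zeros(8)            # 1-d: an array
  matrix = np.zeros((4, 8))       # 2-d: a matrix
  tensor = np.zeros((2, 4, 8))    # 3-d and beyond: what the literature calls a "tensor"

  # An "embedding table" is just a matrix whose rows are vector
  # representations; embedding a token id is a row lookup.
  vocab_size, embed_dim = 10, 8
  embedding_table = np.random.default_rng(0).normal(size=(vocab_size, embed_dim))
  token_id = 3
  embedding = embedding_table[token_id]   # a plain length-8 vector
  print(tensor.shape, embedding.shape)    # (2, 4, 8) (8,)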

It helps me to think in terms of levels of abstraction rather than complexity. My education stopped at a 4-year degree, but AI is still mostly postgraduate material, so I have to translate it into what I know because I haven't internalized the lingo.

Here's the most approachable introduction to neural nets (NNs) and large language models (LLMs) that I've seen so far:

https://news.ycombinator.com/item?id=40213292 (Alice’s Adventures in a differentiable wonderland)

https://arxiv.org/pdf/2404.17625 (pdf)

https://news.ycombinator.com/item?id=40215592 (tensor and NN layer breadcrumbs)

  II A strange land 105
    7 Convolutional layers 107
      ..
      7.1.3 Translational equivariant layers 112
    ..
    9 Scaling up the models 143
      ..
      9.3 Dropout and normalization 151
        9.3.1 Regularization via dropout 152
        9.3.2 Batch (and layer) normalization 156
  
  III Down the rabbit-hole 167
    10 Transformer models 169
      10.1 Introduction 169
        10.1.1 Handling long-range and sparse dependencies 170
        10.1.2 The attention layer 172
        10.1.3 Multi-head attention 174
      10.2 Positional embeddings 177
        10.2.1 Permutation equivariance of the MHA layer 177
        10.2.2 Absolute positional embeddings 179
        10.2.3 Relative positional embeddings 182
      10.3 Building the transformer model 182
        10.3.1 The transformer block and model 182
        10.3.2 Class tokens and register tokens 184
    11 Transformers in practice 187
      11.1 Encoder-decoder transformers 187
        11.1.1 Causal multi-head attention 188
        11.1.2 Cross-attention 189
        11.1.3 The complete encoder-decoder transformer 190
      11.2 Computational considerations 191
        11.2.1 Time complexity and linear-time transformers 191
        11.2.2 Memory complexity and the online softmax 192
        11.2.3 The KV cache 194
        11.2.4 Transformers for images and audio 194
      11.3 Variants of the transformer block 197
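
Since the attention layer (10.1.2 above) is where most of these breadcrumbs lead, here's a minimal sketch of scaled dot-product attention in plain NumPy. It's my own illustration with assumed shapes, not code from the PDF, and it leaves out masking, multiple heads, and the learned projections:

  import numpy as np

  def softmax(x, axis=-1):
      x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
      e = np.exp(x)
      return e / e.sum(axis=axis, keepdims=True)

  def attention(Q, K, V):
      # Q, K, V: (sequence_length, head_dim) arrays of query/key/value vectors
      d = Q.shape[-1]
      scores = Q @ K.T / np.sqrt(d)        # pairwise similarity between positions
      weights = softmax(scores, axis=-1)   # each row sums to 1
      return weights @ V                   # weighted average of the value vectors

  rng = np.random.default_rng(0)
  Q, K, V = (rng.normal(size=(5, 8)) for _ in range(3))
  print(attention(Q, K, V).shape)          # (5, 8)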

