My opinion: it quickly gets into "the math behind LLMs", which makes no sense to me.
Words I understand but don't really get: weights, feed forward, layers, tensors, embeddings, normalization, transformers, attention, positioning, vector.
There's "programming" in the plumbing sense, where you move data around through files/sockets, and then there's this. For somebody without a math background/education, it's very unlikely you'll understand it; you end up just skimming Python without understanding the math behind the library calls it makes.
Yeah, there are concepts in programming and math that are mostly self-teachable from first principles, but then there's what looks like gibberish because it's too new to have been distilled into something tractable yet. I'd say arrays and matrices are straightforward to understand, while tensors are not, so I'm disappointed that so much of the current literature revolves around tensors. Same for saying "embedding" instead of just "vector representation", etc.
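For what it's worth, the scary words often map onto plain ideas: in libraries like PyTorch, a "tensor" is just an n-dimensional array, and an "embedding" is just a table of vectors you look rows up in. A rough sketch (assumes PyTorch is installed; all shapes and sizes here are made up for illustration):

```python
import torch

# A "tensor" is just an n-dimensional array.
scalar = torch.tensor(3.0)                   # 0-d, shape ()
vector = torch.tensor([1.0, 2.0, 3.0])       # 1-d, shape (3,)
matrix = torch.zeros(2, 3)                   # 2-d, shape (2, 3)
images = torch.zeros(8, 3, 32, 32)           # 4-d: (batch, channels, height, width)

# An "embedding" is a lookup table of vector representations, one row per token id.
vocab_size, embed_dim = 10_000, 64           # made-up sizes
table = torch.nn.Embedding(vocab_size, embed_dim)
token_ids = torch.tensor([42, 7, 1999])
vectors = table(token_ids)                   # shape (3, 64): one 64-d vector per token
print(vectors.shape)
```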
It helps me to think in terms of levels of abstraction rather than complexity. My education stopped at a 4-year degree, but AI is still mostly postgraduate material, so I have to translate it into what I know because I haven't internalized the lingo.
Here's the most approachable teaching of neural nets (NNs) and large language models (LLMs) that I've seen so far:
II A strange land
7 Convolutional layers
..
7.1.3 Translational equivariant layers
..
9 Scaling up the models
..
9.3 Dropout and normalization
9.3.1 Regularization via dropout
9.3.2 Batch (and layer) normalization
III Down the rabbit-hole
10 Transformer models
10.1 Introduction
10.1.1 Handling long-range and sparse dependencies
10.1.2 The attention layer
10.1.3 Multi-head attention
10.2 Positional embeddings
10.2.1 Permutation equivariance of the MHA layer
10.2.2 Absolute positional embeddings
10.2.3 Relative positional embeddings
10.3 Building the transformer model
10.3.1 The transformer block and model
10.3.2 Class tokens and register tokens
11 Transformers in practice
11.1 Encoder-decoder transformers
11.1.1 Causal multi-head attention
11.1.2 Cross-attention
11.1.3 The complete encoder-decoder transformer
11.2 Computational considerations
11.2.1 Time complexity and linear-time transformers
11.2.2 Memory complexity and the online softmax
11.2.3 The KV cache
11.2.4 Transformers for images and audio
11.3 Variants of the transformer block
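To make one of those chapter headings concrete: the attention layer (10.1.2) is, at its core, a few matrix multiplications and a softmax. This isn't taken from the book, just a minimal single-head sketch in PyTorch with made-up shapes:

```python
import math
import torch

def attention(q, k, v):
    # q, k, v: (sequence_length, head_dim)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))  # how much each query "attends" to each key
    weights = torch.softmax(scores, dim=-1)                    # each row sums to 1
    return weights @ v                                         # weighted average of the value vectors

seq_len, head_dim = 5, 16                                      # made-up sizes
q, k, v = (torch.randn(seq_len, head_dim) for _ in range(3))
out = attention(q, k, v)                                       # shape (5, 16)
print(out.shape)
```

Multi-head attention (10.1.3) is essentially this repeated several times in parallel with different learned projections, then concatenated.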
I recommend _Deep Learning with Python_ by François Chollet (the creator of Keras). It’s very clear and approachable, explains all of these concepts, and doesn’t try to “impress” you with unnecessary mathematical notation. Excellent introductory book.
The only downside is that in 2024 you are probably going to use PyTorch rather than Keras + TensorFlow as shown in the book.
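The translation is usually mechanical, though. As a rough sketch (not from the book; input/output sizes are just for illustration), here is the same tiny classifier in Keras and then in PyTorch, assuming flattened 28x28 inputs and 10 classes:

```python
# Keras (as used in the book)
from tensorflow import keras

keras_model = keras.Sequential([
    keras.Input(shape=(784,)),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(10, activation="softmax"),
])

# Roughly equivalent PyTorch model
import torch.nn as nn

torch_model = nn.Sequential(
    nn.Linear(784, 128),
    nn.ReLU(),
    nn.Linear(128, 10),   # softmax is usually folded into the loss (nn.CrossEntropyLoss)
)
```

The concepts (layers, activations, losses) carry over one-to-one; only the API names change.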
If you want to gain familiarity with the kind of terminology you mentioned here, but don't have a background in graduate-level mathematics (or even undergrad really), I highly recommend Andrew Ng's "Deep Learning Specialization" course on Coursera. It was made a few years ago but all of the fundamental concepts are still relevant today.
Fei-Fei Li and Andrej Karpathy's Stanford CS231N course is also a great intro to the basics of the math from an engineering-forward perspective. I'm pretty sure all the materials are online. You build up from the basic components to an image-focused CNN.
That's exactly where I'm at. Despite watching Karpathy's tutorial videos, I quickly got lost. My highest level of math education is Calculus 3, which I barely passed. This probably means I will only ever understand LLMs at a high level.
Signals and Systems is worth it in part just to get the notation and explanations. MIT has the course online for free (though it's probably a little more general than what you need, since the class is also used to prep electrical engineers for robotics and radio communication).