Okay, here's my attempt! First, we take a sequence of words and represent it as ...

Me1000 · on May 17, 2023

This was a very helpful visualization, thank you!

The "entanglement" part intuitively makes sense to me, but one bit I always get caught up on is the key, query, and value matrices. In every self-attention explanation I've read/watched, they tend to get thrown out there, similar to what you did here, but their usage/purpose is left a little vague.

Would you mind trying to explain those in more detail? I've heard the database analogy where you start with a query to get a set of keys which you then use to look up a value, but that doesn't really fit my mental model of neural networks.

Is it accurate to say that these separate QKV matrices are layers in the network? That doesn't seem exactly right since I think the self-attention layer as a whole contains these three different matrices. I would assume they got their names for a reason that should make it somewhat easy to explain their individual purposes and what they try to represent in the NN.

benjismith · on May 17, 2023

I'm still trying to get a handle on that part myself... But my ever-evolving understanding goes something like this:

The "Query" matrix is like a mask that is capable of selecting certain kinds of features from the context, while the "Key" matrix focuses the "Query" on specific locations in the context.

Using the Query + Key combination, we select and extract those features from the context matrix. And then we apply the "Value" matrix to those features in order to prepare them for feed-forward into the next layer.

There are multiple "Attention Heads" per layer (GPT-3 had 96 heads per layer), and each Head performs its own separate QKV operation. After applying those 96 Q+K->V attention operations per layer, the results are merged back into a single matrix so that they can be fed-forward into the next layer.
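A rough NumPy sketch of the "separate QKV operation per head, then merge" idea described above. Assumptions: the sizes are toy values, the weights are random rather than learned, tokens are laid out as rows, and the names (`multi_head_attention`, `w_q`, `w_o`, etc.) are made up for illustration. The softmax and the 1/sqrt(d_head) scaling come from the standard transformer formulation, not from this comment, and the causal mask is left out for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, n_heads):
    seq_len, d_model = x.shape
    d_head = d_model // n_heads

    def split(t):
        # (seq_len, d_model) -> (n_heads, seq_len, d_head)
        return t.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    # Each head gets its own slice of the Q, K, and V projections.
    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)

    # Per head: how strongly each token attends to every other token.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (n_heads, seq, seq)
    weights = softmax(scores, axis=-1)

    # Per head: weighted mix of the value vectors.
    heads = weights @ v  # (n_heads, seq, d_head)

    # Merge the heads back into a single (seq_len, d_model) matrix.
    merged = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return merged @ w_o, weights

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 5, 16, 4
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))

out, weights = multi_head_attention(x, w_q, w_k, w_v, w_o, n_heads)
print(out.shape, weights.shape)  # (5, 16) (4, 5, 5)
```

The output has the same shape as the input, which is what lets the result be passed on to the next layer.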

Or something like that...

I'm still trying to grok it myself, and if anyone here can shed more light on the details, I'd be very grateful!

I'm still trying to understand, for example, how many QKV matrices are actually stored in a model with a particular number of parameters. For example, in a GPT-NeoX-20B model (with 20 billion params) how many distinct Q, K, and V matrices are there, and what is their dimensionality?

EDIT:

I just read Imnimo's comment below, and it provides a much better explanation about QKV vectors. I learned a lot!

ActorNightly · on May 18, 2023

It's basically almost the same as convolution in image processing. For example, you take the 3-channel RGB value of a single pixel, do some math on it with the weighted values of the surrounding pixels, which gives you some value(s). Depending on the dimensions of everything, you can end up with a smaller-dimension output, like a single 3-channel RGB value, or a higher-dimension output (e.g. for a 5x5 kernel, you can end up with a 9x9 output).

The confusing part that doesn't get mentioned is that the input vectors (Q, K, V) are weighted, i.e. they are derived from the input with the standard linear transformation y = A*x + b, where x is the input word, A is the linear layer matrix, and b is the bias. Those weights are the things that are learned through the training process.
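To make the y = A*x + b point concrete, here is a minimal sketch for a single token. Assumptions: the dimensions are toy values, the A matrices are random stand-ins for learned weights, the biases are zero, and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head = 8, 4

# x: the embedding vector of one input token.
x = rng.normal(size=d_model)

# One (A, b) pair per projection. In a trained model these are learned;
# here they are random (A) and zero (b) purely for illustration.
A_q, b_q = rng.normal(size=(d_head, d_model)), np.zeros(d_head)
A_k, b_k = rng.normal(size=(d_head, d_model)), np.zeros(d_head)
A_v, b_v = rng.normal(size=(d_head, d_model)), np.zeros(d_head)

# The same input x yields three different vectors.
q = A_q @ x + b_q
k = A_k @ x + b_k
v = A_v @ x + b_v

print(q.shape, k.shape, v.shape)  # (4,) (4,) (4,)
```

So Q, K, and V are not three separate inputs: they are three different learned views of the same token embedding.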

detrites · on May 17, 2023

That was incredible. Thank you! If you made it into an article with images showing the mask/filter analogy, it might be one of the best/most unique explanations I've seen. Love the ground-up approach beginning with data's shape.

Reminded me of the style of a book on machine learning. If anyone liked this explanation, you may appreciate this book:

https://www.amazon.com/Applied-Machine-Learning-Engineers-Al...

Too · on May 18, 2023

If it only generates one word at a time and then repeats the process again, how does it know when to stop?

It feels like this method would create endless ramblings. But we all know you can ask ChatGPT to “summarize in one sentence” and it pulls it off. When speaking yourself, you sort of have to think about how to finish a sentence before you start it in order to explain something cohesively, so surely there must be something similar in the AI?

joshhart · on May 18, 2023

One of the “words” is a stop token, which represents the ending of text. So you can say the thing that maximizes coherence is to stop right then.
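A toy sketch of that loop, assuming a hypothetical `toy_next_token` function standing in for the model (it just replays a canned continuation) and a made-up `<stop>` token string:

```python
STOP = "<stop>"

def toy_next_token(tokens):
    # Hypothetical stand-in for a language model: replays a canned
    # continuation, then emits the stop token.
    canned = ["The", "cat", "sat", ".", STOP]
    return canned[len(tokens)] if len(tokens) < len(canned) else STOP

def generate(next_token_fn, prompt, max_new_tokens=50):
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tok = next_token_fn(tokens)
        if tok == STOP:
            # The model "chose" to end the text, so stop generating.
            break
        tokens.append(tok)
    return tokens

result = generate(toy_next_token, [])
print(result)  # ['The', 'cat', 'sat', '.']
```

The `max_new_tokens` cap is a safety net for the case where the stop token never comes up.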

throw310822 · on May 18, 2023

I have a very dumb question, I'll just throw it here: I understand word embeddings and tokenisation, and the value of each; but how can the two work together? Are embeddings calculated for tokens, and in that case, how useful are they, given that each token is just a fragment of a word, often with little or no semantic meaning?

lhnz · on May 18, 2023

I've heard that nowadays subword/token embeddings are learned during the training phase, and that they are useful for reconstructing the embeddings of words that contain them, and in fact allow the model to handle typos like "aple" (instead of "apple").

coppsilgold · on May 20, 2023

The way transformers operate is by transforming the embedding space through each layer. You could say that all the "understanding" is happening in that high dimensional space - that of a single token, but multiplied by the number of tokens. Seeding the embedding space with some learned value for each token is helpful. Think of it as just a vector database: token -> vector.
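The "token -> vector" lookup can be sketched as a plain table index. Assumptions: the four-entry vocabulary and the way "apple" and "aple" are split are invented for illustration (a real tokenizer may split them differently), and the table is random rather than learned.

```python
import numpy as np

# Hypothetical toy subword vocabulary: token string -> row index.
vocab = {"app": 0, "le": 1, "a": 2, "ple": 3}
d_model = 4

rng = np.random.default_rng(0)
# One row per token. In a real model this table is learned during training.
embedding_table = rng.normal(size=(len(vocab), d_model))

def embed(tokens):
    ids = [vocab[t] for t in tokens]
    return embedding_table[ids]  # shape: (n_tokens, d_model)

apple = embed(["app", "le"])  # assumed split of "apple"
aple = embed(["a", "ple"])    # assumed split of the typo "aple"
print(apple.shape, aple.shape)  # (2, 4) (2, 4)
```

The per-token vectors are only the starting point; the layers above them are what combine fragments into something word-like.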

Decoder-only architectures (such as GPT) mask the token embedding interaction matrix (attention) such that each token embedding and all subsequent transformations only have access to preceding token embeddings (and transforms). This means that on output, only the last transformed token embedding has the full information of the entire context - and only it is capable of making predictions for the next token.
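A small sketch of that masking, assuming rows are the attending tokens and columns are the tokens being attended to. The all-zero scores are a placeholder chosen so the effect of the mask is easy to read, and `causal_attention_weights` is an illustrative name.

```python
import numpy as np

def causal_attention_weights(scores):
    n = scores.shape[0]
    # True above the diagonal = positions that come later in the sequence.
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    masked = np.where(future, -np.inf, scores)
    # Row-wise softmax; exp(-inf) = 0, so future tokens get zero weight.
    masked = masked - masked.max(axis=-1, keepdims=True)
    e = np.exp(masked)
    return e / e.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))  # placeholder scores: every pair equally relevant
w = causal_attention_weights(scores)
print(w)
```

With equal scores, the first row puts all its weight on the first token, while the last row spreads its weight evenly over all four: only the last position sees the whole context.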

This is done so that during training, you can simultaneously make 1000s (context size) of predictions - every final token embedding transform is predicting the next token. The alternative (Encoder architecture, where there is no masking and the first token can interact with the final token) would result in massively inefficient training for predicting the next token as each full context can only make a single prediction.
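The "many predictions from one context" point can be sketched by shifting the sequence by one position. The four-token sentence is made up for illustration:

```python
tokens = ["the", "cat", "sat", "down"]

# Inputs are every token but the last; targets are the same sequence
# shifted left by one, so position i is trained to predict token i + 1.
inputs, targets = tokens[:-1], tokens[1:]

# With causal masking, position i only sees inputs[: i + 1], so one pass
# over the sequence yields one training example per position.
pairs = [(inputs[: i + 1], targets[i]) for i in range(len(inputs))]

for context, target in pairs:
    print(context, "->", target)
# ['the'] -> cat
# ['the', 'cat'] -> sat
# ['the', 'cat', 'sat'] -> down
```

Without the mask, earlier positions could see the tokens they are supposed to predict, so only the prediction after the final token would be a fair one.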

tomhamer · on May 20, 2023

Disclaimer - someone from Marqo here.

Marqo supports E5 models: https://github.com/marqo-ai/marqo

hackernewds · on May 18, 2023

why are the words cols and the properties rows? seems counterintuitive

parpfish · on May 19, 2023

just tilt your head 90 degrees and it'll be fine.

this is rows/columns from a math/matrix/tensor perspective where they are the arbitrary first and second dimensions of a data-containing object.

it's not rows/columns from a database perspective where you expect columns to define a static schema and rows to be individual records.

noman-land · on May 20, 2023

Thank you for this.