I think it applies because lookups can be done by multiplying one-hot vectors (or masks) with matrices; this is roughly analogous to what happens when we multiply Q with K^T in a self-attention head.
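To make the analogy concrete, here's a minimal numpy sketch (the table and key vectors are made up for illustration): a one-hot vector times a matrix selects a row exactly, while a "softened" key blends rows.

```python
import numpy as np

# A 4-row "table" of 3-dimensional values.
table = np.array([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0],
                  [7.0, 8.0, 9.0],
                  [10.0, 11.0, 12.0]])

# A one-hot "key" selects exactly one row, like a hard lookup.
one_hot = np.array([0.0, 0.0, 1.0, 0.0])
hard = one_hot @ table          # -> [7., 8., 9.]

# A soft key blends rows, weighted by how strongly each is activated.
soft = np.array([0.0, 0.2, 0.7, 0.1])
blended = soft @ table          # a weighted mix, mostly row 2
```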
Matrix multiplication is differentiable: each output entry is a polynomial (a sum of products) of the inputs, so you can calculate partial derivatives of these operations. The ability to do that is what allows gradient descent optimization via the chain rule.
> Differentiable relative to what?
> What is (x) in the d(hashtable)/d(x) equation?
I think the derivative we actually care about is `∂L/∂(hashtable-lookup)`, but here the `hashtable-lookup` is actually the matrix multiplication mentioned above. We want to know how much the "loss" (or error) changes with respect to each of our "lookups". Knowing how each of our "lookups" causes our model to err from the expected output allows us to optimise it.
Note: it's not really a "lookup" in the normal sense of the word. Hashtable lookups are discontinuous since we either have a value for a particular key or we do not. Our matrix multiplication is sort of like a fuzzy, probabilistic, continuous form of lookup in which nearby keys get partially activated based on their similarity to the query, and in which a small change in this query causes continuous changes to the values produced.
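A small sketch of that continuity (keys, values, and queries are toy numbers): a softmax-weighted lookup where nudging the query nudges the output, instead of snapping between entries.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

keys = np.array([[1.0, 0.0],
                 [0.0, 1.0],
                 [1.0, 1.0]])
values = np.array([[10.0], [20.0], [30.0]])

def soft_lookup(query):
    weights = softmax(keys @ query)   # similarity of query to each key
    return weights @ values           # a blend of values, not a single pick

# Nearby queries give nearby outputs -- the lookup is continuous.
a = soft_lookup(np.array([1.0, 0.0]))
b = soft_lookup(np.array([1.0, 0.01]))
```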
As far as my understanding of the self-attention equation (e.g. `softmax(QK^T / sqrt(d_k))V`) goes, it's actually quite important that we get this fuzzy output in which lots of keys get partially activated for a particular query. If it only picked the maximum similarity and ignored the rest, there would be less information propagating through the network and it'd be harder for the network to learn relationships/interactions between inputs. This is why we divide `QK^T` by `sqrt(d_k)`: it passes a tighter range of values into the `softmax()` function (which importantly generates probabilities that sum to 1, but contains exponentials which give it a tendency to over-emphasize the maximum value and ignore other values if they are too far apart).
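The scaling effect can be seen directly (dimensions and random scores are illustrative, not from any real model): with `d_k = 64`, raw dot products have a spread of roughly `sqrt(d_k)`, and softmax over such large scores piles nearly all the probability on the maximum; dividing by `sqrt(d_k)` keeps the other keys in play.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d_k = 64
rng = np.random.default_rng(0)
q = rng.standard_normal(d_k)
K = rng.standard_normal((5, d_k))   # 5 keys

scores = K @ q                      # dot products grow in spread with d_k
unscaled = softmax(scores)          # nearly all mass on the max score
scaled = softmax(scores / np.sqrt(d_k))  # mass spread across keys
```

Dividing the scores by a constant greater than 1 always flattens the softmax output, so the scaled weights keep more keys partially activated.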
Read this: https://e2eml.school/transformers.html#table_lookup
And then read: https://e2eml.school/transformers.html#attention