This explanation walks you through the math and the corresponding code, but (at least in my case, maybe I'm dumb) it failed to help me understand why these steps are necessary or how the math relates to the intended outcome. As a result, I don't feel that I'm any closer to really understanding the heart of self-attention.
At the end of last year I put together a repository to try to show what self-attention achieves on a toy example: detecting whether a sequence of characters contains both "a" and "b".
The toy problem is useful because the model dimensionality is low enough to make visualization straightforward. The walkthrough also covers how things can go wrong and how the model can be improved.
It's not terse like nanoGPT or similar because the goal is a bit different. In particular, to build intuition about the attention computation, the intermediate tensors are named and persisted so they can be compared and visualized after the fact. Everything should be exactly reproducible locally too!
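To give a rough sense of what that setup looks like, here is a minimal sketch, not the repository's actual code: every name in it (ToyAttention, make_batch, VOCAB, and so on) is hypothetical. It trains a single-head self-attention layer on the "contains both 'a' and 'b'" task while stashing the intermediate tensors in a dict for later inspection.

    # Minimal sketch (hypothetical names, not the repo's code): single-head
    # self-attention on the toy task, with intermediates kept for plotting.
    import torch
    import torch.nn as nn

    VOCAB = "ab-"   # "a", "b", and a filler character
    D = 8           # dimensionality small enough to visualize easily
    SEQ_LEN = 6

    class ToyAttention(nn.Module):
        def __init__(self):
            super().__init__()
            self.embed = nn.Embedding(len(VOCAB), D)
            self.q = nn.Linear(D, D, bias=False)
            self.k = nn.Linear(D, D, bias=False)
            self.v = nn.Linear(D, D, bias=False)
            self.head = nn.Linear(D, 1)   # sequence-level yes/no readout
            self.trace = {}               # named, persisted intermediates

        def forward(self, ids):           # ids: (batch, seq)
            x = self.embed(ids)
            q, k, v = self.q(x), self.k(x), self.v(x)
            att = torch.softmax(q @ k.transpose(-2, -1) / D ** 0.5, dim=-1)
            out = att @ v
            # Keep the intermediates so they can be compared/visualized later.
            self.trace = {"q": q.detach(), "k": k.detach(),
                          "v": v.detach(), "att": att.detach()}
            return self.head(out.mean(dim=1)).squeeze(-1)   # pooled logit

    def make_batch(n=32):
        """Random sequences, labelled 1 iff both 'a' and 'b' appear."""
        ids = torch.randint(len(VOCAB), (n, SEQ_LEN))
        has_a = (ids == VOCAB.index("a")).any(dim=1)
        has_b = (ids == VOCAB.index("b")).any(dim=1)
        return ids, (has_a & has_b).float()

    torch.manual_seed(0)   # fixed seed for exact reproducibility
    model = ToyAttention()
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    for step in range(500):
        ids, y = make_batch()
        loss = nn.functional.binary_cross_entropy_with_logits(model(ids), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    print(loss.item(), model.trace["att"].shape)

After training, model.trace["att"] holds a (batch, seq, seq) attention map that can be plotted directly, which is the kind of post-hoc inspection the persisted tensors are meant to enable.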
I agree. It seems like the target audience is the experienced deep learning practitioner, which makes me wonder why such an audience would need this treatment. Why not just read the original paper?