We introduce a geometric framework for understanding Transformer language models through an analogy with General Relativity. In this view, keys and queries define a curved “space of meaning,” and attention acts like gravity, moving information across it. Layers represent discrete time steps where token representations evolve along curved—not straight—paths shaped by context. Through visualization and simulation experiments, we show that these trajectories indeed bend and reorient, confirming the presence of attention-induced curvature in embedding space.
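One way to make the "curved trajectory" claim concrete is to track a single token's hidden state through the layers of a small pretrained Transformer and measure how much each layer-to-layer step changes direction. The sketch below is only an illustration of that idea, not the authors' experiment: the choice of GPT-2, the example sentence, and the turning-angle metric are all assumptions.

```python
# Sketch: measure how a token's trajectory "bends" across Transformer layers.
# Model (GPT-2), input text, and the turning-angle metric are illustrative choices.
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

text = "Attention moves meaning through a curved space."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).hidden_states   # tuple of (layers + 1) tensors [1, seq, dim]

states = torch.stack(hidden).squeeze(1).numpy()   # [layers + 1, seq, dim]
token_idx = 2
path = states[:, token_idx, :]                    # one token's path through the layers
steps = np.diff(path, axis=0)                     # layer-to-layer displacement vectors

def turning_angle(u, v):
    """Angle in degrees between consecutive displacements; 0 means a straight path."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

angles = [turning_angle(steps[i], steps[i + 1]) for i in range(len(steps) - 1)]
print("turning angles per layer (degrees):", np.round(angles, 1))
```

Consistently nonzero turning angles are what the abstract refers to as attention-induced bending: the token does not move in a straight line through embedding space.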
The KL divergence of distributions P and Q measures how different P is from Q.
However, the KL divergence of P and Q is not the same as the KL divergence of Q and P.
Why?
Learn the intuition behind this in this friendly video.
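A quick numerical check makes the asymmetry visible. The two distributions below are made up purely for illustration; any pair that is not a simple relabeling of the other will do.

```python
# Worked example of the asymmetry: KL(P || Q) != KL(Q || P).
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence D(p || q) for discrete distributions (in nats)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

P = [0.8, 0.1, 0.1]
Q = [0.4, 0.3, 0.3]

print("KL(P || Q) =", round(kl(P, Q), 3))  # ~0.335
print("KL(Q || P) =", round(kl(Q, P), 3))  # ~0.382
```

The two numbers differ because KL weights the log-ratio by the first distribution: events that P considers likely but Q does not are penalized differently than the reverse.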
Imagine you have a red glove. Could you change its color to blue just by looking at it? In the real world you can't, but in the quantum world this kind of phenomenon is possible! Learn about it in this friendly video!
Quantum computers are fast. But is there a limit to how fast they can be? Learn about the speed of information, Shannon entropy, and information gain in this video!
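As a tiny taste of the entropy part: a fair coin carries more uncertainty than a biased one. The numbers below are toy values for illustration only.

```python
# Shannon entropy of a discrete distribution, in bits.
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # treat 0 * log(0) as 0
    return float(-np.sum(p * np.log2(p)))

print("fair coin:  ", entropy([0.5, 0.5]), "bits")           # 1.0
print("biased coin:", round(entropy([0.9, 0.1]), 3), "bits")  # ~0.469
```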
I like to see word embeddings as a universe where words fly through space like planets. The attention mechanism acts like the laws of gravity that govern this universe. Learn all about it in this video!
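The "gravity" in this analogy is scaled dot-product attention: each word is pulled toward a weighted mix of the words it attends to. Here is a minimal sketch of that computation, using random vectors as stand-ins for real embeddings.

```python
# Scaled dot-product attention with random stand-in vectors.
import numpy as np

rng = np.random.default_rng(0)
seq_len, dim = 5, 8
Q = rng.normal(size=(seq_len, dim))   # queries
K = rng.normal(size=(seq_len, dim))   # keys
V = rng.normal(size=(seq_len, dim))   # values

scores = Q @ K.T / np.sqrt(dim)                    # pairwise "pull" strengths
weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True)      # softmax over keys
output = weights @ V                               # each word drifts toward a weighted mix of the others

print(weights.round(2))
```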