
On this note I want to point at "Low-Rank Bottleneck in Multi-head Attention Models" [0], which details how attention inherently needs the per-head query/key dimension to match or exceed the sequence length to allow precise (and especially, sharp) targeting of individual positions.
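A minimal numpy sketch of the rank argument (dimensions are illustrative toy values, not from the paper): the attention logit matrix is a product of two n×d factors, so its rank is capped at d, and for d < n it cannot realize arbitrary (e.g. sharply one-hot) attention patterns over all n positions.

```python
import numpy as np

# Toy illustration of the low-rank bottleneck: with per-head
# query/key dimension d < sequence length n, the logit matrix
# Q @ K.T has rank at most d, so it cannot encode arbitrary
# targeting over all n positions.
n, d = 8, 2  # sequence length, head dimension (illustrative)
rng = np.random.default_rng(0)
Q = rng.normal(size=(n, d))
K = rng.normal(size=(n, d))

logits = Q @ K.T  # n x n attention logits before softmax
rank = np.linalg.matrix_rank(logits)
print(rank)  # at most d = 2, far below n = 8
```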

It may be that dimension-starved pretrained transformer models rely heavily on context being correctly "tagged" in all relevant aspects at the very instant it is inserted into the KV cache, e.g. necessitating that negation be prefixed to a fact instead of allowing post-fix negation. A common LLM chat pattern is telling the model it just produced a hallucination or wrong claim, and hoping this will help rather than hurt downstream performance as the chat continues. There the negation is heavily delayed, so it is absent from most of the tokens that encode the hallucinated claims in the KV cache; and for lack of sufficient positional precision, due to insufficient query dimensionality, the transformer cannot retroactively attribute the "that was wrong" signal to the hallucination tokens in a retrievable manner.

The result, of course, is the behavior we all experience: hallucinations are best corrected by editing the message that triggered them to include discouraging words, because otherwise the thread becomes near-useless from the hallucination polluting the context.

I do wonder whether anyone has figured out how to address this more scalably than naively raising the query dimension (back?) toward the sequence length.

[0]: https://arxiv.org/abs/2002.07028



