
> The idea is to inject the concept at an intensity where it's present but doesn't screw with the model's output distribution (i.e in the ALL CAPS example, the model doesn't start writing every word in ALL CAPS, so it can't just deduce the answer from the output).

It's a weaker result than that: almost all of an LLM's output distribution is lost at each step, because we only sample a single token from it. Models can't observe their past output distributions; likewise, they can't observe their current output distribution, or what the sampler picks from it, until the token has already been emitted, which is what causes the "seahorse emoji" confusion.
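
For a concrete picture of what gets thrown away, here's a minimal sketch (made-up logits, not any particular model's code) of how sampling collapses the full distribution down to one token id:

    import torch
    import torch.nn.functional as F

    vocab_size = 50_000
    logits = torch.randn(vocab_size)          # stand-in for the model's output logits
    probs = F.softmax(logits, dim=-1)         # full output distribution (~50k numbers)
    next_token = torch.multinomial(probs, 1)  # sampler collapses it to a single token id

    # On the next step the model only sees `next_token`; it has no way
    # to inspect `probs`, so nearly all of that information is discarded.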

What that "retroactive concept injection" technique does show is that there's a lot of unused room inside the latent space. That suggests there's headroom to make these models smarter if we didn't have to collapse everything down to a single sampled token at each step.
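
As for what "injection" roughly means mechanically, here's a hypothetical sketch of adding a scaled concept direction into a layer's activations. The names (`hidden_state`, `concept_vector`, `alpha`) and the strength value are illustrative, not the paper's actual setup:

    import torch

    d_model = 4096
    hidden_state = torch.randn(1, 10, d_model)   # [batch, seq, d_model] activations
    concept_vector = torch.randn(d_model)        # direction standing in for e.g. "ALL CAPS"
    concept_vector = concept_vector / concept_vector.norm()
    alpha = 4.0                                  # injection strength: strong enough to be
                                                 # present, weak enough not to visibly
                                                 # distort the output distribution
    injected = hidden_state + alpha * concept_vector  # broadcast-add into every position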


