Humans actually do both. We learn on-policy by exploring the consequences of our own behavior. But we also learn off-policy, say from expert demonstrations (the difference being that we can tell good behaviors from bad, and learn from a filtered list of what we consider good behaviors). In most off-policy RL, a lot of bad behaviors still end up in the training set, which slows training down.
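To make that filtering idea concrete, here's a minimal sketch in the spirit of filtered behavior cloning (the function names and the return cutoff are my own, purely illustrative): only demonstration trajectories above a return threshold are kept for the imitation update, instead of training on every off-policy sample regardless of quality.

```python
import numpy as np

def filter_demonstrations(trajectories, return_threshold):
    """Keep only trajectories whose total return clears the cutoff.

    Each trajectory is a dict with 'observations', 'actions', 'rewards'.
    This mimics "learn from a filtered list of good behaviors".
    """
    return [t for t in trajectories if np.sum(t["rewards"]) >= return_threshold]

def behavior_cloning_batch(filtered_trajectories):
    """Flatten the surviving trajectories into (obs, action) pairs
    for a plain supervised imitation update."""
    obs = np.concatenate([t["observations"] for t in filtered_trajectories])
    acts = np.concatenate([t["actions"] for t in filtered_trajectories])
    return obs, acts
```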
> difference being we can tell good behaviors from bad
Not always! That's what makes some expert demonstrations so fascinating: watching someone do something "completely wrong" (according to novice-level "best practice") and achieve superior results. Of course, sometimes this just means you can get away with that kind of technique (or that kind of blunder) if you're just that good.
Is it fair to say that both the "Say 'an'" and "Say 'astronomer'" output features would be present in this case, but "Say 'an'" gets more votes because it is the start of the sentence, and once "An" is sampled it further votes for the "Say 'astronomer'" feature?
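Here's a toy numerical sketch of that "voting" intuition (feature names and weights are made up for illustration only, not a claim about real model internals): features add votes to token logits, and the token that gets sampled feeds back into the next step's votes.

```python
import numpy as np

vocab = ["An", "The", "astronomer", "artist"]

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

# Step 1: at the sentence start, both hypothetical features are active,
# but the "Say 'an'" feature contributes the stronger vote.
logits_1 = np.zeros(len(vocab))
logits_1[vocab.index("An")] += 3.0          # "Say 'an'" feature votes
logits_1[vocab.index("astronomer")] += 1.0  # "Say 'astronomer'" votes weakly
first_token = vocab[int(np.argmax(softmax(logits_1)))]

# Step 2: having sampled "An", the context adds extra votes for "astronomer".
logits_2 = np.zeros(len(vocab))
logits_2[vocab.index("astronomer")] += 2.0      # base feature vote
if first_token == "An":
    logits_2[vocab.index("astronomer")] += 2.0  # feedback from the sampled token
second_token = vocab[int(np.argmax(softmax(logits_2)))]

print(first_token, second_token)  # -> An astronomer
```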
LLMs have induction heads that store such names as a sort of variable and copy them around for further processing.
If you think about it, copying information from the input and manipulating it is a much more sensible approach than memorizing it, especially for the long tail (where it might not be worth allocating "storage" in the network weights).
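As a rough illustration of what an induction head does (a toy sketch of the [A][B] ... [A] -> predict [B] pattern, not actual transformer code): the novel name never has to be memorized in the weights, it just gets copied forward from earlier in the context.

```python
def induction_copy(tokens):
    """Toy induction-head behavior: when the current token has appeared
    before, predict the token that followed its previous occurrence."""
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None

# "habogink" isn't stored anywhere; the head just copies what followed it last time.
print(induction_copy(["to", "habogink", "a", "car", ",", "so", "habogink"]))  # -> "a"
```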
Yeah, that's a good point about induction heads potentially just being clever copy/paste mechanisms for stuff in the prompt. If that's the case, it's less like real understanding and more like sophisticated pattern following, just like you said.
So the tricky part is figuring out which one is actually happening when we give it a weird task like the original "habogink" idea. Since we can't peek inside the black box, we have to rely on poking it with different prompts.
I played around with the 'habogink' prompt based on your idea, mostly by removing the car example to see whether it could handle the rule purely abstractly, and by trying different targets:
Test 1: Habogink Photosynthesis (No Example)
Prompt: "Let's define 'to habogink' something as performing the action typically associated with its primary function, but in reverse. Now, considering photosynthesis in a plant, what does it mean 'to habogink photosynthesis'? Describe the action."
Result: The models I tried (ChatGPT/DeepSeek) actually did well here. They didn't get confused even though there was no example. They also figured out that photosynthesis produces energy/sugar and pointed to respiration as the reverse. Seemed like more than just pattern-matching the prompt text.
Test 2: Habogink Justice (No Example)
Prompt: "Let's define 'to habogink' something as performing the action typically associated with its primary function, but in reverse. Now, considering Justice, what does it mean 'to habogink Justice'? Describe the action."
Result: This tripped them up. They mostly fell back into what looks like simple prompt manipulation – find a "function" for justice (like fairness) and just flip the word ("unfairness," "perverting justice"). They didn't really push back that the rule doesn't make sense for an abstract concept like justice. Felt much more mechanical.
The Kicker:
Then, I added this line to the end of the Justice prompt:
"If you recognize a concept is too abstract or multifaceted to be haboginked please explicitly state that and stop the haboginking process."
Result: With that explicit instruction, the models immediately changed their tune. They recognized 'Justice' was too abstract and said the rule didn't apply.
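If anyone wants to repeat this more systematically, here's a rough sketch of how the three variants could be scripted (shown with the OpenAI Python client; the model name is just a placeholder, swap in whatever you're testing):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RULE = ("Let's define 'to habogink' something as performing the action "
        "typically associated with its primary function, but in reverse.")

ESCAPE_HATCH = ("If you recognize a concept is too abstract or multifaceted "
                "to be haboginked please explicitly state that and stop the "
                "haboginking process.")

prompts = {
    "photosynthesis": f"{RULE} Now, considering photosynthesis in a plant, "
                      "what does it mean 'to habogink photosynthesis'? Describe the action.",
    "justice": f"{RULE} Now, considering Justice, "
               "what does it mean 'to habogink Justice'? Describe the action.",
    "justice_escape_hatch": f"{RULE} Now, considering Justice, "
                            "what does it mean 'to habogink Justice'? Describe the action. "
                            f"{ESCAPE_HATCH}",
}

for name, prompt in prompts.items():
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    print(f"--- {name} ---")
    print(response.choices[0].message.content)
```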
What it looks like:
It seems like the models can engage with concepts more deeply, but they default to the simpler "follow the prompt instructions literally" mode (your copy/manipulate idea) unless explicitly told otherwise. The capability might be there, but the default behavior is more superficial, and you have to specifically ask for deeper reasoning.
So, your point about it being a "sensible approach" for the LLM to just manipulate the input might be spot on – maybe that's its default, lazy path unless guided otherwise.