If you did hide its thinking it could do that. But I'm pretty sure what happens here is that it has to go through those tokens for it to be clear that it's doing things wrong.
What I think that happens:
1. There's a question about a somewhat obscure thing.
2. LLM will never know the answer for sure, it has access to this sort of statistical, probability based compressed database on all the facts of the World. Because this allows to store more facts by relating things to each other, but never with 100% certainty.
3. There are particular obscure cases where it hits its initial "statistical intuition" that something is true, so it starts outputting its thoughts as expected for a question where something is likely true. Perhaps you could analyze what it's indicating probabilities on "Yes" vs "No" to estimate its confidence. Perhaps it will show much less likelihood for "Yes", than if the question was for a horse emoji, but in this case "Yes" is still high enough threshold to go through instead of "No".
4. However when it has to explain the exact answer, it's impossible to output an answer because it's false. E.g. seahorse emoji does not exist and it has to output it, previous tokens where "Yes, it exists, it's X", the X will be answers semantically close in meaning.
5. The next token will have context that "Yes, seahorse emoji exists, it is "[HORSE EMOJI]".
Now it's clear that there's a conflict here, it's able to see that HORSE emoji is not seahorse emoji, but it had to output it in the line of previous tokens because the previous tokens statistically required an output of something.
What I think that happens:
1. There's a question about a somewhat obscure thing.
2. LLM will never know the answer for sure, it has access to this sort of statistical, probability based compressed database on all the facts of the World. Because this allows to store more facts by relating things to each other, but never with 100% certainty.
3. There are particular obscure cases where it hits its initial "statistical intuition" that something is true, so it starts outputting its thoughts as expected for a question where something is likely true. Perhaps you could analyze what it's indicating probabilities on "Yes" vs "No" to estimate its confidence. Perhaps it will show much less likelihood for "Yes", than if the question was for a horse emoji, but in this case "Yes" is still high enough threshold to go through instead of "No".
4. However when it has to explain the exact answer, it's impossible to output an answer because it's false. E.g. seahorse emoji does not exist and it has to output it, previous tokens where "Yes, it exists, it's X", the X will be answers semantically close in meaning.
5. The next token will have context that "Yes, seahorse emoji exists, it is "[HORSE EMOJI]". Now it's clear that there's a conflict here, it's able to see that HORSE emoji is not seahorse emoji, but it had to output it in the line of previous tokens because the previous tokens statistically required an output of something.