If a human learned only from tokenized representations of words, I'm not sure they would be any better than LLMs at inferring the number of letters in the underlying words.
While true, it is nevertheless a very easy test for telling humans and LLMs apart, so if you know the trick you can easily figure out who is the human and who is the AI.
It is a solvable problem, just not a very interesting or useful one, which is why no one (but possibly scammers) is currently employing letter-counter detectors and agents.
Sure, I'm not saying this to diss language models. I'm saying that (1) this specific failure mode means they were only passing the Turing test during the window before it became a well-known trick for detecting them as non-human, and (2) it took an unusual kind of human to realise the trick in the first place; it wasn't obvious to most of the ordinary people using them.
I don't know for sure, but I suspect most people right now are using style and tone as authorship hints, which is even easier to get around by adding "reply in style of ${famous writer}" to the custom instructions.
Attention compute scales roughly as N^2 with input length, so feeding the model individual letters instead of tokens would just inflate the sequence length and hurt the effective context window compared with a good tokenization.
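To put rough numbers on it (this is just a back-of-the-envelope sketch assuming ~4 characters per token for English text, not any specific tokenizer's figure):

```python
def attention_cost_ratio(n_tokens: int, avg_chars_per_token: float = 4.0) -> float:
    """Character-level vs token-level self-attention cost for the same text,
    using the rough N^2 scaling of attention with sequence length."""
    n_chars = n_tokens * avg_chars_per_token
    return (n_chars ** 2) / (n_tokens ** 2)

# A 2,000-token prompt becomes ~8,000 characters: roughly 16x the attention
# compute, or about a quarter of the usable context for the same budget.
print(attention_cost_ratio(2000))  # 16.0
```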
We could still have the model paste the word into Python and count the letters in a hidden thinking trace if we wanted to patch that part of the Turing test rather than focus on useful things, but solving the Turing test is basically pursuing a deception goal instead of working on useful assistants. It's not really the goal of these systems outside of their use in North Korean scam bots etc.
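Something like this is all that hidden step would need (a hypothetical helper sketched here, not any particular vendor's tool API):

```python
def count_letter(word: str, letter: str) -> int:
    """Exact, case-insensitive count of one letter in a word,
    sidestepping the token-level blindness to individual characters."""
    return word.lower().count(letter.lower())

print(count_letter("strawberry", "r"))  # 3
```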
I still think it's useful to say we've essentially solved the Turing test even if there are these caveats about how it is optimized in practice.