One thing to consider: we don’t know if these LLMs are wrapped with server-side logic that injects randomness (e.g. actual code calling an external RNG). The outputs might not come purely from the model's token probabilities, but from some opaque post-processing layer. That’s a major blind spot in this kind of testing.
The core of an LLM is completely deterministic: given the same input, the forward pass produces the same probability distribution over next tokens. The randomness you see in LLM output comes from the sampling step applied to that distribution after the pure neural-net part has run, and that step exists explicitly to inject randomness into the generation process.
This is what the “temperature” parameter of an LLM controls. Setting the temperature to 0 effectively disables that randomness (sampling collapses to always picking the most likely token), but the result is very boring output that’s prone to getting stuck in never-ending repetitive loops.
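Roughly what that sampling step looks like, as a minimal numpy sketch (the function name and toy logits are mine, not from any particular library):

```python
import numpy as np

def sample_next_token(logits: np.ndarray, temperature: float, rng: np.random.Generator) -> int:
    """Pick the next token id from the model's raw logits.

    temperature > 0: scale logits, softmax, then draw from the distribution.
    temperature == 0: degenerate case, just take the argmax (greedy decoding).
    """
    if temperature == 0:
        return int(np.argmax(logits))            # fully deterministic, no randomness at all
    scaled = logits / temperature                # lower T sharpens the distribution, higher T flattens it
    probs = np.exp(scaled - scaled.max())        # numerically stable softmax
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))  # the only random step in the whole pipeline

# toy example: three candidate tokens with raw scores
rng = np.random.default_rng()
logits = np.array([2.0, 1.0, 0.1])
print(sample_next_token(logits, temperature=1.0, rng=rng))  # varies from run to run
print(sample_next_token(logits, temperature=0.0, rng=rng))  # always token 0
```

Everything upstream of that `rng.choice` call is a pure function of the input tokens; the sampler is where the dice get rolled.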
You're right, although tests like this have been done many times locally as well. The issue comes from the fact that RL fine-tuning usually collapses the variance of the next-token distribution, disproportionately narrowing it to 2-3 likely choices even in cases where the uncertainty calls for hundreds. This is also a major factor behind fixed LLM stereotypes and -isms. Base models usually don't exhibit that behavior and have sufficient randomness.
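If you want to see that collapse directly, here's a rough sketch using Hugging Face transformers that compares how concentrated the next-token distribution is for a base model vs. an RL-tuned one (the instruct model name is a placeholder, swap in whatever you have on disk):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def next_token_stats(model_name: str, prompt: str, k: int = 3):
    """Return the entropy of the next-token distribution and the probability
    mass captured by the k most likely tokens."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    with torch.no_grad():
        logits = model(**tok(prompt, return_tensors="pt")).logits[0, -1]
    probs = torch.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-12)).sum().item()   # in nats
    top_k_mass = probs.topk(k).values.sum().item()
    return entropy, top_k_mass

prompt = "Pick a random number between 1 and 100: "
for name in ["gpt2", "your-favorite-instruct-model"]:  # second name is a placeholder
    h, mass = next_token_stats(name, prompt)
    print(f"{name}: entropy={h:.2f} nats, top-{3} mass={mass:.2%}")
```

In my experience the RL-tuned checkpoint puts far more of its probability mass on the top handful of tokens for prompts like this, which is exactly the narrowing described above.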