Hacker News

As far as I remember, SolidGoldMagikarp was a glitch caused by the enormous number of Reddit posts made by the same user ("SolidGoldMagikarp") in a specific subreddit (r/counting).

There was no problem with the token per se; the problem was that it acted like a strange attractor in the model's high-dimensional embedding space, disconnected from any useful information.

When the LLM was induced to emit it, the next predicted tokens would be random gibberish.
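You can check the tokenisation yourself with the tiktoken library. A minimal sketch (the exact token id is from memory, so treat it as illustrative):

    import tiktoken

    # r50k_base is the GPT-2/GPT-3-era BPE vocabulary where the
    # glitch tokens lived.
    enc = tiktoken.get_encoding("r50k_base")

    # Note the leading space: the BPE merges gave the whole
    # username its own single token.
    ids = enc.encode(" SolidGoldMagikarp")
    print(ids)              # a single id, e.g. [43453]
    print(enc.decode(ids))  # ' SolidGoldMagikarp'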



More or less. It was a string given its own token by the tokeniser because of the above, but it barely appeared in the LLM's training data (the tokeniser and the model were trained on different corpora). Thus it basically had no meaning for the model. (I think there are theories that the parts of the network associated with such tokens may have been repurposed for something else, which would explain why the presence of the token in the input messed things up so much.)
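One common heuristic for hunting such under-trained tokens (a sketch, not necessarily how the original investigators did it): embedding rows that never received gradient updates stay near their initialisation, so they sit unusually close to the mean embedding. With GPT-2 from the transformers library:

    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    model = GPT2LMHeadModel.from_pretrained("gpt2")
    tok = GPT2TokenizerFast.from_pretrained("gpt2")

    # Input embedding matrix, shape (vocab_size, hidden_dim).
    emb = model.get_input_embeddings().weight.detach()

    # Distance of every token's embedding from the mean embedding;
    # rarely- or never-trained tokens tend to cluster near the mean.
    dists = (emb - emb.mean(dim=0)).norm(dim=1)

    # The ten tokens closest to the mean are glitch-token candidates.
    for idx in dists.argsort()[:10]:
        print(int(idx), repr(tok.decode([int(idx)])), float(dists[idx]))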


gpt-oss has similar bad tokens.

https://fi-le.net/oss/



