
The paper published by Xiao et al. (2023)[0] states that "a surprisingly large amount of attention score is allocated to the initial tokens, irrespective of their relevance to the language modeling task" (p. 2). Does that mean that task prefixes used for LLM generation (e.g. "translate: [sentence]") are actually attention sinks? Or are they not? I don't really understand what they mean by "irrespective of their relevance to the language modeling task."

[0] https://arxiv.org/pdf/2309.17453.pdf



By "irrespective of their relevance to the language modeling task", the authors mean that the semantic meaning of the tokens is not important. These 4 tokens can be completely replaced by newlines (i.e. tokens with no semantic meaning), and the perplexity as measured on a book of 65k tokens is nearly unaffected.

The key point is that these tokens are just used to "offload" attention scores: softmax forces every attention row to sum to 1, so even a query with no strong match anywhere has to put its attention mass somewhere, and the initial tokens, being visible to every subsequent position, become the natural dumping ground. Their semantic meaning is irrelevant.
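A toy illustration of that offloading, with made-up scores rather than real attention weights:

    import numpy as np

    # softmax normalizes each row to sum to 1, so attention mass
    # cannot simply vanish when nothing is a good match.
    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    scores = np.full(8, -4.0)   # one query, weak similarity to all 8 keys
    scores[0] = 0.0             # ...except a mildly preferred "sink" at position 0

    attn = softmax(scores)
    print(attn)                 # ~0.89 lands on token 0, the rest spread thin
    print(attn.sum())           # 1.0: the budget has to be spent somewhere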



