
There is no reason "computer text" couldn't also span the whole Unicode character set. I don't think your comment is tangential, but I think the notion that some characters are special or more basic than others is a trap and leads to illogical designs. For example, many internet protocols are defined that way: ASCII only for all commands and responses, and then for i18n some convoluted encoding schemes (quoted-printable, UTF-7, Base64, etc.) are used to smuggle Unicode through as ASCII.

All of that goes away if your protocol is standardized on UTF-8. Then text is text and bytes are bytes.



Lots of operations people want to do on 'strings' don't make sense for Unicode text. There is no single 'length': there is the size in bytes under the various encodings, the number of code points, and the number of grapheme clusters. The last is often what people really want, but to get it you need to know about the fonts being used to render the text.
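In Python, for instance, the different notions of "length" already diverge for a single accented character:

```python
s = "e\u0301"  # "é" written as 'e' + U+0301 COMBINING ACUTE ACCENT

len(s)                      # 2  -- code points
len(s.encode("utf-8"))      # 3  -- bytes in UTF-8
len(s.encode("utf-16-le"))  # 4  -- bytes in UTF-16
# ...while the user-perceived length (grapheme clusters) is 1
```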

Similarly, people want to iterate over a string character by character, or take substrings by range, but with Unicode text that becomes iteration over code points and ranges of code points (unless you go all the way and use a text rendering system to give you grapheme clusters). Code points can be decomposed diacritic marks and the like, so you can't blindly insert or change code points at a given index, or take arbitrary substrings, without risking breaking the string: you can end up with accents on characters you didn't intend, or stranded at the end of a string, and probably plenty of other kinds of breakage I can't even think of. Functionality exists to deal with all this, but it's pretty burdensome (e.g. NSString has -rangeOfComposedCharacterSequencesForRange:).
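A concrete Python example of the slicing hazard described above, using a decomposed "é":

```python
import unicodedata

s = "cafe\u0301"  # "café" with a decomposed é: 5 code points

s[:4]   # "cafe" -- the accent silently fell off the slice
s[4:]   # "\u0301" -- a stranded combining mark

# Normalizing changes the code-point count, so indices computed
# on one form are wrong for the other:
len(unicodedata.normalize("NFC", s))  # 4
```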

That all adds up to a pretty hefty performance penalty, as well as potential layering violations (needing to consider fonts and rendering when parsing some protocol, if you really are going to treat strings as sequences of grapheme clusters).


It certainly is possible to split text into grapheme clusters without involving any font rendering; see UAX #29: http://www.unicode.org/reports/tr29/ Most text manipulation isn't performance critical, and when it is, you can always implement a fast path for mostly-ASCII text.
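To illustrate that segmentation needs no font data, here is a deliberately simplified sketch in Python that only attaches combining marks to the preceding base character. The real UAX #29 algorithm has many more rules (ZWJ emoji sequences, Hangul jamo, regional-indicator flags, etc.), so treat this as an approximation, not an implementation:

```python
import unicodedata

def graphemes(text):
    """Crude approximation of UAX #29 segmentation: a combining
    mark joins the cluster of the character before it."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

# "cafe" + combining acute: 5 code points, 4 user-perceived characters
print(len(graphemes("cafe\u0301")))  # 4
```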


I actually would prefer the computer string type to be an array of bytes. As many people have mentioned, this type distinction already exists in several languages (Python, Haskell, ...).

Though I think it would be useful to treat it as a sort of subtype of the human string, with UTF-8 as the default encoding, so that substitution or concatenation of a human and a computer string would yield a human string. This is where those languages usually fall short: you need an explicit conversion, so it doesn't work the way mixing integers and floats does.
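Python 3 shows the friction being described: the two types exist, but there is no implicit promotion between them the way there is from int to float:

```python
text = "caf\u00e9"      # human string (str)
raw = b"caf\xc3\xa9"    # computer string (bytes), UTF-8 on the wire

# text + raw raises TypeError: can't concat str to bytes.
# The conversion must be spelled out:
combined = text + raw.decode("utf-8")

# By contrast, int and float mix implicitly:
mixed = 1 + 2.5  # 3.5, no cast needed
```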



