
There is no reason "computer text" couldn't also span the whole Unicode character set. I don't think your comment is tangential, but I think the notion that some characters are special or more basic than others is a trap and leads to illogical designs. For example, many internet protocols are defined that way: ASCII only for all commands and responses, and then for i18n some convoluted encoding schemes (quoted-printable, UTF-7, Base64, etc.) are used to smuggle Unicode through as ASCII.

All of that goes away if your protocol is standardized on UTF-8. Then text is text and bytes are bytes.



Lots of operations people want to do on 'strings' don't make sense for Unicode text. There is no single 'length': there is the size in bytes under the various encodings, the number of code points, and the number of grapheme clusters. The last is often what people really want, but to get it you need to know about the fonts being used to render the text.
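In Python, for instance, the different notions of "length" already diverge for a single accented character:

```python
s = "e\u0301"  # "é" written as 'e' + U+0301 COMBINING ACUTE ACCENT

len(s)                      # 2  -- code points
len(s.encode("utf-8"))      # 3  -- bytes in UTF-8
len(s.encode("utf-16-le"))  # 4  -- bytes in UTF-16
# ...while the user-perceived length (grapheme clusters) is 1
```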

Similarly, people want to iterate over a string character by character, or take substrings by range, but with Unicode text that becomes iteration over code points and ranges of code points (unless you go all the way and use a text rendering system to give you grapheme clusters). Code points can be decomposed diacritic marks and the like, so you can't blindly insert or change code points at a given index, or take arbitrary substrings, without risking breaking the string: you can end up with accents on characters you didn't intend, or stranded at the end of a string, and probably plenty of other kinds of breakage I can't even think of. Functionality exists to deal with all this, but it's pretty burdensome (e.g. NSString has -rangeOfComposedCharacterSequencesForRange:).
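A concrete Python example of the slicing hazard described above, using a decomposed "é":

```python
import unicodedata

s = "cafe\u0301"  # "café" with a decomposed é: 5 code points

s[:4]   # "cafe" -- the accent silently fell off the slice
s[4:]   # "\u0301" -- a stranded combining mark

# Normalizing changes the code-point count, so indices computed
# on one form are wrong for the other:
len(unicodedata.normalize("NFC", s))  # 4
```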

That all adds up to a pretty hefty performance penalty, as well as potential layering violations (needing to consider fonts and rendering when parsing some protocol, if you really are going to treat strings as sequences of grapheme clusters).


It certainly is possible to split text into grapheme clusters without involving any font rendering; see UAX #29: http://www.unicode.org/reports/tr29/ Most text manipulation isn't performance critical, and when it is, you can always implement a fast path for mostly-ASCII text.
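To illustrate that segmentation needs no font data, here is a deliberately simplified sketch in Python that only attaches combining marks to the preceding base character. The real UAX #29 algorithm has many more rules (ZWJ emoji sequences, Hangul jamo, regional-indicator flags, etc.), so treat this as an approximation, not an implementation:

```python
import unicodedata

def graphemes(text):
    """Crude approximation of UAX #29 segmentation: a combining
    mark joins the cluster of the character before it."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

# "cafe" + combining acute: 5 code points, 4 user-perceived characters
print(len(graphemes("cafe\u0301")))  # 4
```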


I actually would prefer the computer string type to be an array of bytes. As many people have mentioned, this type distinction already exists in several languages (Python, Haskell, ...).

Though I think it would be useful to treat it as a sort of subtype of the human string, with UTF-8 as the default encoding, so that substitution or concatenation of a human and a computer string would yield a human string. This is where those languages usually fall short: you need an explicit conversion, so it doesn't work the way mixing integers and floats does.
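Python 3 shows the friction being described: the two types exist, but there is no implicit promotion between them the way there is from int to float:

```python
text = "caf\u00e9"      # human string (str)
raw = b"caf\xc3\xa9"    # computer string (bytes), UTF-8 on the wire

# text + raw raises TypeError: can't concat str to bytes.
# The conversion must be spelled out:
combined = text + raw.decode("utf-8")

# By contrast, int and float mix implicitly:
mixed = 1 + 2.5  # 3.5, no cast needed
```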



