
I don't like the way UTF-8 was clipped to about 1.1 million codepoints in 2003 to match the UTF-16 limit. The 2.1 billion codepoint capacity of the original 1993 UTF-8 proposal would've been far better. Go uses \Uffffffff as syntax to represent runes (\U plus 8 hex digits), which leaves room for the upper limit of the original UTF-8 proposal, so I wonder if it supports, or one day will support, the extended 5- and 6-byte sequences.
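
For anyone who hasn't seen the longer forms, here is a rough Python sketch of the pre-2003 byte layout (1 to 6 bytes, values up to 31 bits). The function name encode_utf8_31bit is made up for illustration, and unlike a strict modern encoder it does not reject surrogates or anything above 0x10FFFF.

```python
def encode_utf8_31bit(cp: int) -> bytes:
    """Encode cp with the pre-2003 UTF-8 layout (1 to 6 bytes, up to 31 bits).

    Hypothetical helper for illustration only; it performs no surrogate or
    0x10FFFF range checks, which a conforming modern encoder would require.
    """
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("value does not fit in 31 bits")
    if cp < 0x80:
        return bytes([cp])  # plain ASCII, single byte
    # (exclusive upper limit, lead-byte marker, number of continuation bytes)
    for limit, lead, n_cont in (
        (0x800, 0xC0, 1),
        (0x10000, 0xE0, 2),
        (0x200000, 0xF0, 3),
        (0x4000000, 0xF8, 4),   # 5-byte form, no longer valid UTF-8
        (0x80000000, 0xFC, 5),  # 6-byte form, no longer valid UTF-8
    ):
        if cp < limit:
            break
    # Lead byte carries the top bits; each continuation byte carries 6 bits.
    out = [lead | (cp >> (6 * n_cont))]
    for i in range(n_cont - 1, -1, -1):
        out.append(0x80 | ((cp >> (6 * i)) & 0x3F))
    return bytes(out)


if __name__ == "__main__":
    for value in (0x20AC, 0x10FFFF, 0x200000, 0x7FFFFFFF):
        print(hex(value), encode_utf8_31bit(value).hex(" "))
```

For values up to 0x10FFFF (surrogates aside) this produces the same bytes as today's UTF-8; the 5- and 6-byte forms only start at 0x200000.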

In fact, UTF-16 doesn't really have the 1 million character limit: by using the two private-use planes (F and 10) as 2nd-tier surrogates, we can encode every 4-byte value of UCS-4, and so everything in the original UTF-8 proposal.
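
To make the 2nd-tier idea concrete, here is a minimal Python sketch of one way it could work. This is not anything from a standard: the split (top 16 bits into plane F, bottom 16 bits into plane 10) and the function names are assumptions for illustration, and it ignores the noncharacters at the end of each plane.

```python
def to_surrogate_pair(cp: int) -> tuple[int, int]:
    """Standard UTF-16 surrogate pair for a supplementary code point."""
    cp -= 0x10000
    return 0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)


def from_surrogate_pair(high: int, low: int) -> int:
    """Inverse of to_surrogate_pair."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def encode_second_tier(value: int) -> list[int]:
    """Hypothetical scheme: a 32-bit value becomes four UTF-16 code units.

    Assumption: plane F (U+F0000..U+FFFFF) acts as the "high" 2nd-tier
    surrogate and plane 10 (U+100000..U+10FFFF) as the "low" one.
    """
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("value does not fit in 32 bits")
    hi = 0xF0000 + (value >> 16)       # top 16 bits, stored in plane F
    lo = 0x100000 + (value & 0xFFFF)   # bottom 16 bits, stored in plane 10
    return [*to_surrogate_pair(hi), *to_surrogate_pair(lo)]


def decode_second_tier(units: list[int]) -> int:
    """Recover the 32-bit value from the four code units."""
    hi = from_surrogate_pair(units[0], units[1])
    lo = from_surrogate_pair(units[2], units[3])
    if not (0xF0000 <= hi <= 0xFFFFF and 0x100000 <= lo <= 0x10FFFF):
        raise ValueError("not a 2nd-tier surrogate sequence")
    return ((hi - 0xF0000) << 16) | (lo - 0x100000)


if __name__ == "__main__":
    units = encode_second_tier(0x7FFFFFFF)
    print([hex(u) for u in units], hex(decode_second_tier(units)))
```

Each plane holds 65,536 codepoints, so a plane-F/plane-10 pair covers 65,536 × 65,536 = 2^32 values, at the cost of four 16-bit code units per character.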

I suspect the reason is more political than technical. unicode.org (http://www.unicode.org/faq/utf_bom.html#utf16-6) says "Both Unicode and ISO 10646 have policies in place that formally limit future code assignment to the integer range that can be expressed with current UTF-16 (0 to 1,114,111). Even if other encoding forms (i.e. other UTFs) can represent larger integers, these policies mean that all encoding forms will always represent the same set of characters. Over a million possible codes is far more than enough for the goal of Unicode of encoding characters, not glyphs. Unicode is not designed to encode arbitrary data."



What use would it have to have so much extra codepoint space?


2 planes (about 131,000) of private-use codepoints aren't enough, and because the top 2 planes of Unicode are designated private use, UTF-16 gives developers the option of extending them to 2.1 billion if they need it. I've wanted extra private-use space for generating Unihan characters by formula in the same way the 11,172 Korean Hangul syllables are generated from 24 Jamo. I'm sure many other developers come across other scenarios where 131,000 isn't enough for private use.
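
For reference, the Hangul case really is plain arithmetic. Below is a short Python sketch of the standard Unicode composition formula, which works on positional indices (19 leading consonants, 21 vowels, 27 trailing consonants plus "none") rather than on the basic letters directly; the function name compose_hangul is just for illustration.

```python
S_BASE = 0xAC00   # first precomposed Hangul syllable, U+AC00
L_COUNT = 19      # leading consonants
V_COUNT = 21      # vowels
T_COUNT = 28      # trailing consonants, with index 0 meaning "none"


def compose_hangul(l_index: int, v_index: int, t_index: int = 0) -> str:
    """Return the precomposed syllable for the given jamo indices."""
    if not (0 <= l_index < L_COUNT and 0 <= v_index < V_COUNT
            and 0 <= t_index < T_COUNT):
        raise ValueError("jamo index out of range")
    return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)


if __name__ == "__main__":
    print(L_COUNT * V_COUNT * T_COUNT)  # total number of syllables
    print(compose_hangul(18, 0, 4))     # leading index 18, vowel 0, trailing 4
```

The whole block of 19 × 21 × 28 = 11,172 syllables falls out of three small index ranges, which is the kind of generation by formula that would need far more room for Unihan.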

I'm simply saying that UTF-8 shouldn't be crippled in the Unicode/ISO spec to 21 bits, but should be extended to 31 bits as originally designed, because the technical reason given (i.e. that UTF-16 is only 21 bits) isn't actually true. The extra space should be assigned as more private-use characters. (Except of course the last two codepoints in each extra plane would be noncharacters as at present, and probably also the entire last 2 planes if the 2nd-tier "high surrogates" finish at the end of a plane.)


Part of the reason this is a problem is that someone probably once said "Who could need more than 16 bits' worth of codepoints?", so I'd err on the side of extra codepoint space.



