
Unicode isn't hard; dealing with software that doesn't use it is.



Software doesn't use it because the support our languages and systems offer for dealing with this stuff effectively is utter garbage. For example:

* The overwhelming majority of languages don't give you code-point level iteration over strings by default (and you probably want grapheme), most opting for code units — which is what an unsigned char ptr in C containing UTF-8 data will give you. (C++, Java, C#, Python < 3.3, and JavaScript all fall in this bucket)

* Linux and most (all?) POSIX OSes store filenames as a sequence of bytes. What human chooses a sequence of bytes to "name" their files?

* Things like "how wide will this character display as in my terminal" are either impossible to answer or answered with heuristics. Usually, it's not done at all; most DB CLIs I've used that output tabular data will break the column alignment if any non-ASCII text is output.
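
To make the terminal-width point concrete, here is a rough sketch of the kind of heuristic involved. The names displayWidth, padCell and WIDE_RANGES are made up for illustration, and the range table is a hand-picked subset of East Asian wide characters, not a complete or authoritative one. Real terminals may disagree with it.

```javascript
// Hypothetical helper: a crude column-width estimate, NOT a full wcwidth.
// The ranges are a hand-picked subset of East Asian wide characters.
const WIDE_RANGES = [
  [0x1100, 0x115f], // Hangul Jamo initial consonants
  [0x2e80, 0xa4cf], // CJK radicals, kana, CJK ideographs, Yi
  [0xac00, 0xd7a3], // Hangul syllables
  [0xf900, 0xfaff], // CJK compatibility ideographs
  [0xff00, 0xff60], // Fullwidth forms
  [0xffe0, 0xffe6], // Fullwidth signs
];

function displayWidth(text) {
  let width = 0;
  // for...of walks code points, so astral characters are not split.
  for (const ch of text) {
    // Nonspacing combining marks are assumed to take 0 columns.
    if (/\p{Mn}/u.test(ch)) continue;
    const cp = ch.codePointAt(0);
    const wide = WIDE_RANGES.some(([lo, hi]) => cp >= lo && cp <= hi);
    width += wide ? 2 : 1;
  }
  return width;
}

// Pad a table cell by estimated columns rather than by .length.
function padCell(text, columns) {
  return text + " ".repeat(Math.max(0, columns - displayWidth(text)));
}
```

Padding with padCell instead of by .length is what keeps columns lined up when a cell holds CJK text or combining accents, as far as the table above is right about the terminal in question.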

(Yes, some of this is in the name of "backwards compatibility".)
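
On the filename point, a small sketch (Node.js, using its built-in Buffer) of why "a sequence of bytes" is awkward to treat as text. The byte values are an arbitrary example chosen to be a legal POSIX filename that is not valid UTF-8.

```javascript
// "fo" followed by the lone byte 0xFF: fine as a POSIX filename,
// but not a valid UTF-8 sequence.
const rawName = Buffer.from([0x66, 0x6f, 0xff]);

// Decoding replaces the invalid byte with U+FFFD (the replacement character).
const decoded = rawName.toString("utf8");

// Re-encoding the decoded string does not give back the original bytes.
const reencoded = Buffer.from(decoded, "utf8");

console.log(rawName.equals(reencoded)); // false
```

So a program that round-trips such a name through a string can no longer open the file. In Node, asking fs.readdirSync for { encoding: "buffer" } is one way to keep the raw bytes.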


> * The overwhelming majority of languages don't give you code-point level iteration over strings by default (and you probably want grapheme), most opting for code units — which is what an unsigned char ptr in C containing UTF-8 data will give you. (C++, Java, C#, Python < 3.3, and JavaScript all fall in this bucket)

Writing for (let ch of str) in JavaScript iterates over code points, not UTF-16 code units.


TIL! (Though, note that both indexing and .length operate in code units in JS.)
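
A quick sketch of the difference, using a string with one character outside the BMP (the sample string is just an illustration):

```javascript
// "a", U+1F600 (outside the BMP, so two UTF-16 code units), "b"
const str = "a\u{1F600}b";

// for...of uses the string iterator, which yields whole code points.
const codePoints = [];
for (const ch of str) {
  codePoints.push(ch.codePointAt(0).toString(16));
}

// .length and indexing work in UTF-16 code units instead.
const codeUnits = [];
for (let i = 0; i < str.length; i++) {
  codeUnits.push(str.charCodeAt(i).toString(16));
}

console.log(codePoints); // [ '61', '1f600', '62' ]
console.log(codeUnits);  // [ '61', 'd83d', 'de00', '62' ]
console.log(str.length); // 4
```

Note that str[1] here is a lone surrogate, not the emoji, and that neither loop deals with graphemes.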



