Unicode isn't hard; dealing with software that doesn't use it is.

deathanatos · on May 29, 2017

Software doesn't use it because our languages' and our systems' support for dealing with this stuff effectively is utter garbage. For example:

* The overwhelming majority of languages don't give you code-point level iteration over strings by default (and you probably want grapheme), most opting for code units — which is what an unsigned char ptr in C containing UTF-8 data will give you. (C++, Java, C#, Python < 3.3, and JavaScript all fall in this bucket)

* Linux and most (all?) POSIX OSs store filenames as a sequence of bytes. What human chooses a sequence of bytes to "name" their files?

* Things like "how wide will this character display as in my terminal" are either impossible or done with heuristics. Usually, it's not done at all; most DB CLIs I've used that output tabular data will break the column alignment if any non-ASCII text is output.
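To make the terminal-width point concrete, here is a minimal Python sketch of the kind of heuristic involved. The helper names `display_width` and `pad` are made up for illustration, and the rule used (two columns for East Asian wide or fullwidth characters, zero for combining marks, one for everything else) is an approximation: it ignores control characters, zero-width joiners, and whatever the terminal itself decides to do.

```python
import unicodedata

def display_width(text: str) -> int:
    """Estimate how many terminal columns `text` occupies (heuristic only)."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            # Combining marks are drawn on top of the previous character.
            continue
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            # Wide and fullwidth characters usually take two columns.
            width += 2
        else:
            width += 1
    return width

def pad(text: str, columns: int) -> str:
    """Right-pad `text` with spaces up to `columns` estimated columns."""
    return text + " " * max(0, columns - display_width(text))
```

A table printer that pads with `len(text)` instead of something like `display_width(text)` is what produces the misaligned columns described above.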

(Yes, some of this is in the name of "backwards compatibility".)
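The filenames-as-bytes point can be sketched in Python too. The example name below is invented, and the helpers `is_valid_utf8`, `decode_name`, and `encode_name` are hypothetical. The `surrogateescape` error handler is the mechanism Python itself documents for round-tripping undecodable filename bytes on POSIX; the sketch assumes UTF-8 as the expected encoding.

```python
# A byte string that a POSIX filesystem would accept as a filename,
# but that is not valid UTF-8 (0xE9 is "é" in Latin-1).
raw_name = b"caf\xe9.txt"

def is_valid_utf8(raw: bytes) -> bool:
    """Return True if `raw` decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

def decode_name(raw: bytes) -> str:
    """Decode filename bytes, smuggling bad bytes through as lone surrogates."""
    return raw.decode("utf-8", errors="surrogateescape")

def encode_name(name: str) -> bytes:
    """Reverse decode_name, restoring the original bytes exactly."""
    return name.encode("utf-8", errors="surrogateescape")
```

The decoded string round-trips back to the same bytes, but it contains a lone surrogate, so it is not something you can safely print or send as UTF-8 without further handling.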

jcranmer · on May 29, 2017

> * The overwhelming majority of languages don't give you code-point level iteration over strings by default (and you probably want grapheme), most opting for code units — which is what an unsigned char ptr in C containing UTF-8 data will give you. (C++, Java, C#, Python < 3.3, and JavaScript all fall in this bucket)

In JavaScript, for (let ch of str) iterates over code points, not UCS-2 code units.

deathanatos · on May 30, 2017

TIL! (Though, note that both indexing and .length operate in code units in JS.)
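For anyone wanting to see the code point versus code unit gap in numbers, here is a small Python 3 sketch (Python 3.3 or later, where `str` is indexed by code point). The helper names are made up for illustration. The UTF-16 count is what a length measured in 16-bit code units would report, which is how the JS `.length` behaviour is described above.

```python
def utf16_code_units(text: str) -> int:
    """Number of 16-bit code units needed to encode `text` as UTF-16."""
    return len(text.encode("utf-16-le")) // 2

def utf8_code_units(text: str) -> int:
    """Number of bytes (8-bit code units) needed to encode `text` as UTF-8."""
    return len(text.encode("utf-8"))

# 'a' followed by U+1F600, a character outside the Basic Multilingual Plane.
sample = "a\U0001F600"

# Iterating a Python 3 str yields one item per code point.
code_points = [hex(ord(ch)) for ch in sample]
```

Here `len(sample)` is 2 and iteration yields two code points, while the same text takes 3 UTF-16 code units and 5 UTF-8 bytes. None of these counts is a grapheme count; that needs segmentation rules beyond what is shown here.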