Yup, that's in the standard: http://www.unicode.org/reports/tr39/tr39-13.html#C...

wahern · on May 30, 2017

There's also NFKC_Casefold, a technique used by RFC 5892 (among others) to limit the characters allowable in a domain name. The problem is that it also disallows 'A', because Casefold(NFKC('A')) != 'A'. I'm sure that's equally annoying in other languages. And in any event it makes NFKC_Casefold problematic for uses like parsing URLs from free-form Unicode text.
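To make the 'A' case concrete, here's a rough sketch in Python. The helper names `nfkc_casefold_approx` and `is_stable` are mine, not from any RFC or library, and the function only approximates the real NFKC_Casefold mapping: it skips the removal of default-ignorable code points that the Unicode definition includes, so treat it as an illustration rather than a conforming implementation.

```python
import unicodedata


def nfkc_casefold_approx(text: str) -> str:
    """Approximate NFKC_Casefold: NFKC-normalize, case-fold, normalize again.

    Assumption: this omits the removal of default-ignorable code points
    that the full Unicode NFKC_Casefold mapping performs.
    """
    folded = unicodedata.normalize("NFKC", text).casefold()
    # Case folding can produce sequences that are no longer normalized,
    # so normalize a second time.
    return unicodedata.normalize("NFKC", folded)


def is_stable(text: str) -> bool:
    """True if the text is left unchanged by the approximate mapping."""
    return nfkc_casefold_approx(text) == text


if __name__ == "__main__":
    # ASCII lowercase, ASCII uppercase, fullwidth capital A, sharp s
    for sample in ["a", "A", "\uff21", "\u00df"]:
        print(repr(sample), "->", repr(nfkc_casefold_approx(sample)),
              "stable" if is_stable(sample) else "unstable")
```

A filter that only accepts stable characters rejects plain ASCII 'A' along with the fullwidth 'Ａ', which is exactly what bites you when you try to reuse the same rule to pick URLs out of running text.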

Unicode specifications are incredibly thorough and well thought-out. The problem is that the Unicode spec isn't shippable software. It's not an implementation.

And there's no single implementation. Worse, nobody uses any particular implementation the same way, and rarely to its fullest extent. Compounding the problem, so much code is _proprietary_. You have no way to verify and track how such code will behave, so interoperability is difficult. For example, good luck trying to reproduce the behavior of Outlook, Mail.app, and gmail.com in terms of how each will highlight URLs in free-form text.

The only saving grace appears to be that the rest of the world, I assume, has grown accustomed to how broken American software is in terms of dealing with I18N issues. And Americans remain blissfully naive. I keep waiting for the other shoe to drop, for the day when managers will finally crack the whip at the behest of international customers and demand that engineers begin taking I18N seriously. But it hasn't happened yet. I've been waiting almost 15 years, accumulating skills and best practices that my employers don't seem to value very much. Oh well....