
It is actually hard.

https://eev.ee/blog/2015/09/12/dark-corners-of-unicode/

But sticking to only the part of Unicode that you understand/need is easy, sure.




Meh.

It's hard, because there's a lot more to learn and to do than if you stick to (say) ASCII and ignore the problems ASCII can't handle.

It's easy, because if you want to solve a sizable fraction of all the problems ASCII just gives up on, Unicode's remarkably simple.

In the eyes of a monoglot Brit who just wants the Latin alphabet and the pound sign, unicode probably seems like a lot of moving parts for such a simple goal.


Something as simple as moving the insertion point in response to an arrow key requires a big table of code point attributes and changes with every new version of Unicode. Seemingly simple questions like "how long is this string?" or "are these two strings equal?" have multiple answers and often the answer you need requires those big version-dependent tables.
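For instance, a quick Python sketch (the example string and the use of the standard unicodedata module are just for illustration) of how "how long is this string?" gets several answers. Counting grapheme clusters, which is what cursor movement actually needs, would take yet another version-dependent table or a third-party library:

    import unicodedata

    s = "cafe\u0301"  # 'e' followed by U+0301 COMBINING ACUTE ACCENT (decomposed "café")

    print(len(s))                                # 5 code points
    print(len(s.encode("utf-8")))                # 6 bytes in UTF-8
    print(len(unicodedata.normalize("NFC", s)))  # 4 code points after canonical composition
    print(s == "caf\u00e9")                      # False: visually identical strings compare unequal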

I think Unicode is about as simple as it can possibly be given the complexity of human language, but that doesn't make it simple.


A Brit hoping to encode the Queen's English in ASCII is, I'm afraid, somewhat naïve. An American could, of course, be perfectly happy with the ASCII approximation of "naive", but wouldn't that be a rather barbaric solution? ;)


For anything resembling sanely typeset text you’d also want apostrophes, proper “quotes” — as well as various forms of dashes and spaces. Plus, many non-trivial texts contain words in more than one language. I’d rather not return to the times of in-band codepage switching, or embedding foreign words as images.


This is why the development of character sets requires international coördination from the beginning. :)


Yeah. And then you'll get Latin-1, because everyone using computers is in Western Europe or uses ASCII ;)


But the fact that something is easier than something else doesn't make it easy by itself.

Paraphrasing the joke about new standards: we had a problem, so we created a beautiful abstraction. Now we have more problems, one of the new ones being normalization.

It doesn't undermine the good that Unicode brought, but you can't claim to just include some unilib.h and use its functions without understanding Unicode's quirks and its encodings, because some of the parameters wouldn't even make sense to you, like the normalization forms.
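To make the normalization-forms point concrete, a small Python sketch (the example string is made up for illustration): the canonical and compatibility forms give different results, and you can't choose between them without knowing what they mean.

    import unicodedata

    s = "\ufb01le \u2460"  # U+FB01 LATIN SMALL LIGATURE FI, then U+2460 CIRCLED DIGIT ONE

    print(unicodedata.normalize("NFC", s))   # 'ﬁle ①'  (canonical form keeps both characters)
    print(unicodedata.normalize("NFKC", s))  # 'file 1' (compatibility form folds them away)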


Wait. There are two possible cases:

1. Either you restrict yourself to the kind of text CP437/MCS/ASCII can handle (to name the three codecs in the blog posting). In that case unicode normalisation is a no-op (see the snippet after this list), and you can use unicode without understanding all its quirks.

2. Or you don't restrict the input, in which case unicode may be hard, but using CP437/MCS/ASCII will be incomparably harder.
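A quick Python check of case 1, using an ASCII-only string as a stand-in for restricted input (the string itself is arbitrary):

    import unicodedata

    ascii_text = "Plain ASCII text, nothing fancy."
    # For pure ASCII input, every normalisation form leaves the string untouched.
    for form in ("NFC", "NFD", "NFKC", "NFKD"):
        assert unicodedata.normalize(form, ascii_text) == ascii_text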


A rocket can take you to the Moon. Is it easy to operate? Or to learn how to? To maintain it and prepare it on the ground?

It's not just that it would be harder without one; you couldn't get into space at all, so the rocket is comparatively easier.

Is it still all easy, though?





