Encodings are fundamentally hard in our current code environment. If your language doesn't make you explicitly think about encodings, you are writing bad code. Period. Full stop. If your language does make you think about encodings and you just make it go away with compiler incantations, or by bumbling about until the problems seem to go away, sort of, as long as you don't poke them too hard, you are writing bad code. Period. Full stop. If your language has no support at all for encodings, may God have mercy on your soul.
That said, "convert everything to your native Unicode format at the edges and reconvert it back out at the edges" is at least a tolerable answer. You still lose things, but it puts you ahead of most programs. But few environments make even that really easy, because it turns out to be difficult to identify all the edges; sure, your web framework may emit and send unicode (and then again it may not...), but did you read files off your disk in the correct encoding? Does your database correctly handle encoding? Does all the other code that ever inputs or outputs anything handle Unicode correctly? Do you ever store something in a system that is really just for storing binary blobs, and forget about the encoding?
It's hard, it's tedious, and from what I've seen it's even harder and more tedious than it has to be, because so little of the system is usually built to make it work right; the people creating all your libraries were either ignorant of the issues or perhaps even contemptuous of them.
I have often thought about what change I would make in 1970, if I could, to fix a lot of modern code. Eliminating the null-delimited buffer is definitely number one, but establishing that there is no such thing as a "string" without an encoding label would be number two. Anywhere I see a "string" in the input or output specification for a function, I just cringe.
This has little to nothing to do with the current situation in ruby.
> That said, "convert everything to your native Unicode format at the edges and reconvert it back out at the edges" is at least a tolerable answer.
It obviously isn't for the Ruby developers. If it were, they would have chosen UTF-8 as the internal encoding, which they didn't, because they didn't consider this a tolerable answer. Even though you can get Ruby 1.9 to work this way, the approach can still cause some headaches.
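For what it's worth, 1.9 does let you approximate that model globally, though it is opt-in rather than the default; a minimal sketch, assuming the data at your edges really is UTF-8:

    # Ruby 1.9: make IO transcode to/from UTF-8 automatically.
    Encoding.default_external = Encoding::UTF_8  # what the outside world uses
    Encoding.default_internal = Encoding::UTF_8  # what strings from IO get converted to

    str = File.read("some_file.txt")  # hypothetical file
    str.encoding                      # => #<Encoding:UTF-8>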
"This has little to nothing to do with the current situation in ruby."
I was addressing the complaint that the encoding in Ruby is hard now, and it broke working code. Encoding is fundamentally hard, and if encoding used to be easy it is almost certainly because your old code got it wrong, and your old code probably wasn't working. I emphasize the "probably" because it is faintly possible that your old code really did work and now it really doesn't work, in which case I would understand the frustration, but if I were giving odds on the chance that the old code actually handled everything correctly I'd open the bidding at somewhere around 5:1 for a superstar encoding expert (working in a language with poor encoding labelling support), with the odds getting worse the further from that you get. There are some things that are just hard without language support even for experts.
> Encodings are fundamentally hard in our current code environment. If your language doesn't make you explicitly think about encodings, you are writing bad code. Period. Full stop. If your language does make you think about encodings and you just make it go away with compiler incantations, or by bumbling about until the problems seem to go away, sort of, as long as you don't poke them too hard, you are writing bad code. Period. Full stop. If your language has no support at all for encodings, may God have mercy on your soul.
This, to me, is the fundamental flaw. By now we should be able to have a single encoding that takes up as little space as possible while supporting every known character, and leaving room for more. Most machines now are at least 32-bit... that's almost 4.3 billion characters. Surely there are fewer than that in the world?
One of my "Grand Lifetime Projects" is to build a new programming language and a new OS built with it. Part of that will certainly be handling strings in an efficient way, both in terms of computer and programmer time. I have some ideas swirling around for creating One True Encoding that allows for extensibility.
"This, to me, is the fundamental flaw. By now we should be able to have a single encoding that takes up as little space as possible while supporting every known character, and leaving room for more."
That's pretty much UTF-8. If you're going to stuff everything into one encoding, you are going to have to make tradeoffs.
See also UCS-4, which is simply "throw 32 bits at every character". Nobody uses it because it makes everything pretty big. (At least a 3-byte CJK character tends to mean more on average than a single English character.)
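To make the size tradeoff concrete, a quick Ruby 1.9 comparison (UTF-32 standing in for UCS-4):

    # encoding: utf-8
    ["a", "é", "語"].each do |ch|            # ASCII, Latin, CJK
      utf8  = ch.encode("UTF-8").bytesize    # 1, 2 and 3 bytes respectively
      utf32 = ch.encode("UTF-32BE").bytesize # always 4 bytes
      puts "#{ch}: UTF-8=#{utf8}, UTF-32=#{utf32}"
    end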
If you haven't already, take some time to read over the Unicode standards. It is very enlightening. This is especially true if you want to make the "one true encoding"; gotta know what the bar is to beat, right? There's way more to Unicode than just "Here's a catalog of all possible characters and here's numbers for them all", and there are reasons why there's way more than that.
Re: characters not covered by Unicode (or the BMP). Yes, they exist, but they are red herrings. Unregistered characters will always be with us. I fully guarantee that. But for a vast range of applications, Unicode does work.
I am from Korea, and one of the treasures of this country is the Tripitaka Koreana, a compilation of Buddhist texts carved in the 13th century. It is 52,382,960 characters long, Wikipedia tells me. There is a whole institute devoted to this document. The institute started encoding it in machine-readable form in 1993 and completed the first draft in 2000. In the process they discovered 23,385 new letterforms not registered anywhere. There are many such encoding projects yet to be completed. So yeah, Unicode won't cover everything. That is a given. And that's okay.
Sure, upgrading to Ruby 1.9.x is a hassle (character encodings, changes to the Array class, etc., do break some old code). That said, 1.9 gives a good performance boost that Ruby needs, so man up and just do it.
It is also a "public good" issue: the sooner everyone up-converts to 1.9, the easier it will be to develop with Ruby because all required gems will work, etc. I have whined quite a bit on up convert hassles on rubyplanet.net, so I do understand the author's pain + complaints, but we do all need to move forward.
The sooner? Ruby 1.9 has been around for more than a year now. 1.8 is still the standard in Ubuntu and about everywhere else.
With respect to the performance boost: startup time didn't improve, and IIRC the same is true for certain string operations. Both are important in the field where Ruby originated and where I personally still find it most useful -- scripting, or rather serving as a Perl replacement.
We use ruby at Spiceworks and are internally switching to 1.9.1. We are doing it for performance as well as internationalisation. While the encoding was an issue upfront (and we have guys converting our app from 1.8.6 as their primary focus), we do the edge-UTF8 approach, and we've updated a lot of the gems to 1.9 without waiting for others.
I am not sure anyone else has it right, meaning Java and Python. I read an article a month or two back detailing what others have done -- none are proper/complete solutions.
But from what I've seen in Ruby discussions, a lot of Rubyists are having problems with the new encoding system.
In C#, everything is assumed to be UTF-8 unless you explicitly change the encoding (the language-independent runtime is a bit more complicated). The only exception is where it is likely that some input is not UTF-8 (such as a byte array); in that case, you have to explicitly specify the encoding to use. It works pretty well; I never really had encoding problems in C#, and I've had my share of dealing with non-Latin characters in my apps.
I really can't recall the details, but there are still some Asian character sets (Chinese, IIRC) that are not fully covered. The "two-way conversion" fails. And this was with UTF-16, IIRC.
Some people may not like it, but this is exactly why I chose to force UTF-8 for http://github.com/brianmario/mysql2 for all the strings you get back (in 1.9), and the connection itself.
We've all dealt with improper use of encodings between applications, their persistence layer and their presentation. It's a nightmare unless you put your foot down and say "We're making everything Unicode, nothing comes in or leaves unless it is".
This obviously doesn't work for everyone but it's my experience that it will work for 99% of all use cases.
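For the mysql2 case above, if I remember the API right (treat the option name as an assumption), forcing UTF-8 looks roughly like:

    require "mysql2"

    # Ask for a UTF-8 connection; on 1.9 the result strings then come back
    # tagged as UTF-8 instead of ASCII-8BIT.
    client = Mysql2::Client.new(:host => "localhost", :username => "root",
                                :database => "app", :encoding => "utf8")
    client.query("SELECT name FROM users").each do |row|
      puts row["name"].encoding  # => UTF-8, assuming the stored data really is UTF-8
    end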
The author doesn't seem to understand the difference between an encoding and a character set. We already have a character set that denotes any possible character in the universe: Unicode. We also have several encodings that allow us to reference each of the characters in the set, the most well-known of which is UTF-8. However, UTF-8 is optimized for the Western code points, which is why alternate encodings exist. Moreover, there is all kinds of data in legacy encodings that we want to work with. Encodings are hard, but you can't go shopping without '?' showing up in your apps.
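The distinction is easy to see in 1.9, where one Unicode code point has different byte representations under different encodings:

    # encoding: utf-8
    ch = "é"                              # one character, code point U+00E9
    p ch.encode("UTF-8").bytes.to_a       # => [195, 169]
    p ch.encode("ISO-8859-1").bytes.to_a  # => [233]
    p ch.encode("UTF-16BE").bytes.to_a    # => [0, 233]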
The biggest problem is that everything, after a clean install, seems to default to latin1 or ASCII, so the first thing you need to do is run around to every single piece of software your app touches (the database, web forms, the OS, etc.) and make it send and receive Unicode. And God help you if you forget one.
I never understood how to do Unicode in Ruby 1.8, though. Or, to put it more succinctly, I could never be sure. Especially not with Rails; somehow I could not even find anything about that via Google. It seemed to somehow work, but I want to know what is going on - is my stuff UTF-8 or not?
As I understand it, Ruby 1.8 strings are simply treated as non-encoded sequences of bytes. It has some support for UTF-8, so as long as you can stick to UTF-8 and ISO-8859-1 (latin1, since the mapping is the same), think of strings as byte sequences being passed around, and don't need to do a lot of string manipulation, it's not horrible to work with. However, you do need to jump through hoops to use a number of the standard string methods like #length with multibyte characters (often, this means running the string through #scan(/./mu) before working with it: http://blog.grayproductions.net/articles/bytes_and_character...). Rails also has a UTF-8 handler to help with manipulating UTF-8 strings with multibyte characters, and I think it's on by default (http://api.rubyonrails.org/classes/ActiveSupport/Multibyte/H...)
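Concretely, the 1.8-era workaround mentioned above looks something like this (a sketch from memory, run under 1.8):

    $KCODE = "u"                    # 1.8 only: treat strings as UTF-8 in regexps
    str = "r\xc3\xa9sum\xc3\xa9"    # the UTF-8 bytes of "résumé"

    str.length              # => 8  (bytes, not characters)
    str.scan(/./mu).length  # => 6  (characters, via the UTF-8 regexp trick)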
In a sentence: Ruby 1.8 Strings are just sequences of bytes, so methods like length count bytes; Ruby 1.9 Strings are sequences of characters, so length counts characters. This makes a difference when you have multibyte characters and want an accurate length, a correct split, reverse, etc.
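A small illustration of the difference (assuming a UTF-8 source file):

    # encoding: utf-8
    s = "héllo"
    s.length   # 1.8: 6 (bytes)         1.9: 5 (characters)
    s.reverse  # 1.8: mangles the "é"   1.9: "olléh"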
I know that Strings are just non-encoded sequences of bytes. But how do I know whether/how Rails works? I.e., does it use some regular expressions somewhere that might fail, or not? When I was new to Rails they added an extra method to strings to convert them to a kind of Unicode string, but somehow that was later lost.
As I said, it seems to kind of work, but I would like to be sure what is going on.
I think there are trickier issues which a programming language has to deal with. For example, concatenating two strings that use different encodings: conversion is required, and specifying an encoding is required.
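In 1.9 this shows up as a hard error unless you transcode first; a small sketch:

    # encoding: utf-8
    utf8   = "héllo"
    latin1 = "wörld".encode("ISO-8859-1")

    begin
      utf8 + latin1                     # the encodings are incompatible
    rescue Encoding::CompatibilityError => e
      puts e.class
    end
    puts utf8 + latin1.encode("UTF-8")  # works after the explicit conversion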
And BTW, here there is a loss, since not all character sets are fully covered even in UTF-16 (IIRC). I am trying to recall - maybe Matz gave a detailed reply somewhere on the net.
> And BTW, here there is a loss, since not all character sets are fully covered even in UTF-16 (IIRC)
If they aren't covered in UTF-16, they wouldn't be covered in UTF-8 or UCS-4 either. All modern Unicode encodings (i.e. not UCS-2) can encode exactly the same data.
I'd be curious to know which character sets Unicode doesn't cover yet a different encoding system does.
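A quick way to convince yourself of that in 1.9, as long as the text is representable in Unicode at all:

    # encoding: utf-8
    s = "Tripitaka 대장경"  # mixed ASCII and Hangul

    round_trip = s.encode("UTF-16BE").encode("UTF-32BE").encode("UTF-8")
    round_trip == s  # => true; the encodings differ in byte layout, not in coverage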
That said, "convert everything to your native Unicode format at the edges and reconvert it back out at the edges" is at least a tolerable answer. You still lose things, but it puts you ahead of most programs. But few environments make even that really easy, because it turns out to be difficult to identify all the edges; sure, your web framework may emit and send unicode (and then again it may not...), but did you read files off your disk in the correct encoding? Does your database correctly handle encoding? Does all the other code that ever inputs or outputs anything handle Unicode correctly? Do you ever store something in a system that is really just for storing binary blobs, and forget about the encoding?
It's hard, it's tedious, and from what I've seen it's even harder and more tedious than it has to be because so little of the system is usually built to make it work right, because the people creating all your libraries were either ignorant or perhaps even contemptuous of the issues.
I have often thought about what change I would make in 1970 if I could to fix a lot of modern code. Eliminating the null-delimited buffer is definitely number one, but explaining that there is no such thing as a "string" without an encoding label would be number two. Anywhere I see a "string" in the input or output specification for a function I just cringe.