Arc's Unicode support (by the news.yc patch writer)

jbert · on Feb 8, 2008

Perl has fairly decent utf8 support, off by default:

    use utf8;

allows source to be written in utf8 (string literals, function+variable names etc).

Filehandles aren't utf8 by default, but can be put into utf8 mode with the 'binmode' function. (You can go further, and tag filehandles with pretty much any charset encoding and you'll get the Right Thing happening with reads and writes).

Other data sources (db handles etc) generally have some API to control this too.

As a convenience, the -CIO option to perl puts stdin and stdout into utf8 mode.

Defaulting to off is just the price of backward compatibility. Not sure what the plan is for perl6 in this regard.

bootload · on Feb 8, 2008

Won't work for me at least till py3k and is a good example of Python not leading.

  $ cat > foo.py
  # -*- coding: <utf-8> -*- 
  # even with the above character encoding
  def ô():
      return "ô"
  $ python foo.py
  File "foo.py", line 3
  SyntaxError: Non-ASCII character '\xc3' in file foo.py on line 3, but no
  encoding declared; see http://www.python.org/peps/pep-0263.html

Nothing you can do about this at the moment, as Python up to 2.5 is ASCII-only for strings [0] unless a source code encoding is declared. It gets worse: support for "non-ASCII identifiers" (shown by the bug above), outlined in PEP 3131, is SF, not SA. [1], [2]

Q: Do people use Unicode identifiers in their source code?

[0] The BDFL made defaulting to ASCII a design choice.

[1] http://www.python.org/dev/peps/pep-3131/

[2] http://www.python.org/dev/peps/

anewaccountname · on Feb 9, 2008

Since 'cat > foo.py' erases foo.py, you are obviously lying about that error message.

bayareaguy · on Feb 9, 2008

Hmm... I guess that makes me a liar too.

  powerbook.local 104> cat > foo.py
  # -*- coding: <utf-8> -*-
  def ô():
      return "ô"
  powerbook.local 105> python foo.py
    File "foo.py", line 2
  SyntaxError: Non-ASCII character '\xc3' in file foo.py on line 2, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
  powerbook.local 106>

Now if you actually go look at the web page in the error message, you should notice that the encoding doesn't include the < >.

But even when the encoding is set properly python still complains:

  powerbook.local 106> cat > foo.py
  # -*- coding: utf-8 -*-
  def ô():
      return "ô"
  powerbook.local 107> python foo.py
    File "foo.py", line 2
      def ô():
          ^
  SyntaxError: invalid syntax
  powerbook.local 108>
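
For comparison, here is a minimal sketch of the same file as it would look under Python 3 (the "py3k" mentioned upthread). It assumes a Python 3 interpreter, where the default source encoding is UTF-8 and PEP 3131 non-ASCII identifiers are accepted, so the coding line is optional. The `result` variable and the `print` call are added here for illustration and are not part of the original example.

```python
# -*- coding: utf-8 -*-
# Assumes Python 3: non-ASCII identifiers (PEP 3131) are accepted there.
def ô():
    return "ô"

# Added for illustration: call the function and show what it returns.
result = ô()
print(result)
```

Under Python 2.5 the same source still fails at the `def` line, as in the session above, because identifiers there are limited to ASCII regardless of the declared encoding.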

anewaccountname · on Feb 9, 2008

Are you sure you don't mean 'cat foo.py' or maybe 'cat - < foo.py'?

bayareaguy · on Feb 9, 2008

Yes, I'm sure.

cat > foo.py copies standard input into the file foo.py. The rest of the characters you're seeing in the message are what I typed in. The only character you don't see is the Ctrl-D I type to indicate EOF. This is a quick and easy way of creating foo.py without using an editor.
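
As a rough illustration of what `cat > foo.py` does, here is a Python sketch that copies an input stream into a file until the stream is exhausted. The helper name `cat_to_file` and the sample input are hypothetical; a `BytesIO` object stands in for what would be typed at the terminal before Ctrl-D.

```python
import io
import os
import shutil
import tempfile

def cat_to_file(stdin, path):
    """Copy everything from a binary stream into path, like `cat > path`."""
    # Open the target for writing, then copy until the stream reports EOF.
    with open(path, "wb") as out:
        shutil.copyfileobj(stdin, out)

# Hypothetical stand-in for the text typed before Ctrl-D (EOF).
typed = io.BytesIO('def f():\n    return "ô"\n'.encode("utf-8"))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "foo.py")
    cat_to_file(typed, path)
    with open(path, "rb") as f:
        written = f.read()
```

The file ends up containing exactly the bytes that were fed in, which is why no editor is needed.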

anewaccountname · on Feb 11, 2008

Ah, now I see.

bootload · on Feb 10, 2008

"... Since 'cat > foo.py' erases foo.py, you are obviously lying about that error message. ..."

True, this syntax erases the file. But all I did was cut and paste the example from the original source and add the source code encoding when posting. I knew the example code lacked the encoding, and I checked the example text for extra Unicode characters (another possible source of errors) as I saved it to the file and checked it with cat. So I tried the example on the CLI before I posted and overlooked the redirection. It's not worth lying; there are way too many sharp people here to pick you up on it. Just a simple display mistake.

Btw, does the inference that I'm lying also apply to the original author?

"... Some time later ..."

After a bit of digging, it appears 'cat > foo.py' works like this ...

"... Redirection of output causes the file whose name results from the expansion of word to be opened for writing on file descriptor n , or the standard output (file descriptor 1) if n is not specified. If the file does not exist it is created; if it does exist it is truncated to zero size. ..." [0]

So my theory is you have tried this with a full file & it has erased itself. Thanks for the pick-up though. It is a fine distinction and something I've not come across before. We are both right but at different times.

[0] http://www.wlug.org.nz/bash%281%29Part5
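
The redirection behaviour quoted above can be sketched in Python with the corresponding open flags. This is only an illustration under assumptions: the helper name `redirect_open` is hypothetical, and the 0o666 permission mode is an assumed value, not something stated in the quoted text.

```python
import os
import tempfile

def redirect_open(path):
    """Open path for writing the way the quoted text describes `> path`."""
    # O_CREAT: create the file if it does not exist.
    # O_TRUNC: truncate it to zero size if it does exist.
    # The 0o666 mode is an assumption made for this sketch.
    return os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o666)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "foo.py")

    # Case 1: the file does not exist yet, so it is created empty.
    os.close(redirect_open(path))
    size_when_created = os.path.getsize(path)

    # Case 2: the file already has content, so reopening truncates it.
    with open(path, "w") as f:
        f.write("print('old contents')\n")
    size_before = os.path.getsize(path)
    os.close(redirect_open(path))
    size_after = os.path.getsize(path)
```

The truncation happens when the file is opened, before anything is written, so any previous contents of foo.py are gone by the time cat starts reading input.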

marvin · on Feb 8, 2008

The referenced flamewars from comp.lang.lisp are, once again, priceless. I can't for the love of God understand that community.

iamelgringo · on Feb 8, 2008

And all the flamers owe PG and Patrick a big "Uh.... Sorry, Dude. My bad."

prescod · on Feb 8, 2008

The whole Unicode brouhaha was triggered by Paul's description of the issue. If he had said: "Arc doesn't support Unicode yet but I expect it would be easy and I welcome patches" then it would not have been an issue at all.

Instead, he said: "I realize that supporting only Ascii is uninternational to a point that's almost offensive, like calling Beijing Peking, or Roma Rome (hmm, wait a minute). But the kind of people who would be offended by that wouldn't like Arc anyway."

There was no reason to take a swipe at the "kind of people" who care about internationalization, which is to say:

* people who are not mono-lingual anglophones

* people who want to build real-world applications

Why shouldn't people in those categories be interested in Arc? Why should they be excluded? And why treat it as a matter of political correctness rather than just a technological decision?

Paul threw the first punch and the blogosphere punched back.

Furthermore, his general tendency to divide and conquer the programming world in that way is why there was already a huge pool of haters ready to pounce on him. Every essay of his implies that there are people who get it and people who don't, and one can distinguish between them by seeing which people agree with him and which do not.

He only needed to say: "Arc does not yet support Unicode" and the whole thing would have been avoided.

vlad · on Feb 9, 2008

With the understanding that arc was released as a tech demo of a personal project at this stage, I took what pg wrote to mean that he thought he had better things to do than make it do X (in this case, add and test international support). I think there were enough disclaimers in the announcement, web site, and posts to conclude this is all he meant. He simply released his project as it was.

Why should PG say that Unicode support will be easy and that he will accept patches? Wouldn't making promises be contrary to the entire disclaimer he wrote? (Have you read disclaimers before? Paul's addresses at least as many facets as those written by lawyers.)

If bloggers want to influence PG into feeling guilty in order to get him to spend more time on arc regardless of his disclaimer of "when it's done, it's done," they can do it by phrasing things like adults. Or, writing their own languages.

(Great post. Maybe PG wouldn't have realized this perspective had you not posted it for users to upvote. I know I didn't notice your perspective. On the other hand, I wouldn't change a thing. You're expecting PG to plan for all potential hurt feelings, be it Mac users, Linux users, corporations, people in remote locations without access to the Internet because this means they can't download arc, and more. This is impossible to do, and leads to nothing being released at all.)

prescod · on Feb 10, 2008

I'm not expecting PG to plan for hurt feelings. I'm expecting him simply not to go out of his way to hurt feelings. For no particular reason he took a technical time-to-release issue and turned it into a political correctness issue (and therefore made the whole debate around it political).

Simply take this paragraph:

"Which is why, incidentally, Arc only supports Ascii. MzScheme, which the current version of Arc compiles to, has some more advanced plan for dealing with characters. But it would probably have taken me a couple days to figure out how to interact with it, and I don't want to spend even one day dealing with character sets. Character sets are a black hole. I realize that supporting only Ascii is uninternational to a point that's almost offensive, like calling Beijing Peking, or Roma Rome (hmm, wait a minute). But the kind of people who would be offended by that wouldn't like Arc anyway."

And change it to:

"Currently, Arc only supports Ascii. MzScheme, which the current version of Arc compiles to, has some more advanced plan for dealing with characters. At some point (I don't know when) I or someone else on the Arc team will probably figure out how to take advantage of it."

That's all. Say less. Stick to the technology. Avoid politics. Controversy avoided.

olavk · on Feb 9, 2008

It should be noted that the released version of Arc actually did support Unicode. The only reason that it became a controversy was because PG said it was ASCII-only and implied it was a design decision. It is purely a communication issue.

(Btw, the same thing goes for the controversy over presentational markup vs CSS. AFAICT the Arc framework doesn't do any layout at all out of the box, including any presentational markup.)

Now PG seems to believe that controversy is inevitable when creating a new language, so you might as well throw a few random punches. However, if you want to build a vibrant community around a language, I think it is very important to choose your battles carefully and your swipes wisely.