Sorry to be so meta, but what on earth was the point of extracting programmers.stackexchange.com from stackoverflow.com? Is this why so many questions get closed as being "off topic" on stackoverflow now? </rant>
The original idea was to make it a bit of a waste bin for all the softer questions which had relevance to programmers but weren't directly related to programming. Remember, the SO model takes a pretty hard line on what's relevant in order to minimise the noise, and this was seen as a solution for questions which were interesting and semi-relevant but seemed too much like noise to many.
After a while Programmers was becoming a bit too much of a dumping ground, so they tightened up the rules to make it less random and chatty and more of a valid site in its own right for topics around software development which are not directly programming related.
> What programming methodology best fits my project and team?
A question like this tends to elicit a much more forceful response on P.SE than it will on Stack Overflow. Largely because the folks who moderate P.SE aren't terribly fond of the common perception that their site is just a dumping ground for questions that are too wishy-washy for SO.
Looking at their current FAQ, it seems that they've better defined (and perhaps somewhat redefined) what is to be asked there. Understandable, as I stopped browsing the site because the questions got annoying. Compare the old FAQ: http://web.archive.org/web/20100912194040/http://programmers...
This isn't a config file; this is the query string of a URL or, more importantly, the POST data of a form.
From the article:
> By default, older versions of IE (<=8) will submit form data in Latin-1 encoding if possible. By including a character that can't be expressed in Latin-1, IE is forced to use UTF-8 encoding for its form submissions, which simplifies various backend processes, for example database persistence.
> If the parameter was instead utf8=true then this wouldn't trigger the UTF-8 encoding in these browsers
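Concretely, here's roughly what the backend sees once the form is submitted (a Ruby sketch; the parameter names other than utf8 are made up):

    # Parameters as they arrive on the server after submission (sketch):
    params = { "utf8" => "✓", "q" => "snowman" }   # "q" is a hypothetical search field
    params["utf8"].bytes   # => [226, 156, 147]
    # Those are the UTF-8 bytes of U+2713, which has no Latin-1 representation,
    # so even IE <= 8 has to send the whole form as UTF-8.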
Of course it positively contributes. It's the first comment to point out that this is a silly way of encoding the form as UTF-8 when utf8="✘" would do exactly the same thing, even though that's counterintuitive.
It is only counterintuitive because you arbitrarily assigned it a different type (a boolean, with members you made up) instead of respecting that any UTF-8 character fits the UTF-8 type.
X is used variously as true and as false in different languages, so why would you assume it means false in a language you have never used?
Well, I didn't make it up, but yeah, I get your point. I didn't mean that I'd assume it means false, but it's still clearly counterintuitive to use a tick to encode as UTF-8 when any character at all will do. A tick has an opposite, an X; a snowman (as in the Rails case) doesn't.
All characters -- including every one in this post -- are Unicode characters. What makes the cross (or checkmark) useful here is precisely that it can't be represented in Latin-1.
It's because there's a false semantic implied by the checkmark (that it means true rather than false), when really the only thing that matters is the encoding of the character. So it's not the most elegant solution (but it is pretty cool).
No special construct at all, just a bunch of characters that are (or merely look like) upside-down versions of regular Latin letters. For example, the "upside-down" u is actually a normal n...
I sort of think the snowman, or at least another character that doesn't carry the connotation of "positive" or "true", is a better idea... but I have to admit any developer who would make that mistake is, alas, so far in over their head anyhow that there's probably no saving them from themselves, so, you know, check's cool.
To be fair, and as you basically said, having "utf8=✓" is counterintuitive when it's being used as an IE hack rather than an actual descriptive parameter. I could imagine many competent enough developers seeing this parameter in someone else's code, or in a book or something, and immediately thinking that changing the tick to an X would make a difference, because that'd only make sense.
I think that even vague competence would require analytical skills good enough to say: OK, utf8="some non-ASCII character"; now, if I change it to another non-ASCII character, is that likely to change my request's execution path on the server...
A better parameter name would help here; if it were called utf8char or something, it would be more descriptive of what it actually is, instead of seeming like a flag.
I've often wondered if they could get rid of this entirely in Rails by enclosing it in conditional comments, so that it is only included in forms sent by older IE:
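Something along these lines, perhaps (a sketch only; the helper name is made up, and it relies on downlevel-hidden conditional comments, which only IE parses):

    # Hypothetical view helper: wrap the enforcer field in an IE conditional
    # comment so that only IE <= 8 actually parses the hidden input; every
    # other browser treats the whole thing as an HTML comment and skips it.
    def utf8_enforcer_for_old_ie
      ('<!--[if lte IE 8]>' \
       '<input type="hidden" name="utf8" value="&#x2713;" />' \
       '<![endif]-->').html_safe
    end

The trade-off is that the backend could then no longer assume the parameter is present on every request, only on those from old IE.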
For simple GET queries like a search, the utf8 param is visible in the URL, so it can be ugly. There are ways round this of course, but as Rails is inserting it automatically, it would make sense if they only did it for the one browser which requires it.
They do. IIRC they used to use a Unicode snowman instead of a checkmark, but it was changed as the snowman wasn't deemed "enterprisey enough" or something.
Maybe when a service/product/framework etc. hits a specific point in its lifecycle (when it's trying to be enterprisey, for example), we can say that it has melted the snowman.
It's partially self-documenting. As mentioned elsewhere, there's a vague implication that putting something else there could mean "don't use UTF-8", whereas if you put the UTF-8 X character that corresponds to the check, you are still asking for UTF-8.
edited to add:
This seems related to a problem various "try to sound like English" programming languages (e.g. Inform) have, where it is easy to assume invalid syntax will be valid because it's valid English.
An X is also a commonly accepted way to select a checkbox though. The only sure way to indicate a checkbox as not selected is to leave it empty. That's the true opposite, and actually does work.
If you specify a form's data as UTF-8, but every character the user types in happens to also occur in Latin-1, IE will disobey what you've set and send it in Latin-1 instead. In many locales, perhaps most, that's going to be a very common occurrence, since Latin-1 covers seven of Abram de Swaan's twelve "supercentral languages" and three of the top ten languages by number of native speakers.
The point of the checkmark, therefore, is to put it in a hidden form field. That way, no matter what the user types, there will still be at least one non-Latin-1 character. That will force IE to use UTF-8, and you can verify this actually happened by checking the value of the form field: if it's not set correctly, then you know there may be trouble.
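A minimal sketch of that server-side check (assuming the Rails-style utf8 parameter; the method name is made up):

    # Did the hidden checkmark survive the round trip intact?
    # If IE fell back to Latin-1 it can't encode U+2713, so the value
    # reportedly arrives mangled (e.g. as the entity "&#10003;") instead.
    def utf8_submission?(params)
      params["utf8"] == "✓"
    end

    # e.g. log or re-decode the request when utf8_submission?(params) is false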
I think this only happens if the user manually switches to Latin-1 encoding. In that case IE will try to use the same encoding when submitting form data. The user might do this if you already have encoding problems and present a mixed Latin-1/UTF-8 page. The snowman hack serves to prevent the corruption from spreading.
> I think this only happens if the user manually switches to Latin-1 encoding.
That's correct. If you send your HTML document with a charset of UTF-8 (in the Content-Type header), then IE will submit forms using UTF-8 even if the user doesn't input any non-ASCII characters. Unless the user changes the encoding, but I have yet to hear a compelling reason why an ordinary user would do that under ordinary circumstances.
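Declaring that up front is just a response header; a bare-bones Rack-style sketch:

    # Response that states the page encoding explicitly, so browsers
    # (IE included) default to UTF-8 for form submissions from this page.
    html = "<!DOCTYPE html><html><body><form>...</form></body></html>"
    [200, { "Content-Type" => "text/html; charset=utf-8" }, [html]]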
> The snowman hack serves to prevent the corruption from spreading.
It's clever, but the framework could also just reject POST and GET requests which contain invalid UTF-8. (I'm flabbergasted that Ruby doesn't do this[1].) Otherwise a malicious user could try to inject non-UTF-8 bytes into your database by sending crafted requests which nevertheless contain the "utf8=✓" parameter. And speaking from experience, you do not want to have to deal with encoding problems in your database.
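A rough sketch of such a guard (plain Ruby; the method name is made up and nested params aren't handled):

    # Reject any request parameter that is not valid UTF-8, instead of
    # letting mangled bytes reach the database.
    def valid_utf8_params?(params)
      params.values.all? do |value|
        !value.is_a?(String) ||
          value.dup.force_encoding(Encoding::UTF_8).valid_encoding?
      end
    end

    # In a controller, something like:
    #   head :bad_request unless valid_utf8_params?(params)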
Whether or not you use this hack, you can't naively trust the client to always send valid UTF-8; you're right about this. But because of this bug in IE, rejecting posts with invalid UTF-8 as malicious will net you some false-positive cases, where the user isn't malicious but the browser is being stupid. This hack takes care of the stupidity, leading to a better user experience for people who would otherwise have tripped the false-positive.
What if a sequence of byte values is valid both in the charset that IE uses to encode the form data and in UTF-8, but is interpreted as different characters in UTF-8? With your method you would not detect an error and would end up with the wrong characters.
(Unless IE sends a Content-Type header with the actual encoding used, and this header is evaluated on the server side to convert the form data into a string. But in that case you don't have to check for invalid UTF-8, but for characters that are invalid in the charset specified in the Content-Type header.)
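To make the first case concrete: the two Latin-1 characters "Ã©" are byte-for-byte identical to the UTF-8 encoding of "é" (a Ruby sketch of the undetectable case):

    latin1 = "Ã©".encode("ISO-8859-1")    # two Latin-1 characters: bytes 0xC3 0xA9
    latin1.bytes                           # => [195, 169]
    # The same bytes also happen to be valid UTF-8 -- but for a different character:
    latin1.dup.force_encoding("UTF-8")     # => "é"  (valid, silently wrong)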
Yes, that took me a moment too, but Latin-1 is an 8-bit ASCII, and UTF-8 only encompasses 7-bit ASCII. Note I have to say an 8-bit ASCII, because there are numerous 8-bit ASCII encodings.
What? This is wrong. UTF-8 encodes a lot more than just ASCII.
UTF-8 is compatible with ASCII in that all of the characters ASCII and Unicode have in common are represented the same way in ASCII and UTF-8. Going beyond ASCII involves the introduction of multi-byte representations in UTF-8, and that takes you smoothly (that is, no surrogate pairs) out into the entire rest of Unicode. As a bonus, it's always possible to verify that a given string of bytes is valid UTF-8, given that there is a nontrivial structure imposed on UTF-8 multi-byte encodings that is very unlikely to occur by chance in any non-UTF-8 sequence of bytes.
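For example, that structural check can be done mechanically (a Ruby sketch):

    # A 0xC2 lead byte must be followed by a continuation byte (0x80-0xBF):
    "\xC2\xA2".force_encoding("UTF-8").valid_encoding?   # => true
    "\xC2\x41".force_encoding("UTF-8").valid_encoding?   # => false (continuation byte missing)
    "\xA2".force_encoding("UTF-8").valid_encoding?       # => false (stray continuation byte)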
I think the point was that anything above 7-bit ASCII will be represented differently in Latin-1 vs UTF-8; i.e. ¢ (U+00A2) is encoded as 0xA2 in Latin-1 and as the bytes 0xC2 0xA2 in UTF-8 - and the bytes 0xC2 0xA2 read as Latin-1 will be displayed as Â¢.
It gets far worse with 3-byte UTF-8 characters, but I don't believe any of them exist natively in Latin-1 (see: the euro symbol)
Assuming I'm reading these various character tables right, at least ;)
So a more accurate version of what you quoted would be "UTF-8 and Latin-1 only overlap for 7-bit ASCII"
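For what it's worth, those byte values can be sanity-checked directly (Ruby sketch):

    cent = "¢"
    cent.encode("ISO-8859-1").bytes     # => [162]        (0xA2 in Latin-1)
    cent.encode("UTF-8").bytes          # => [194, 162]   (0xC2 0xA2 in UTF-8)
    # And misreading the UTF-8 bytes as Latin-1 gives the classic mojibake:
    "\xC2\xA2".force_encoding("ISO-8859-1").encode("UTF-8")   # => "Â¢"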
That doesn't really make sense. If someone's intending to encode ¢ in utf-8, endianness does not come into play, as it's a stream of octets, not of anything larger that you can chunk such that you could swap bytes.
At any rate, if you were to encode ¢ in UTF-16BE, it would be 0x00a2, not 0xc2a2. If a piece of software then misinterpreted it as latin1, likely you'd get nothing at all due to the embedded NUL.
Indeed. I either completely misread the parent post, or else it said something different when I responded to it (knowing myself, I'm going with the former).
If the point of the field is just to make IE work correctly, wouldn't it be more appropriate to leave utf8 out of the name and write something like "ie=💩"?
If you want to check whether a browser supports a feature, it's preferable to check whether the browser supports the feature instead of whether the browser is a browser that supports the feature. The latter can cause problems when dealing with an unexpected browser.
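In this case the feature check is essentially free, since the submitted form itself carries the evidence (a sketch contrasting the two approaches; utf8 is the Rails-style parameter, the rest is made up):

    # Feature check: did this particular request actually arrive as UTF-8?
    utf8_ok = params["utf8"] == "✓"

    # Browser check (fragile): guess from the User-Agent string and hope the
    # list of "browsers that need the hack" never changes.
    old_ie = request.user_agent.to_s =~ /MSIE [1-8]\./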