Sorry to be so meta, but what on earth was the point of extracting programmers.stackexchange.com from stackoverflow.com? Is this why so many questions get closed as being "off topic" on stackoverflow now? </rant>
The original idea was to make it a bit of a waste bin for all the softer questions which had relevance to programmers but weren't directly related to programming. Remember, the SO model takes a pretty hard line on what's relevant in order to minimise the noise, and this was seen as a solution for questions which were interesting and semi-relevant but seemed too much like noise to many.
After a while Programmers was becoming a bit too much of a dumping ground, so they tightened up the rules to make it less random and chatty and more of a valid site in its own right for topics around software development which are not directly programming related.
> What programming methodology best fits my project and team?
A question like this tends to elicit a much more forceful response on P.SE than it will on Stack Overflow. Largely because the folks who moderate P.SE aren't terribly fond of the common perception that their site is just a dumping ground for questions that are too wishy-washy for SO.
Looking at their current FAQ, it seems that they've better defined (and perhaps somewhat redefined) what is to be asked there. Understandable, as I stopped browsing the site because the questions got annoying. Compare the old FAQ: http://web.archive.org/web/20100912194040/http://programmers...
This isn't a config file; this is the query string of a URL or, more importantly, the POST data of a form.
From the article:
> By default, older versions of IE (<=8) will submit form data in Latin-1 encoding if possible. By including a character that can't be expressed in Latin-1, IE is forced to use UTF-8 encoding for its form submissions, which simplifies various backend processes, for example database persistence.
> If the parameter was instead utf8=true then this wouldn't trigger the UTF-8 encoding in these browsers
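Concretely, here's roughly what the backend sees once the form is submitted (a Ruby sketch; the parameter names other than utf8 are made up):

    # Parameters as they arrive on the server after submission (sketch):
    params = { "utf8" => "✓", "q" => "snowman" }   # "q" is a hypothetical search field
    params["utf8"].bytes   # => [226, 156, 147]
    # Those are the UTF-8 bytes of U+2713, which has no Latin-1 representation,
    # so even IE <= 8 has to send the whole form as UTF-8.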
Of course it positively contributes. It's the first comment to point out that this is a silly way of encoding the form as UTF-8 when utf8="✘" would do exactly the same thing, even though that's counterintuitive.
It is only counterintuitive because you arbitrarily assigned it a different type (a boolean, with members you made up) instead of respecting that any UTF-8 character fits the UTF-8 type.
X is used variously as true and as false in different languages, so why would you assume it means false in a language you have never used?
Well, I didn't make it up, but yeah, I get your point. I didn't mean that I'd assume it means false, but it's still clearly counterintuitive to use a tick to encode as UTF-8 when any character at all will do. A tick has an opposite, an X; a snowman (as in the Rails case) doesn't.
All characters -- including every one in this post -- are Unicode characters. What makes the cross (or checkmark) useful here is precisely that it can't be represented in Latin-1.
It's because there's a false semantic implied by the checkmark (that it means true rather than false), when really the only thing that matters is the encoding of the character. So it's not the most elegant solution (but it is pretty cool).
No special construct at all, just a bunch of characters that are (or merely look like) upside-down versions of regular Latin letters. For example, the "upside-down" u is actually a normal n...
I sort of think the snowman, or at least another character that doesn't carry the connotation of "positive" or "true", is a better idea... but I have to admit any developer who would make that mistake is, alas, so far in over their head anyhow that there's probably no saving them from themselves, so, you know, check's cool.
To be fair, and as you basically said, having "utf8=✓" is counterintuitive when it's being used as an IE hack rather than an actual descriptive parameter. I could imagine many competent enough developers seeing this parameter in someone else's code, or in a book or something, and immediately thinking that changing the tick to an X would make a difference, because that'd only make sense.
I think that even vague competence would require analytical skills good enough to say: OK, utf8="some non-ASCII character"; now, if I change it to another non-ASCII character, is that likely to change my request's execution path on the server...
A better parameter name would help here; if it were called utf8char or something, it would be more descriptive of what it actually is, instead of seeming like a flag.
I've often wondered if they could get rid of this entirely in Rails by enclosing it in conditional comments, so that it is only included in forms sent by older IE:
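Something along these lines, perhaps (a sketch only; the helper name is made up, and it relies on downlevel-hidden conditional comments, which only IE parses):

    # Hypothetical view helper: wrap the enforcer field in an IE conditional
    # comment so that only IE <= 8 actually parses the hidden input; every
    # other browser treats the whole thing as an HTML comment and skips it.
    def utf8_enforcer_for_old_ie
      ('<!--[if lte IE 8]>' \
       '<input type="hidden" name="utf8" value="&#x2713;" />' \
       '<![endif]-->').html_safe
    end

The trade-off is that the backend could then no longer assume the parameter is present on every request, only on those from old IE.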
For simple GET queries like a search, the utf8 param is visible in the URL, so it can be ugly. There are ways round this of course, but as Rails is inserting it automatically, it would make sense if they only did it for the one browser which requires it.
They do. IIRC they used to use a Unicode snowman instead of a checkmark, but it was changed as the snowman wasn't deemed "enterprisey enough" or something.
Maybe when a service/product/framework etc. hits a specific point in its lifecycle (when it's trying to be enterprisey, for example), we can say that it has melted the snowman.
It's partially self-documenting. As mentioned elsewhere, there's a vague implication that putting something else there could mean "don't use UTF-8", whereas if you put the UTF-8 X character that corresponds to the check, you are still asking for UTF-8.
edited to add:
This seems related to a problem various "try to sound like English" programming languages (e.g. Inform) have, where it is easy to assume invalid syntax will be valid because it's valid English.
An X is also a commonly accepted way to select a checkbox though. The only sure way to indicate a checkbox as not selected is to leave it empty. That's the true opposite, and actually does work.
If you specify a form's data as UTF-8, but every character the user types in happens to also occur in Latin-1, IE will disobey what you've set and send it in Latin-1 instead. In many locales, perhaps most, that's going to be a very common occurrence, since Latin-1 covers seven of Abram de Swaan's twelve "supercentral languages" and three of the top ten languages by number of native speakers.
The point of the checkmark, therefore, is to put it in a hidden form field. That way, no matter what the user types, there will still be at least one non-Latin-1 character. That will force IE to use UTF-8, and you can verify this actually happened by checking the value of the form field: if it's not set correctly, then you know there may be trouble.
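A minimal sketch of that server-side check (assuming the Rails-style utf8 parameter; the method name is made up):

    # Did the hidden checkmark survive the round trip intact?
    # If IE fell back to Latin-1 it can't encode U+2713, so the value
    # reportedly arrives mangled (e.g. as the entity "&#10003;") instead.
    def utf8_submission?(params)
      params["utf8"] == "✓"
    end

    # e.g. log or re-decode the request when utf8_submission?(params) is false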
I think this only happens if the user manually switches to Latin-1 encoding. In that case IE will try to use the same encoding when submitting form data. The user might do this if you already have encoding problems and present a mixed Latin-1/UTF-8 page. The snowman hack serves to prevent the corruption from spreading.
> I think this only happens if the user manually switches to Latin-1 encoding.
That's correct. If you send your HTML document with a charset of UTF-8 (in the Content-Type header), then IE will submit forms using UTF-8 even if the user doesn't input any non-ASCII characters. Unless the user changes the encoding, but I have yet to hear a compelling reason why an ordinary user would do that under ordinary circumstances.
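Declaring that up front is just a response header; a bare-bones Rack-style sketch:

    # Response that states the page encoding explicitly, so browsers
    # (IE included) default to UTF-8 for form submissions from this page.
    html = "<!DOCTYPE html><html><body><form>...</form></body></html>"
    [200, { "Content-Type" => "text/html; charset=utf-8" }, [html]]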
> The snowman hack serves to prevent the corruption from spreading.
It's clever, but the framework could also just reject POST and GET requests which contain invalid UTF-8. (I'm flabbergasted that Ruby doesn't do this[1].) Otherwise a malicious user could try to inject non-UTF-8 bytes into your database by sending crafted requests which nevertheless contain the "utf8=✓" parameter. And speaking from experience, you do not want to have to deal with encoding problems in your database.
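A rough sketch of such a guard (plain Ruby; the method name is made up and nested params aren't handled):

    # Reject any request parameter that is not valid UTF-8, instead of
    # letting mangled bytes reach the database.
    def valid_utf8_params?(params)
      params.values.all? do |value|
        !value.is_a?(String) ||
          value.dup.force_encoding(Encoding::UTF_8).valid_encoding?
      end
    end

    # In a controller, something like:
    #   head :bad_request unless valid_utf8_params?(params)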
Whether or not you use this hack, you can't naively trust the client to always send valid UTF-8; you're right about this. But because of this bug in IE, rejecting posts with invalid UTF-8 as malicious will net you some false-positive cases, where the user isn't malicious but the browser is being stupid. This hack takes care of the stupidity, leading to a better user experience for people who would otherwise have tripped the false-positive.
What if a sequence of byte values is valid both in the charset that IE uses to encode the form data and in UTF-8, but is interpreted as different characters in UTF-8? With your method you would not detect an error and would end up with the wrong characters.
(Unless IE sends a Content-Type header with the actual encoding used, and this header is evaluated on the server side to convert the form data into a string. But in that case you don't have to check for invalid UTF-8, but for characters that are invalid in the charset specified in the Content-Type header.)
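To make the first case concrete: the two Latin-1 characters "Ã©" are byte-for-byte identical to the UTF-8 encoding of "é" (a Ruby sketch of the undetectable case):

    latin1 = "Ã©".encode("ISO-8859-1")    # two Latin-1 characters: bytes 0xC3 0xA9
    latin1.bytes                           # => [195, 169]
    # The same bytes also happen to be valid UTF-8 -- but for a different character:
    latin1.dup.force_encoding("UTF-8")     # => "é"  (valid, silently wrong)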
Yes, that took me a moment too, but Latin-1 is an 8-bit ASCII, and UTF-8 only encompasses 7-bit ASCII. Note I have to say an 8-bit ASCII, because there are numerous 8-bit ASCII encodings.
What? This is wrong. UTF-8 encodes a lot more than just ASCII.
UTF-8 is compatible with ASCII in that all of the characters ASCII and Unicode have in common are represented the same way in ASCII and UTF-8. Going beyond ASCII involves the introduction of multi-byte representations in UTF-8, and that takes you smoothly (that is, no surrogate pairs) out into the entire rest of Unicode. As a bonus, it's always possible to verify that a given string of bytes is valid UTF-8, given that there is a nontrivial structure imposed on UTF-8 multi-byte encodings that is very unlikely to occur by chance in any non-UTF-8 sequence of bytes.
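For example, that structural check can be done mechanically (a Ruby sketch):

    # A 0xC2 lead byte must be followed by a continuation byte (0x80-0xBF):
    "\xC2\xA2".force_encoding("UTF-8").valid_encoding?   # => true
    "\xC2\x41".force_encoding("UTF-8").valid_encoding?   # => false (continuation byte missing)
    "\xA2".force_encoding("UTF-8").valid_encoding?       # => false (stray continuation byte)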
I think the point was that anything above 7-bit ASCII will be represented differently in Latin-1 vs UTF-8; i.e. ¢ (U+00A2) is encoded as 0xA2 in Latin-1 and as the bytes 0xC2 0xA2 in UTF-8 - and the bytes 0xC2 0xA2 read as Latin-1 will be displayed as Â¢.
It gets far worse with 3-byte UTF-8 characters, but I don't believe any of them exist natively in Latin-1 (see: the euro symbol)
Assuming I'm reading these various character tables right, at least ;)
So a more accurate version of what you quoted would be "UTF-8 and Latin-1 only overlap for 7-bit ASCII"
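For what it's worth, those byte values can be sanity-checked directly (Ruby sketch):

    cent = "¢"
    cent.encode("ISO-8859-1").bytes     # => [162]        (0xA2 in Latin-1)
    cent.encode("UTF-8").bytes          # => [194, 162]   (0xC2 0xA2 in UTF-8)
    # And misreading the UTF-8 bytes as Latin-1 gives the classic mojibake:
    "\xC2\xA2".force_encoding("ISO-8859-1").encode("UTF-8")   # => "Â¢"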
That doesn't really make sense. If someone's intending to encode ¢ in utf-8, endianness does not come into play, as it's a stream of octets, not of anything larger that you can chunk such that you could swap bytes.
At any rate, if you were to encode ¢ in UTF-16BE, it would be 0x00a2, not 0xc2a2. If a piece of software then misinterpreted it as latin1, likely you'd get nothing at all due to the embedded NUL.
Indeed. I either completely misread the parent post, or else it said something different when I responded to it (knowing myself, I'm going with the former).
If the point of the field is just to make IE work correctly, wouldn't it be more appropriate to leave utf8 out of the name and write something like "ie=💩"?
If you want to check whether a browser supports a feature, it's preferable to check whether the browser supports the feature instead of whether the browser is a browser that supports the feature. The latter can cause problems when dealing with an unexpected browser.
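In this case the feature check is essentially free, since the submitted form itself carries the evidence (a sketch contrasting the two approaches; utf8 is the Rails-style parameter, the rest is made up):

    # Feature check: did this particular request actually arrive as UTF-8?
    utf8_ok = params["utf8"] == "✓"

    # Browser check (fragile): guess from the User-Agent string and hope the
    # list of "browsers that need the hack" never changes.
    old_ie = request.user_agent.to_s =~ /MSIE [1-8]\./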