These two are great examples of why I think "overloading" syntax symbols sucks. The problem is that nearly every language does it, and in many cases it makes some expressions ambiguous (not to the parsers, since they have those corner cases defined, but ambiguous to the programmers):
- in JavaScript braces are used both as block delimiters and object literal delimiters
- in CoffeeScript, which avoids that particular ambiguity of JS, whitespace is used for many different things: scope/block delimiting, object literals, even function calls (optional parentheses)
- Python uses parentheses for expression grouping (precedence), function calls, AND tuples
- square brackets are used in many languages to both denote literal lists/arrays and element access ([] operator)
- etc, etc, etc
There are very few languages I know of that don't do this kind of lexical-token overloading. Smalltalk is one example: square brackets are always blocks, parentheses are only used to group expressions, dots are always statement terminators, etc. I think Haskell is another example of a straightforward syntax, but don't quote me on that.
Is there any formal name for these kinds of non-overloaded/simple syntaxes? If there is a formalism for those, why is it that "overloaded" syntaxes are preferred most of the time, or why are they so prevalent?
Note that "overloaded" operators do not mean the grammar is context-sensitive (let alone ambiguous). For instance, the parens as grouping and tuple operators in Python are indeed ambiguous (you have to parse up to the first comma or the closing paren to know which is which), but they're not ambiguous with funcalls: a function call's opening paren is an infix, so it can't be the start of a new expression. Same with square brackets for both array literals and element access: one is a prefix, the other an infix. Similarly, there is no ambiguity in e.g. the `+` or `-` operator being both infix (addition and subtraction) and prefix (positive and negate)
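A quick sketch of that prefix/infix distinction in plain Python (the variable names are just illustrative):

```python
xs = [1, 2, 3]        # '[' in prefix position: a list literal
head = xs[0]          # '[' in infix position (after an expression): element access
neg = -xs[1]          # prefix '-': negation
diff = xs[2] - xs[1]  # infix '-': subtraction
```

In each case the parser knows which rule applies from whether the token follows a complete expression, so no lookahead or extra context is needed.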
Totally right. I was aware that this "overloading" does not imply ambiguity for the parser; it's just that I find it confusing to read in some cases (like the tuples in Python, or the {} in JS), and I think it's curious that language designers deliberately choose to use the same symbol for totally different things when they could use something else, or keep the grammar simpler.
I guess it's always a trade-off between simplicity at the grammar level and ease of use, but I'm not so sure about that either. Smalltalk has a very simple grammar, yet that doesn't make it harder to use than other languages with similar semantics; e.g. Ruby, which prefers to be "programmer friendly" by having multiple syntactic forms for denoting blocks, even a special syntax for calling a function whose last parameter is a block, and much, much more.
> I think it's curious that language designers deliberately choose to use the same symbol for totally different things when they could use something else, or keep the grammar simpler.
The problem is that there's only a limited number of symbols available in ASCII, especially "matching"/"paired" symbols, where you've got all of three pairs to work with (`{}`, `[]` and `()`), with one pair carrying a lot of immutable baggage (`()`) and one being co-opted by C's inheritance (`{}`). (I'm not counting `<`/`>`, as they have even more historical baggage as comparison operators.)
Now of course we could use non-paired symbols even for paired situations, but I guess these symbol pairs look... right? Especially for situations where we're defining a "section" or "grouping" rather than a coherent block (such as a string)
If you express your grammar in a Pratt parser (http://en.wikipedia.org/wiki/Pratt_parser), which associates parsing actions with tokens, the amount of "token overloading" in the syntax becomes obvious.
As for why tokens are overloaded: you just run out of good tokens if you don't reuse them. Consider the tokens '(' and '['. It's quite common to use '(' in the prefix context for grouping, and '(' in the infix context to mean a function call. How do you eliminate the overloading? You can make function calls use whitespace as an infix operator, but that creates a host of other problems. Overloading is also useful for creating parallelism in the syntax: you might use '[' in the prefix context to signify literal arrays and '[' as an infix operator to signify array dereferencing. In that situation, the overloading is synergistic.
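To make that concrete, here's a minimal Pratt-parser sketch in Python (a toy of my own, not any particular library): the token '(' is registered with both a prefix action (grouping) and an infix action (function call), while '+' gets only an infix one. The tiny grammar, token names, and binding powers are assumptions for illustration.

```python
import re

# Toy tokenizer: numbers, names, and single-character symbols.
TOKEN = re.compile(r"\s*(?:(\d+)|(\w+)|(.))")

def tokenize(src):
    for num, name, sym in TOKEN.findall(src):
        yield ("num", int(num)) if num else ("name", name) if name else ("sym", sym)
    yield ("end", None)

class Parser:
    # Left binding powers: only '+' and the *infix* '(' appear here.
    INFIX_BP = {"+": 10, "(": 20}

    def __init__(self, src):
        self.tokens = tokenize(src)
        self.tok = next(self.tokens)

    def advance(self):
        self.tok = next(self.tokens)

    def expect(self, sym):
        assert self.tok == ("sym", sym), f"expected {sym!r}, got {self.tok}"
        self.advance()

    def parse(self, rbp=0):
        kind, value = self.tok
        self.advance()
        # Prefix ("null denotation") actions, keyed by token.
        if kind in ("num", "name"):
            left = (kind, value)
        elif (kind, value) == ("sym", "("):   # prefix '(': grouping
            left = self.parse(0)
            self.expect(")")
        else:
            raise SyntaxError(f"unexpected token {value!r}")
        # Infix ("left denotation") actions, keyed by the same tokens.
        while self.tok[0] == "sym" and self.INFIX_BP.get(self.tok[1], 0) > rbp:
            op = self.tok[1]
            self.advance()
            if op == "+":                     # infix '+': addition
                left = ("add", left, self.parse(self.INFIX_BP["+"]))
            else:                             # infix '(': function call
                args = []
                if self.tok != ("sym", ")"):
                    args.append(self.parse(0))
                    while self.tok == ("sym", ","):
                        self.advance()
                        args.append(self.parse(0))
                self.expect(")")
                left = ("call", left, args)
        return left

print(Parser("f(1, 2) + (3 + 4)").parse())
# ('add', ('call', ('name', 'f'), [('num', 1), ('num', 2)]),
#         ('add', ('num', 3), ('num', 4)))
```

Both roles of '(' are resolved purely by position: in the prefix table it means grouping, in the infix table it means a call.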
> Is there any formal name for these kinds of non-overloaded/simple syntaxes? If there is a formalism for those, why is it that "overloaded" syntaxes are preferred most of the time, or why are they so prevalent?
Someone might be able to say with more certainty, but I believe this is covered by the formalism of "Context Free Grammars" versus "Context Sensitive Grammars"[1]. That is, a language is context free if you don't need information from elsewhere in the code to determine the meaning of a symbol.
There's a good discussion of what makes C context sensitive on Eli Bendersky's blog[2].
Edit:
Thinking about this some more, I'm realizing that this has little to do with the overloading of symbols like block delimiters, since these can be tokenized and parsed by a context free grammar fairly easily in an "overloaded" fashion.
Maybe what you're describing is related to the concept of "purely functional," or even some more vague notion of purity generally...
I don't think "context free" is the concept I was referring to. AFAIK, a grammar can be context free and still have these kinds of "syntactic overload" annoyances. E.g. in Python this expression will always be parsed the same no matter its context: "(foo(bar, baz) + (a,))"; it bothers me, however, that each of those three pairs of parentheses has a different syntactic meaning (one is for grouping, one is for a function call, and the last one is for a tuple). Not ambiguous for the parser, but quite inconvenient for my brain's parser, which has to distinguish between three possible meanings of the same syntactic symbol.
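One way to see that Python itself resolves this unambiguously is to ask its parser; note that the grouping parens don't even leave a trace in the AST (output abbreviated here):

```python
import ast

# The outer grouping parens disappear; the call and the tuple survive
# as distinct node types.
print(ast.dump(ast.parse("(foo(bar, baz) + (a,))", mode="eval").body))
# BinOp(left=Call(func=Name(id='foo'), args=[Name(id='bar'), Name(id='baz')], ...),
#       op=Add(),
#       right=Tuple(elts=[Name(id='a')], ...))
```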
Such an unambiguous grammar would only have as many state transitions in its FSM as characters in the grammar, which would limit the language considerably, so you would have to raise the number of characters.
Also, `(` means something different inside a string, so in that sense every programming language with strings is hard to parse locally.
Also note that ambiguity formally means that more than one parse tree can represent the same string (e.g. with the grammar E → E - E | num, the string 1 - 2 - 3 has two parse trees).