Parsers are one of those fields I wish I had the time to really study deeply. I'm especially impressed with LALRPOP.
Cribbed from a previous comment of mine:
From time to time I find myself revisiting this thread: "Writing parsers like it is 2017"[0].
In particular I love a Rust parser generator called LALRPOP and its emphasis on diagnosing ambiguous grammars [1].
> What I’ve tried to do now in LALRPOP is to do that clever thinking for you, and instead present the error message in terms of your grammar. Perhaps even more importantly, I’ve also tried to identify common beginner problems and suggest solutions.
They work out a fairly deep example with error guidance in ambiguous grammars in that post.
Cool idea. The way it handles alternatives looks awkward since each alternative is a field in a struct. You have to check each field to find the one that's not nil.
(See Term in the example.)
Normally this would be done with an interface, like in go/ast
Although, on second thought, this means writing a chain of if-else statements instead of a type switch in a visitor. Maybe it's not that big a difference?
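For what it's worth, the go/ast-style alternative can be sketched like this (the type and method names here are made up for illustration, not taken from any real library): a marker interface plus one concrete type per alternative, so consumers write a type switch instead of checking which struct field is non-nil.

```go
package main

import "fmt"

// Term is a marker interface; each grammar alternative gets its own
// concrete type, in the style of go/ast's Node/Expr interfaces.
type Term interface{ isTerm() }

type NumLit struct{ Value int }  // a number literal alternative
type Var struct{ Name string }   // a variable reference alternative

func (NumLit) isTerm() {}
func (Var) isTerm()    {}

// describe dispatches on the concrete type with a type switch,
// rather than testing struct fields for nil.
func describe(t Term) string {
	switch n := t.(type) {
	case NumLit:
		return fmt.Sprintf("number %d", n.Value)
	case Var:
		return fmt.Sprintf("variable %s", n.Name)
	default:
		return "unknown"
	}
}

func main() {
	fmt.Println(describe(NumLit{Value: 42}))
	fmt.Println(describe(Var{Name: "x"}))
}
```

Whether the type switch is really better than a chain of nil checks is, as the parent says, debatable; the main win is that the compiler knows the set of alternatives.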
I've used this before - really nice library and an easy way of getting started without having to write so much custom code. The only downside is that, of course, a hand-written parser is still faster :)
This looks really cool. I spent a few weeks last year trying to build my first programming language, but I kept getting bogged down by parsing and eventually gave up.
You should check out Thorsten Ball's interpreterbook.com and compilerbook.com. I went through the former - it's amazing. I'm eagerly awaiting the release of the latter.
It walks through the creation of an entire interpreter from scratch using just core Go packages, including test coverage.
Parsing isn't actually that hard once you wrap your head around the problem. There are lots of good guides for writing parsers online which really helped me out when I was feeling the same burden as you are.
So what I'm trying to say is stick with it and I'm sure you'll crack it. :)
Definitely. I used to always get stuck on the parsing part, and it definitely is time consuming to get right, but once you get past it once it is much easier to tackle in the future.
Then you realize that parsing is only a small piece of making your own language!
The Eureka moment for me was learning that a parser isn't a single entity. It's reading tokens, making an abstract syntax tree, etc. Once I learned how to break the problem down I found writing parsers very easy.
As for it only being a small part of making your own language, that is so true too. However I do think it's a rewarding project to tackle.
The ability to parse is one of the things that differentiates "power programmers" from people just hooking things together. If you wish to be one of the former, it is on the Must Study list.
It is true that many, perhaps even most, programmers can go their entire career without having to write their own code to take a text file and view it as a tree of tokens. This is even more true than ever with so many off-the-shelf serialization options, meaning that creating your own new one had better have a really, really darned good justification.

But if you conceptualize a parser as "a thing that receives a stream of low-level, primitive inputs and needs to extract sensible state information and know when constraints have been violated"... ah, now that sounds pretty useful, right? And very modern, as we continue our inevitable march towards viewing the world in terms of event streams rather than blobs of completed input that we can take in all at once. The generalized parsing problem of taking a stream of primitives and extracting higher-level meaning from them is as relevant as ever, if not more so.
How do you write a parser in a language like Go? I stumbled upon parser combinators and tried to shoehorn them into Go, but the lack of generics mostly prohibited this. I also tried imperatively writing the parser, but this was tedious and error prone. I also explored parser generators, but these seemed to have huge learning curves and still produced awful code. Beyond that, it’s hard to tell what kind of parser I need because of all of the academic jargon. :/
I would love to be pointed in the right direction!
The first thing you need to do is scan your file for tokens (e.g. "{", "}", "[", "]", ",", "\"", etc. in the case of JSON).
Then you traverse your tokens and turn them into an AST (Abstract Syntax Tree), which is essentially just a "struct" version of the code (in the dumbest description imaginable). The point of this is that you start adding definitions to your tokens. E.g. what does a "[" mean?
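The scanning step can be sketched in a few lines of Go (this is an illustrative toy, not a real JSON scanner): it turns a JSON-ish input into a flat list of tokens, grouping digit runs into number tokens and skipping whitespace. Real JSON scanning would also need strings, floats, and true/false/null.

```go
package main

import "fmt"

// scan splits input into single-character punctuation tokens and
// multi-character number tokens, discarding whitespace.
func scan(input string) []string {
	var tokens []string
	i := 0
	for i < len(input) {
		c := input[i]
		switch {
		case c == ' ' || c == '\t' || c == '\n':
			i++ // skip whitespace
		case c == '{' || c == '}' || c == '[' || c == ']' || c == ',' || c == ':':
			tokens = append(tokens, string(c)) // single-character token
			i++
		case c >= '0' && c <= '9':
			j := i
			for j < len(input) && input[j] >= '0' && input[j] <= '9' {
				j++ // consume the whole digit run as one token
			}
			tokens = append(tokens, input[i:j])
			i = j
		default:
			i++ // this sketch silently ignores anything else
		}
	}
	return tokens
}

func main() {
	fmt.Println(scan("[10, 2]"))
}
```

A parsing pass would then walk this token slice and build the AST structs described above.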
From there you can start writing your real business logic since you now have a Go struct version of your code which you can traverse.
However, I would really recommend googling how to write your own compiler. There are dozens of good resources out there (I can't recall the one that I used off hand) and they'll explain this process a thousand times better than I have above.
One last thing to note, the theory of writing a parser is pretty language agnostic so don't get put off that one guide is in C or another in Java or Python or LISP or whatever. As long as you get the theory nailed you can write the Go code equivalent without too much headache.
Parser combinators don't need generics, as they are merely higher-order recursive descent parsers. I.e. recursive parsers where functions are created via other functions to generalize and compose them. Go is expressive enough to do that.
For example, let's say we start with a function that parses a single character, such as '[':

    CHAR := func(c byte) func() bool {
        return func() bool {
            if pos < len(buf) && buf[pos] == c {
                pos++
                return true
            }
            return false
        }
    }

(Here buf and pos are a shared input buffer and cursor; a real implementation would carry them in a context structure.)

Now let's do the same for parsing a sequence. We combine other parser functions and return one that runs them one by one.

    SEQ := func(args ...func() bool) func() bool {
        return func() bool {
            for _, f := range args {
                if !f() {
                    return false
                }
            }
            return true
        }
    }
    parser := SEQ(
        CHAR('['),
        SEQ(
            CHAR('1'),
            CHAR('2'),
            CHAR('3'),
        ),
        CHAR(']'),
    )
This is the idea.
This is not a real world example though. You'll need more primitives, plus a wrapper that handles various things: restoring the buffer index on failure, captures that save the starting index, moving buf/pos into a context structure, a virtual stack to store captures/errors, etc.
But it's important to get there on your own. And it's pretty much as fast and as flexible as completely hand-written recursive descent parsers and as powerful as parser generators.
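As one concrete example of the "restore buffer index on failure" wrapper, here is a sketch of an ordered-choice combinator (call it ALT; the name and the package-level buf/pos are my own illustrative choices, reusing the CHAR/SEQ primitives from above):

```go
package main

import "fmt"

// Shared input and cursor; a real parser would keep these in a
// context struct rather than package-level variables.
var buf string
var pos int

// CHAR returns a parser that matches one byte and advances the cursor.
func CHAR(c byte) func() bool {
	return func() bool {
		if pos < len(buf) && buf[pos] == c {
			pos++
			return true
		}
		return false
	}
}

// SEQ runs its sub-parsers one by one, failing on the first failure.
func SEQ(args ...func() bool) func() bool {
	return func() bool {
		for _, f := range args {
			if !f() {
				return false
			}
		}
		return true
	}
}

// ALT tries each alternative in order, rewinding pos after a failed
// attempt so the next alternative starts from the same position.
func ALT(args ...func() bool) func() bool {
	return func() bool {
		for _, f := range args {
			saved := pos
			if f() {
				return true
			}
			pos = saved // backtrack
		}
		return false
	}
}

func main() {
	buf, pos = "(x)", 0
	parens := ALT(
		SEQ(CHAR('['), CHAR('x'), CHAR(']')),
		SEQ(CHAR('('), CHAR('x'), CHAR(')')),
	)
	// The first alternative fails partway through; ALT rewinds and
	// the second alternative matches the whole input.
	fmt.Println(parens())
}
```

Without the rewind, a partially-matched failed alternative would leave the cursor in the middle of the input and corrupt every later attempt.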
Is that a parser? That looks like a matcher. I actually built something that looks almost exactly like this, but it doesn't get you an AST, right? What am I missing?
To get to an AST you have to create AST nodes. Make a function that captures the output and put it in the places where you want to create AST nodes. Sort of like this:

    SEQ(
        ...
        CAPTURE(
            // parser to capture
            SEQ(...),
            // extra arguments:
            // a name, a function to call
            // that creates an AST node, etc.
        ),
    )
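A minimal concrete version of that CAPTURE idea, reusing the same illustrative buf/pos/CHAR/SEQ primitives (all names here are made up for the sketch): it records the span of input the wrapped parser matched and hands it to a callback, which is where you would construct an AST node.

```go
package main

import "fmt"

// Shared input and cursor, as in the earlier sketch.
var buf string
var pos int

func CHAR(c byte) func() bool {
	return func() bool {
		if pos < len(buf) && buf[pos] == c {
			pos++
			return true
		}
		return false
	}
}

func SEQ(args ...func() bool) func() bool {
	return func() bool {
		for _, f := range args {
			if !f() {
				return false
			}
		}
		return true
	}
}

// CAPTURE remembers where the wrapped parser started, and on success
// passes the matched text to emit, which can build an AST node.
func CAPTURE(p func() bool, emit func(text string)) func() bool {
	return func() bool {
		start := pos
		if !p() {
			return false
		}
		emit(buf[start:pos])
		return true
	}
}

func main() {
	buf, pos = "[123]", 0
	var captured []string
	parser := SEQ(
		CHAR('['),
		CAPTURE(
			SEQ(CHAR('1'), CHAR('2'), CHAR('3')),
			func(text string) { captured = append(captured, text) },
		),
		CHAR(']'),
	)
	fmt.Println(parser(), captured)
}
```

In a real parser the callback would push a node onto a stack (or return it) instead of appending raw strings, but the shape is the same.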
[0]: https://news.ycombinator.com/item?id=15016061
[1]: http://smallcultfollowing.com/babysteps/blog/2016/03/02/nice...