Hacker News
So you want to be a compiler wizard (2016) (belkadan.com)
247 points by goranmoomin on April 12, 2020 | hide | past | favorite | 61 comments



Some other good projects:

* Lambda calculus evaluator

* Hindley Milner typechecker for lambda calculus

* Stack based calculator (extend with variables)

* Play around with macros

* Work through Crafting Interpreters

But really, in my experience the best way to get better at compilers (I can't claim to be a wizard) is to just build a goddamn compiler. Start by writing a parser (you can crib from Crafting Interpreters), then the simplest possible typechecker for arithmetic (crib from Hindley-Milner), then a code generator to whatever target you want. Code generation is tricky, but if you're generating arithmetic, it's really not that bad.
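That parse → typecheck-free → codegen pipeline for arithmetic fits in a few dozen lines. Here's a hypothetical Python sketch (a tiny stack machine stands in for "whatever target you want"; all names are my own, not from any real compiler):

```python
# Sketch: tokenize -> parse to an AST -> generate stack-machine code -> run it.
import re

def tokenize(src):
    return re.findall(r"\d+|[-+*/()]", src)

def parse(tokens):
    # Grammar: expr := term (('+'|'-') term)* ; term := factor (('*'|'/') factor)*
    pos = 0
    def peek():
        return tokens[pos] if pos < len(tokens) else None
    def eat():
        nonlocal pos
        tok = tokens[pos]; pos += 1
        return tok
    def factor():
        if peek() == '(':
            eat(); node = expr(); eat()  # consume '(' expr ')'
            return node
        return ('num', int(eat()))
    def term():
        node = factor()
        while peek() in ('*', '/'):
            op = eat(); node = (op, node, factor())
        return node
    def expr():
        node = term()
        while peek() in ('+', '-'):
            op = eat(); node = (op, node, term())
        return node
    return expr()

def codegen(node, out):
    # Post-order walk: operands first, then the operator instruction.
    if node[0] == 'num':
        out.append(('push', node[1]))
    else:
        codegen(node[1], out); codegen(node[2], out)
        out.append((node[0],))
    return out

def run(code):
    # Tiny stack-machine "VM"; integer division for simplicity.
    stack = []
    ops = {'+': lambda a, b: a + b, '-': lambda a, b: a - b,
           '*': lambda a, b: a * b, '/': lambda a, b: a // b}
    for instr in code:
        if instr[0] == 'push':
            stack.append(instr[1])
        else:
            b, a = stack.pop(), stack.pop()
            stack.append(ops[instr[0]](a, b))
    return stack[0]
```

Swapping `run` for a function that prints real assembly (or LLVM IR) is exactly the "code generator to whatever target you want" step.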

If at any point you get confused or have no clue what to do, take a deep breath and guess! Yep, just guess how to do it. I've done this quite a few times and later learned that my guess was really a dumbed down version of idk, solving the AST typing problem, or lambda lifting, or whatever. If that's too hard, scroll through other compilers and figure out what they do.

Once you get something working, celebrate! It's pretty cool seeing even arithmetic end up as code generated and run.

Then start adding other features! Add local variables. Add nested functions. Add whatever.

The secret is to treat a compiler as a challenging software development project. Because that's what it is. Start writing code and figure it out.

Granted this is not great advice if you want a usable compiler at the end. I'm just trying to learn, not make something for people to use.


> in my experience the best way to get better at compilers (I can't claim to be a wizard) is to just build a goddamn compiler.

Yup. In fact, the best way to start is to build a desk calculator, where you type in strings like:

    4 + 6
and it gives you back 10. There was one in the old K+R book. A compiler just scales that up.
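The one in K&R was, if I recall, a reverse-Polish calculator, so the whole thing is just a loop over tokens and a stack. A hypothetical Python sketch (postfix input, so "4 + 6" is written "4 6 +"):

```python
# Minimal reverse-Polish desk calculator: numbers push, operators pop two
# values, apply, and push the result.
def rpn(line):
    stack = []
    ops = {'+': lambda a, b: a + b,
           '-': lambda a, b: a - b,
           '*': lambda a, b: a * b}
    for tok in line.split():
        if tok in ops:
            b, a = stack.pop(), stack.pop()
            stack.append(ops[tok](a, b))
        else:
            stack.append(int(tok))
    return stack.pop()
```

Scaling that up to infix notation with precedence is where the parsing part of a compiler begins.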


For context, parent wrote a C++ compiler and invented the D programming language.


Exactly. Just start from the beginning and work through each problem as you get to it. Keep the scope small and iterate. You'll probably find yourself wanting to start from scratch a lot once you learn how to do it better; that's OK.

I'm writing up a tutorial for my students right now. It is just a stripped down dialect of BASIC. I think seeing the entire compiler process from start to finish on a tiny language is what helped me the most.


I got my start from the Tiny Pascal compiler listing from BYTE Magazine in the 1970's. It was a complete compiler and interpreter.



Thanks! First time reading BYTE magazine. Kinda half a century late :)

The article is written with amazing (these days) clarity. Kinda coincided with my current interest in LLVM. Nice to see this article also coming from UofI Urbana-Champaign grads.

Reportedly the article series is missing the implementation of one component, the runtime, which is needed in order to fully replicate it now. Well, anyway, reading on!

Btw: HN thread on Tiny Pascal from 2018 https://news.ycombinator.com/item?id=17220507


Thank you. I was too lazy to look it up.


Always happy to lend context to these things; BYTE was so fantastic.


How do I learn to write a type checker? Got any suggested readings?


Type checkers are actually surprisingly easy to write. I recommend learning about unification (in the logic sense), maybe just by playing around with Prolog. After you understand that I think type checking is rather easy to figure out (it's literally just unification).


I think it depends on whether you have type inference or not. If you have type inference, then it's like unification. If you have explicit type annotations, it's even simpler than that.

https://eli.thegreenplace.net/2018/unification/

https://eli.thegreenplace.net/2018/type-inference/
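To make "it's like unification" concrete, here's a toy sketch (my own hypothetical representation: type variables are strings, constructed types are tuples like `('fun', arg, ret)`; it omits the occurs check a real inferencer needs):

```python
# Toy unification for type inference. A substitution maps variable names
# to type terms; unify extends it or returns None on a mismatch.
def walk(t, subst):
    # Follow variable bindings until we hit an unbound variable or a term.
    while isinstance(t, str) and t in subst:
        t = subst[t]
    return t

def unify(t1, t2, subst):
    t1, t2 = walk(t1, subst), walk(t2, subst)
    if t1 == t2:
        return subst
    if isinstance(t1, str):                  # t1 is an unbound type variable
        return {**subst, t1: t2}
    if isinstance(t2, str):
        return {**subst, t2: t1}
    if t1[0] == t2[0] and len(t1) == len(t2):
        for a, b in zip(t1[1:], t2[1:]):     # unify arguments pairwise
            subst = unify(a, b, subst)
            if subst is None:
                return None
        return subst
    return None                              # constructor clash: type error
```

Type checking a program then amounts to generating equations between type terms and unifying them all.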


And if you want nominal subtyping, F-bounded generics, and some inference, things get really fun.


Nominal subtyping is trivial and unsafe. The fun starts with structural subtyping.


Nominal subtyping is far from trivial, and calling it unsafe is nonsensical. Structural subtyping yields poor error messages and does not support encapsulation very well, especially if you also want separate compilation.


Imperative languages without type inference (for variables and functions, that is) work from the bottom up. Symbols (variables, functions) have declared types, constants have known types. Operators have overloads, and depending on the arguments a particular overload is chosen which best matches, and its return type is the type of the operator expression. Type conversions may need to occur along the way to make the arguments to operators or functions fit.

The hairiest thing is usually overload resolution (both operator and function, if implemented). Determining the set of applicable overloads may depend on the types of the arguments, scoping rules, or generic instantiation if it's in the language, and determining the best overload depends on weighing up coercions, subtyping, and polymorphism. Simple rules may lead to overload selection being rejected as ambiguous in situations where programmers don't want to define more overloads.
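A toy sketch of that bottom-up scheme (a hypothetical mini-language of my own with `int`/`str` types and an overloaded `+`; real overload resolution is far hairier, as noted above):

```python
# Bottom-up type checking: each expression's type is computed from its
# children's types; operators pick an "overload" based on operand types.
def check(expr, env):
    kind = expr[0]
    if kind == 'lit':                        # ('lit', 3) or ('lit', "hi")
        return 'int' if isinstance(expr[1], int) else 'str'
    if kind == 'var':                        # type comes from the declaration
        return env[expr[1]]
    if kind == 'add':
        lt, rt = check(expr[1], env), check(expr[2], env)
        if lt == rt == 'int':
            return 'int'                     # integer addition overload
        if lt == rt == 'str':
            return 'str'                     # string concatenation overload
        raise TypeError(f"no overload of + for {lt} and {rt}")
    raise ValueError(f"unknown expression kind: {kind}")
```

No unification needed: with declared variable types and known literal types, every node's type falls out of a single post-order traversal.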


I have a simple implementation of the Hindley Milner Type Inference algorithm in OCaml that should be easy to grok: https://github.com/prakhar1989/type-inference


I fully agree. I wrote my Elements compilers this way: slowly, from a tokenizer and parser up to a .NET backend and a Java backend. Bit by bit. The only way to learn is by doing.

I'd even go as far as saying the literature on the subject is overrated or out of date.

And not everything is set in stone. Not all compilers have a single symbol table. Sometimes it makes sense to have a type/global table and a local table, and check the current class scope on the fly. Sometimes not. It doesn't really matter.

Abstraction doesn't make sense until you actually have multiple implementations. And you might have rewritten it four times by then.

My advice is like the above: just start writing.


As an alternative to Hindley-Milner, also consider a compositional type system (https://unsafePerform.IO/projects/talks/2016-06-compty/CompT...); it should be a very good match for the kind of simple HM type systems that you're going to encounter while learning about compilers.

Here's code if you prefer that to slides: https://github.com/gergoerdi/hm-compo


I second your recommendation of Crafting Interpreters. Not only is it free and an excellent way to learn tokenization, parsing, and tree walking, it's also a lot of fun to read.


Here's a valuable tip for making a compiler/interpreter: don't use a separate lexer and parser like lex and yacc. Parser combinators (also known as parser monads) are so much easier to use. In fact, if you already have decided upon your language's grammar, you can probably write yourself a parser and lexer with a parser combinator inside a day.


Yes! I find that writing a parser by hand (using parser combinator) is the most straightforward way (once you know how to write one).

It can also support variable-length lookahead, which is really nice.


This is going to be language-specific: parser combinators are much more pleasant in Haskell than in C++ for example.

Handwritten recursive descent has a lot going for it, error messages in particular.


There is a pretty nice parser combinator library in the capnproto compiler: https://github.com/capnproto/capnproto/tree/master/c%2B%2B/s.... Here is an example of how it is used: https://github.com/capnproto/capnproto/blob/950d23a13f1d602d...


They're not too bad in C++; I'm writing one.


How much easier are they? I did lex and yacc for JavaScript and it was easy (except that parsing JS's // regex literal format in lex needed a little hack).


The easiest. Much much easier than lex and yacc.


What is the class of grammar they can handle? (I mean in the sense of LALR(k), LL(k), etc.)

And is there a large (e.g. exponential) performance penalty for large lookahead?


There's still "lookahead" in a hand-written parser; the lookahead is decided case by case.

A performance penalty only comes in if we backtrack in our hand-written parsers, and I don't think we usually do that.

A naively backtracking top-down parser would simply be too slow.


Agreed, and this goes even smoother when one realizes the isomorphism between (naive) parser combinators and parsing expression grammars.


The author seems to be misinformed about the relative value of compiler front-end versus compiler back-end.

Front-end is not where the challenges are in any production compiler. The hard work is in the optimizer. Lexical and syntactic analysis are homework problems for sophomores. That is not where any of the value in a production-worthy compiler lives, any more than the value of your C++ code is found in the semicolons at the ends of the statements.


Judging by the author's advice, he's referring to a toy compiler, not a production-ready compiler.

Language architecture and writing a compiler are very difficult tasks if the intent is to come up with a production-ready compiler.

I suppose that if you want to do a toy compiler for a very small and limited language, there are simpler ways than getting into lexers and parsers. You can string-replace with assembly-language or C instructions and functions, and compile to C or assembler.


"Front-end is not where the challenges are in any production compiler."

It's an extremely complex problem in itself, because most frontends strive to support autocompletion of broken/half-finished code, with good performance and type hints.

How many are really happy with their IDEs? I remember the Borland Pascal IDE and Visual Studio 6 being shining examples, and after that... I don't really know what happened. I tried a Scala project with VSCode; it takes around a minute to pick up my variable changes at 100% CPU, and it's even worse with SwiftUI and Xcode. And the computers back then had what? 600 MHz?

I'd say, move some C++ graybeards to the frontend team to fix this mess.


The large focus on parsing and grammar theory in traditional CS education is part of this, despite its extremely limited usefulness in any production compiler.

There's plenty of interesting work to do in the frontend too but a lot of it is past the parsing stage.


The author worked on the Swift compiler frontend.


WRT the outline for someone wanting to self-study in this area, what are your thoughts?


By "this area", are you referring to front-end, or back-end?

I am going to guess back-end, because there are a lot of self-study tutorials for front-end that you probably have already found. Unfortunately, I am not an expert in back-end, although at one point I did manage a group that was involved in compiler validation -- but I was pointy-haired, it was my team that knew the innards of the compiler.

Advanced texts in compiler construction are going to get into data flow analysis and liveness analysis, and talk about the basics of code motion. These are all elementary topics that barely touch on the state of the art, but they are foundational. Also, get good at reading the assembly language for the machine of your choice, and look at the .S files.

Sorry I can't be of more help, but maybe I gave you some search terms.


I was speaking generally in terms of self learning in compilers. Your criticism was that the article focused too heavily on the front end and that the magic or the needed focus is on back end issues (scheduling, selection).

I think there are a couple things in play here. Folks working with text, semi-structured data, synthesizing from disparate sources, etc. will be front-end heavy. Tokenization and lexing are important well beyond compilers, e.g. for loading binary formats from network or disk into memory.

For backend work, being able to extend or modify existing backends is important for languages targeting different runtimes (Spark, Beam, Impala). This can be in targeting new architectures or for predicate pushdown into data processing pipelines. Lots of different applications to use those skills.

Compilers and Database systems are an incredible microcosm of many areas of CS.

Areas of self-study I think are nice:

MAL - Make a Lisp https://github.com/kanaka/mal

Nand2Tetris, project 11 https://www.nand2tetris.org/project11 (one should start from zero and make one's way here; it is a journey, not a destination)

An educational software system of a tiny self-compiling C compiler, a tiny self-executing RISC-V emulator, and a tiny self-hosting RISC-V hypervisor. http://selfie.cs.uni-salzburg.at

LLVM is a huge system; libFirm is a much smaller, simpler system that includes a C front end. From their site:

> libFirm is a C library that provides a graph-based intermediate representation, optimizations, and assembly code generation suitable for use in compilers.

https://pp.ipd.kit.edu/firm/


I also recommend checking out the Make a Lisp project: https://github.com/kanaka/mal

You can see the source code for a small Lisp interpreter in 81 different languages.


In addition to the suggestions in the post, I would strongly recommend https://www.craftinginterpreters.com/

It’s a very clear introduction to creating a language and building a parser, interpreter, compiler, and VM for it. The book uses Java and C, but you can use pretty much any language you want (e.g., I used Swift).


Also recommended:

Peter Norvig's lisp interpreter series (Python):

https://norvig.com/lispy.html

https://norvig.com/lispy2.html

Thorsten Ball's interpreter and compiler book (Go):

https://interpreterbook.com/


The article is missing a clear path to becoming a compiler "wizard". Abstract suggestions on things to learn don't turn you into a compiler wizard. Also, the bit at the end about open source developer types felt like a generalized and incoherent rant.


I don't think there is a clear path, or a collection of clear paths. If you're interested, read stuff and try stuff. With hard work and enough luck you might be a "wizard" some day. It would be silly to make promises.

For whatever it's worth, I have a PhD in compilers, and I work as a compiler developer. I have done nothing listed in the article except learn about regular expressions.

Also, I disagree with your disagreement with the last part of the article.


I actually want to pursue a career in engineering compilers. There's a company I'm planning on applying to next year; at the moment I'm studying for a part-time MSc in computer science while working full-time in operations for a development company.

What resources have you found helpful for your Ph.D.?

In your opinion which journals/articles are classic/must be read in this field?

Are there any topics (or books covering those topics) that are a "must know"?


This is a very broad question, since compilation is a very broad field. You should get a broad overview of the whole area through one of the standard compiler textbooks. I think the only one I have read cover to cover was the Dragon Book. It was one of the older editions that are no longer up to date; I'm sure the newer ones are fine. I'm also hearing very good things about Appel's "Modern Compiler Implementation". There are others as well. Read one, and follow the literature references for whatever areas interest you most.

If there is one "cross-cutting concern" in compilers, it's the importance of program representations. The correct representation will allow you to do things that you wouldn't be able to do otherwise, or only at much higher cost. So some more concrete things to look into are SSA form and the Sea of Nodes representation (for the latter, Click: "Global Code Motion/Global Value Numbering", https://courses.cs.washington.edu/courses/cse501/04wi/papers...). Some general graph algorithm stuff (depth-first search, cycle detection, dominance) is useful.

One surprisingly commonplace, simple, and very useful thing to know is the Union-Find data structure (https://en.wikipedia.org/wiki/Disjoint-set_data_structure). I've used it in various settings in compilation. Once I was doing something in GCC and needed a union-find implementation; poking around, I found three or four of them. None were reusable, so I added one more at the place where I needed it :-(
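For reference, a minimal union-find sketch (path compression only, no union by rank; enough to make finds near-constant-time in practice):

```python
# Disjoint-set (union-find): merge equivalence classes and query which
# class an element belongs to. Useful in compilers for e.g. unification
# and register-coalescing style problems.
class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        # Lazily register unseen elements as their own root.
        root = self.parent.setdefault(x, x)
        while root != self.parent[root]:
            root = self.parent[root]
        # Path compression: point everything on the path straight at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)
```

The whole structure is about twenty lines, which makes the lack of a reusable one in a large codebase all the more frustrating.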

As for journals/articles, much of the seminal compiler work appeared in the proceedings of PLDI, but that doesn't mean that it makes sense to methodically go through 30-odd years of historical papers. ACM Computing Surveys (https://dl.acm.org/journal/csur) can be quite good, if they are not too old. If you are looking at a specific area and see a reference to a survey paper on that area, definitely follow it. But all this doesn't mean that you should only focus on certain conferences or journals. If you want to dig PhD-level deep, it's very much about following references in general.

Good luck! Let me know if you have further questions.


What about something more specific for production-grade compilers?

With those references, the majority of people will learn to build toy compilers, which are surely important but in a completely different league than production-grade compilers, e.g. LLVM.

Talking specifically about LLVM, does someone have their go-to references to start and have a sense of the infrastructure, or even some specific reading about a part of the (huge) infrastructure?


LLVM has a pretty decent walk-through/tutorial for building an LLVM language frontend. I know it's in the article but this is definitely my go to for helping people learn about using LLVM.

https://llvm.org/docs/tutorial/MyFirstLanguageFrontend/index...

Not by any means a tutorial, but Rust has a guide for understanding their LLVM-based compiler. It has some useful insights into what actually makes up a real "production grade" compiler outside the stuff in the LLVM tutorial.

https://rustc-dev-guide.rust-lang.org/


Thank you for the references - I'll take a look.

Awesome job by Rust to keep something like that (hopefully updated, too), since the lack of up-to-date information is usually the biggest barrier to entry for contributing to/working on stuff like that. I struggle to find something similarly in-depth for Clang, but I guess the bigger complexity makes it more difficult.


I've found that the best way to learn LLVM IR is to write C code and compile it to LLVM IR to see how Clang does it.

This makes the learning process self-serve.

Ref: https://github.com/tanin47/lilit/blob/master/playground/READ...


In which category is Python's byte code compiler? Toy or production-grade?


I can second the 'make a calculator' as a place to get started with this, that's a solid project to get the basics of flexing -> parsing -> evaluation. Helped me learn a ton when I first wanted to approach this problem.

On a side note, I'm pretty annoyed by the prevalence of articles titled 'So you want to be a <insert something here>'. It's so common now and just doesn't say much about what is actually going to be in the article.


S/flexing/lexing ? Nice typo :)


Presumably a pun on `flex`, the Gnu implementation of `lex`.


I would love to be this clever


Excerpt:

"Write snprintf in C. For those who haven’t used C before, snprintf is a function that produces formatted output based on an input string and a variable number of arguments. Doing it in C forces you to deal with constraints you may not have had to deal with in a higher-level language. Don’t skip out on writing unit tests! (And don’t bother with floating-point numbers; just handle %d and %s.)

    const size_t bufferLength = 128;
    char buffer[bufferLength];
    snprintf(buffer, bufferLength, "%s %d %s", "first", 2, "last");
    assert(0 == strcmp(buffer, "first 2 last"));

Write snprintf in assembly…for the exact same reason. Pretty much no one programs in assembly any more, and that’s generally a good thing, but this will (a) force you to learn a new and very suboptimal language, (b) get you to learn a little about your CPU, and (c) help you later on if you ever need to debug a compiled program without debug info. Bonus points if you can get your assembly version to work correctly with C."
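The core loop of that exercise, sketched in Python rather than the C the article asks for (my own hypothetical `mini_format`, handling only %d, %s, and %%; no buffer-length handling, which is the part that makes the C version interesting):

```python
# Walk the format string; '%' introduces a conversion, everything else is
# copied through verbatim.
def mini_format(fmt, *args):
    out, args = [], list(args)
    i = 0
    while i < len(fmt):
        if fmt[i] == '%' and i + 1 < len(fmt):
            spec = fmt[i + 1]
            if spec == 'd':
                out.append(str(int(args.pop(0))))   # %d: next arg as integer
            elif spec == 's':
                out.append(str(args.pop(0)))        # %s: next arg as string
            elif spec == '%':
                out.append('%')                     # %%: literal percent sign
            i += 2
        else:
            out.append(fmt[i])
            i += 1
    return ''.join(out)
```

The C version adds the truncation logic, the NUL terminator, and varargs handling, which is exactly where the learning happens.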


snprintf is merely a template engine. The parsing part is in sscanf. That's where it gets interesting.



The Crenshaw tutorial is also great reading: https://compilers.iecc.com/crenshaw/

It starts from the "build a calculator" angle which I think is a very good way to get into compilers, since it emphasises the recursive nature of things, and extending the calculator to a full programming language becomes easier that way.



From 2016, it looks like.


Added. Thanks!



