
So today I have been writing a lexer and parser. The public interface is the parser; the lexer isn't exposed.

The problem is that if I delete all the tests for the lexer, then any bugs in the lexer will only be exposed through the parser's tests.

This makes no sense to me.
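
Concretely, the shape is something like this (a toy sketch, all names made up): _tokenize is private, parse is the only thing callers see.

    def _tokenize(src):
        tokens, i = [], 0
        while i < len(src):
            ch = src[i]
            if ch.isspace():
                i += 1
            elif ch.isdigit():
                j = i
                while j < len(src) and src[j].isdigit():
                    j += 1
                tokens.append(("INT", src[i:j]))
                i = j
            elif ch == "+":
                tokens.append(("PLUS", ch))
                i += 1
            else:
                raise SyntaxError(f"unexpected character {ch!r}")
        return tokens

    def parse(src):
        # grammar: INT (PLUS INT)* -- returns the sum
        tokens = _tokenize(src)
        if not tokens or tokens[0][0] != "INT":
            raise SyntaxError("expected integer")
        total, rest = int(tokens[0][1]), tokens[1:]
        while rest:
            if len(rest) < 2 or rest[0][0] != "PLUS" or rest[1][0] != "INT":
                raise SyntaxError("expected '+ <integer>'")
            total += int(rest[1][1])
            rest = rest[2:]
        return total

    assert parse("1 + 2 + 3") == 6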




The lexer is a unit then.

The lexer has a clear boundary from the parser.

The issue that takes experience here is how to determine what's a unit. "The whole program" is obviously too big. "every public method or function" is obviously too small.

Just be pragmatic.


> "The whole program" is obviously too big.

Of course.

> "every public method or function" is obviously too small.

Why "obviously"? If it's public, someone outside the class can call it. That's an external behavior.


If the class is only consumed in the context of one code unit (module, service, whatever) then the class itself is an implementation detail.


"every" being the operative word.

The "feel" for good code that comes with experience is not reducible in practice to a set of black-and-white rules.


Ideally, the lexer should be a system in and of itself, exposing a public interface that is consumed by its client, the parser.

Public doesn't necessarily mean "not you".


Even if your code never graduates to being used by multiple teams in your project or on others, “You” can turn into “you and your mentee” anyway, if you’re playing your cards right.


Or more trivially, "you and you half a year from now".


Every feature of the lexer should be testable through test cases written in the syntax of the language. That includes the handling of bad lexical syntax: for instance, a malformed floating-point constant or an unclosed string literal is testable without having to treat the lexer as a unit. It should be easy to come up with valid syntax that exercises every possible token kind, in all of its varieties.

For any token kind, it should be easy to come up with a minimal piece of syntax which includes that token.

If there is a lexical analysis case (whether a successful token extraction or an error) that is somehow not testable through the parser, then that is dead code.
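
A sketch of what those syntax-level tests could look like (pytest-style; "mylang" and its parse() are hypothetical stand-ins for whatever language is under test):

    import pytest
    from mylang import parse   # hypothetical module under test

    # one minimal piece of valid syntax per token kind
    @pytest.mark.parametrize("src", ["0", "1 + 2"])
    def test_every_token_kind_is_reachable(src):
        parse(src)             # should succeed without raising

    # bad lexical syntax should surface as a parse-level error
    @pytest.mark.parametrize("src", ["1.2.3", '"unterminated'])
    def test_lexical_errors_reach_the_surface(src):
        with pytest.raises(SyntaxError):
            parse(src)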

The division of the processing of a language into "parser" and "lexer" is arbitrary; it's an implementation detail which has to do with the fact that lexing requires lookahead and backtracking over multiple characters (and that is easily done with buffering techniques), whereas the simplest and fastest parsing algorithms like LALR(1) have only one symbol of lookahead.

Parsers and lexers sometimes end up integrated, in that the lexer may not know what to do without information from the parser. For instance a lex-generated lexer can have states in the form of start conditions. The parser may trigger these. That means that to get into certain states of the lexer, either the parser is required, or you need a mock-up of that situation: some test-only method that gets into that state.
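
A rough sketch of that kind of coupling (hypothetical names; the mode is analogous to a lex start condition): the lexer has a state that only the parser normally flips, so a direct test has to poke the state itself.

    class Lexer:
        def __init__(self, text):
            self.text, self.pos, self.mode = text, 0, "NORMAL"

        def set_mode(self, mode):
            # called by the parser when the grammar demands it,
            # much like a parser triggering a lex start condition
            self.mode = mode

        def next_token(self):
            if self.pos >= len(self.text):
                return ("EOF", "")
            if self.mode == "RAW":
                # consume everything up to the closing bracket verbatim
                end = self.text.find("]", self.pos)
                end = len(self.text) if end == -1 else end
                tok = ("RAW", self.text[self.pos:end])
                self.pos = end
                return tok
            ch = self.text[self.pos]
            self.pos += 1
            kinds = {"[": "LBRACKET", "]": "RBRACKET"}
            return (kinds.get(ch, "CHAR"), ch)

    # without a parser driving it, a test must poke the state directly
    lx = Lexer("[raw stuff]")
    assert lx.next_token() == ("LBRACKET", "[")
    lx.set_mode("RAW")              # normally the parser would do this
    assert lx.next_token() == ("RAW", "raw stuff")
    lx.set_mode("NORMAL")           # ...and switch back afterwards
    assert lx.next_token() == ("RBRACKET", "]")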

Basically, treating the lexer part of a lexer/parser combo as public interface is rarely going to be a good idea.


>For any token kind, it should be easy to come up with a minimal piece of syntax which includes that token.

There is the problem: any test that fails because of the lexer now reaches down through the parser into the lexer. The failure is too far from the point of the bug. I'll now spend my time trying to understand a problem that would have been obvious if the lexer were tested directly.

>Basically, treating the lexer part of a lexer/parser combo as public interface is rarely going to be a good idea.

This is part of the original point: the parser is the public interface, which is why the OP was suggesting it should be the only contact point for the tests.


When a test fails, your understanding is informed by the nature of the code change that is responsible.

If you keep code changes small, and keep tests working, you're good.


Writing a lexer/parser is one of the few software engineering tasks I do routinely where it's self-evident that TDD is useful and that the tests will remain useful afterwards.


Indeed! I recall a lexer and parser built via TDD with a test suite that specified every detail of a DSL. A few years later, both were rewritten completely from scratch while all the tests stayed the same. Once all the tests passed, the new implementation worked exactly as before, only much more efficiently.

From that experience, I would say that in some contexts, tests shouldn't be removed unless what they're testing is no longer in use.


So what?

If you have a good answer to that, then the lexer is a separate unit (as others said). If you don't, then write parser tests for the lexer so that you can more easily refactor the interface between them.

There is no one right answer, only trade-offs. You need to make the right decision for your situation. (Though I will note that there is probably a good reason lexing and parsing are generally separated, which suggests the best trade-off for you is to keep them separate. But if you decide differently, you are not necessarily wrong.)


> So what?

Well, I was responding to "you should delete tests for everything that isn't a required external behavior".


If bugs in the lexer never cause the parser to fail for any possible input, does it really have bugs? ;-)

Or, as @VHRanger pointed out, the lexer can be considered a unit and be tested independently.


Sounds like your lexer has a public interface to you.


These are rules of thumb, not laws.


The trouble is they get presented as laws, and then some jobsworth will make damn sure you are following the rules.


Rules of thumb are just low-fidelity windows allowing glimpses of poorly researched, not-yet-understood laws.



