I am writing a lexer for a simple language (Gherkin).
While some of the lexer is done, I am struggling with a design decision.
Currently, the lexer has an examples mode and a step mode.
That means it has to track context, which I would rather not do.
I want to make the lexer as dumb as possible, so that most of the work is done by the parser.
My problem with the current approach is that I don't know whether the lexer should distinguish between Syntax Tokens and Literals in certain cases.
For a better understanding, here is a brief overview of the language.
The language has syntax tokens such as : < > | @
The language can have variables, written as <Name>.
The language has an examples section, where the syntax tokens differ from the rest of the test case.
An example table looks like this:
Examples:
| Name | Last Name |
| John | Doe |
A full test written in Gherkin (with unneeded information stripped out) looks like this:
@Fancy-Test
Scenario Outline: User logs in
Given user is on login_view
And user enters <Username> in username_field
And user enters <Password> in password_field
And user answers <Qu|estion>
When user clicks on login_button
Then user is logged in
Examples:
|Username|Password|Qu\|estion|
|JohnDoe11|Test<Pass>@@Word|Who am I|
Note how I escaped | in the first row of the Examples table.
Also take note of all the syntax characters in the password example.
By escaping the | character, I can use it in the examples part of the test without it being detected as a Syntax Token.
But for the variable in the line And user answers <Qu|estion> I don't need or want to escape it.
By language specification, the example entries can contain any character except |, unless it is escaped, as it marks the end of a column.
That means no other syntax character should be detected as a Syntax Token.
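To make the rule concrete, here is a minimal Python sketch of splitting one Examples row into cells. The function name split_row is made up for illustration, and it assumes that a backslash is the escape character; that is my assumption for the sketch, not something fixed by the rule above.

```python
def split_row(line: str) -> list[str]:
    """Split one Examples row into cells (hypothetical helper).

    Assumption: a backslash directly before | makes the pipe literal.
    """
    cells = []
    current = []
    chars = iter(line.strip())
    for ch in chars:
        if ch == "\\":
            nxt = next(chars, "")
            # An escaped pipe becomes a plain pipe inside the cell;
            # any other backslash sequence is kept unchanged.
            current.append("|" if nxt == "|" else ch + nxt)
        elif ch == "|":
            # An unescaped pipe closes the current cell.
            cells.append("".join(current))
            current = []
        else:
            # Every other character, including < > @ :, is plain cell text.
            current.append(ch)
    # Drop whatever came before the first pipe; text after the last pipe
    # was never appended, so it is ignored as well.
    return cells[1:]


print(split_row("|JohnDoe11|Test<Pass>@@Word|Who am I|"))
```

Inside a row, | is the only character that gets special treatment, so the password cell comes back unchanged.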
Without two modes, all the syntax characters in the password example would be detected as such tokens.
The opposite is the case for the other part of the tests.
Unless at the start of a new line (where @ and : are Syntax Tokens), only < and > should be considered part of the syntax.
The current implementation handles this by having the two modes mentioned above, which I don't think is the best solution.
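For reference, the two-mode approach looks roughly like the Python sketch below. It is a simplified illustration, not my actual code: the token names (KEYWORD, PIPE, ANGLE, TEXT, ...) are invented here, a backslash is assumed as the escape character, and only @ at the start of a line and the colon after Examples are handled as line-start syntax.

```python
import re


def lex(source: str) -> list[tuple[str, str]]:
    """Two-mode lexer sketch: 'step' mode and 'examples' mode."""
    tokens = []
    mode = "step"
    for raw in source.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("Examples:"):
            # Context switch: the rows that follow are lexed in examples mode.
            mode = "examples"
            tokens += [("KEYWORD", "Examples"), ("COLON", ":")]
        elif mode == "examples" and line.startswith("|"):
            # Examples mode: only an unescaped | is syntax.
            for cell in re.split(r"(?<!\\)\|", line)[1:-1]:
                tokens += [("PIPE", "|"), ("TEXT", cell.replace("\\|", "|"))]
            tokens.append(("PIPE", "|"))
        else:
            mode = "step"
            if line.startswith("@"):
                tokens.append(("AT", "@"))
                line = line[1:]
            # Step mode: only < and > are syntax inside the line.
            for part in re.split(r"([<>])", line):
                if part in ("<", ">"):
                    tokens.append(("ANGLE", part))
                elif part:
                    tokens.append(("TEXT", part))
        tokens.append(("NEWLINE", "\n"))
    return tokens
```

The mode variable is exactly the context I would like to get rid of: the same character produces a different token depending on which line came before it.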
My question therefore is:
Should the lexer just detect them as Syntax Tokens, which then get picked up by the parser, which figures out that they are actually part of the literal?
Or is having context in the lexer the preferable way?
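To illustrate the first option, here is a minimal Python sketch of what I have in mind. The names dumb_lex and parse_example_row are hypothetical, and escaped pipes are left out to keep it short.

```python
SYNTAX = set(":<>|@")


def dumb_lex(line: str) -> list[tuple[str, str]]:
    """Context-free lexer: every syntax character becomes a SYNTAX token."""
    tokens = []
    text = []
    for ch in line:
        if ch in SYNTAX:
            if text:
                tokens.append(("TEXT", "".join(text)))
                text = []
            tokens.append(("SYNTAX", ch))
        else:
            text.append(ch)
    if text:
        tokens.append(("TEXT", "".join(text)))
    return tokens


def parse_example_row(tokens: list[tuple[str, str]]) -> list[str]:
    """Parser side: in an Examples row only | separates cells.

    All other tokens, SYNTAX or TEXT, are glued back into the cell literal.
    """
    cells = []
    current = None  # stays None until the first pipe has been seen
    for kind, value in tokens:
        if kind == "SYNTAX" and value == "|":
            if current is not None:
                cells.append("".join(current))
            current = []
        elif current is not None:
            current.append(value)
    return cells


row = dumb_lex("|JohnDoe11|Test<Pass>@@Word|Who am I|")
print(parse_example_row(row))
```

This only works because the lexer keeps every character, including whitespace, in its tokens, so the parser can rebuild the literal without losing anything.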
Thank you for answering.
question from:
https://stackoverflow.com/questions/65878328/should-a-lexer-be-able-to-distinguish-between-syntax-tokens-contained-in-a-vari