parsing - Should a Lexer be able to distinguish between Syntax Tokens contained in a "variable" and actual Syntax Tokens

posted Oct 7, 2021 in Technique by 深蓝 (71.8m points)

I am writing a lexer for a simple language (Gherkin).

While some of the lexer is done, I am struggling with a design decision.

Currently, the lexer has an examples mode and a step mode. That means it has to track context, which I would rather not do. I want to make the lexer as dumb as possible, so that most of the work is done by the parser.

My problem with the current approach is that I don't know if the lexer should distinguish Syntax and Literals in certain cases.

For a better understanding, here is a brief overview of the language.

The language has syntax tokens such as : < > | @
The language can have variables, written as <Name>.
The language has an Examples section, where the syntax tokens differ from those in the rest of the test case.

An example table looks like this:

Examples:
| Name | Last Name |
| John | Doe |

A full test written in Gherkin (with unneeded information stripped out) looks like this:

@Fancy-Test
    Scenario Outline: User logs in 
    
    Given user is on login_view
    And user enters <Username> in username_field
    And user enters <Password> in password_field
    And user answers <Qu|estion>
    When user clicks on login_button
    Then user is logged in
    
    Examples:
    |Username|Password|Qu\|estion|
    |JohnDoe11|Test<Pass>@@Word|Who am I|

Note how I escaped | in the first Examples row.

Also take note of all the syntax characters in the password example.

By escaping the | character, I can use it in the examples part of the test without it getting detected as a Syntax Token.

But for the variable in the line "And user answers <Qu|estion>" I don't need or want to escape it. By the language specification, the example entries can contain any character except an unescaped |, because | marks the end of a column.

That means no other syntax character should be detected as a Syntax Token. Without two modes, all the syntax characters in the password example would be detected as such tokens.
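To make the problem concrete, here is a minimal Python sketch of what a single-mode lexer would do with the password cell. The function name naive_tokens and the token labels are made up for illustration; this is not the actual lexer.

```python
import re

# Hypothetical single-mode tokenizer: every syntax character is always
# reported as a token, regardless of where it appears.
SYNTAX = re.compile(r"[:<>|@]")

def naive_tokens(line):
    """Split a line into ("SYNTAX", ch) and ("TEXT", s) tokens, with no context."""
    tokens = []
    pos = 0
    for match in SYNTAX.finditer(line):
        if match.start() > pos:
            # Plain text between two syntax characters.
            tokens.append(("TEXT", line[pos:match.start()]))
        tokens.append(("SYNTAX", match.group()))
        pos = match.end()
    if pos < len(line):
        tokens.append(("TEXT", line[pos:]))
    return tokens

# The password cell from the Examples table is split into seven tokens,
# four of which are syntax tokens that the parser would have to glue back.
print(naive_tokens("Test<Pass>@@Word"))
```

With this approach the parser has to reassemble the literal from the pieces, which only moves the context tracking from the lexer to the parser.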

The opposite is the case for the other part of the tests. Unless at the start of a new line (where @ and : are Syntax Tokens), only < and > should be considered part of the syntax.

The current implementation prevents this by having the two modes mentioned, which is not the best solution.

My question therefore is: Should the lexer just detect them as Syntax Tokens, which then get picked up by the parser, which figures out that they are actually part of the literal? Or is having context the preferable way?

Thank you for answering.

question from:https://stackoverflow.com/questions/65878328/should-a-lexer-be-able-to-distinguish-between-syntax-tokens-contained-in-a-vari

1 Reply

深蓝 · Answer 1 · 2021-10-06T19:21:31+0000

If you have two different lexical environments, then you have two different lexical environments. They need to be handled differently. Almost all real-world programming languages feature this kind of complication, and most lexer generators have mechanisms designed to help maintain a moderate amount of lexical state.

The problem is figuring out how to do the transitions between the different lexical contexts. As you note, that can be a lot of work, which is ugly. If it's really ugly, you might want to revisit your language design, because it is not just your parser which has to be able to predict which lexical context applies where: any human being reading the code also needs to understand that, along with all of the subtleties built into the algorithm. If you can't describe the algorithm in a couple of clear sentences, you'll be putting quite a burden on code readers.

In the case of Gherkin, it looks to me like the tables are fairly easy to recognise: they start with a line whose first token is | and presumably continue until you reach a line whose first token is not a |. So it should be pretty straightforward to switch lexical contexts, particularly as your lexer probably already needs to be aware of line endings.
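As a rough illustration of that idea, here is a minimal Python sketch of a line-oriented lexer that picks the lexical context from the first character of each line. All names (lex_line, lex_table_row, lex_step_line, the token labels) are hypothetical. It assumes that \| is the escape for a literal pipe, that table rows both start and end with |, and it leaves out tags and keywords at the start of a line.

```python
import re

# Split on "|" only when it is not preceded by a backslash (assumed escape).
CELL_SPLIT = re.compile(r"(?<!\\)\|")
# A variable is anything between "<" and ">" on a step line.
VARIABLE = re.compile(r"<([^<>]+)>")

def lex_table_row(line):
    """Table context: the only syntax token is an unescaped "|"."""
    # Drop the empty strings before the first "|" and after the last one.
    cells = CELL_SPLIT.split(line.strip())[1:-1]
    return [("CELL", cell.strip().replace("\\|", "|")) for cell in cells]

def lex_step_line(line):
    """Step context: only <...> is syntax; everything else is plain text."""
    tokens = []
    pos = 0
    for match in VARIABLE.finditer(line):
        if match.start() > pos:
            tokens.append(("TEXT", line[pos:match.start()]))
        tokens.append(("VARIABLE", match.group(1)))
        pos = match.end()
    if pos < len(line):
        tokens.append(("TEXT", line[pos:]))
    return tokens

def lex_line(line):
    """Choose the lexical context from the first non-blank character."""
    stripped = line.strip()
    if stripped.startswith("|"):
        return lex_table_row(stripped)
    return lex_step_line(stripped)

print(lex_line("    |JohnDoe11|Test<Pass>@@Word|Who am I|"))
print(lex_line("    And user answers <Qu|estion>"))
```

Because the decision is made once per line, the only state the lexer needs is "where does the current line start", which it most likely tracks already. A backslash that is itself escaped before a pipe is not handled in this sketch.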
