Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Facing an issue while making a lexical analyzer for C++ code in Python

I'm trying to write a very simple lexical analyzer (tokenizer) for C++ code from scratch, without using PLY or any other library.

Things I've done so far:

  • Defined the keywords in a list and the operators in a dictionary.
  • Defined the regular expressions for comments, literals, etc.

What I'm stuck with:

Problem 1:

Now I'm trying to write a function check_line(line) that consumes a line of code and returns its tokens in a dictionary. For example:

check_line('int main()')

The output should be:

Tokens = {'Keyword':'int', 'Keyword':'main', 'Opening parenthesis':'(','Closing Parenthesis':')'}

But the output I'm getting is:

Tokens = {'Keyword':'main', 'Keyword':'main', 'Opening parenthesis':'(','Closing Parenthesis':')'}

This happens because main overwrites int under the same key.

Is there a way to tackle something like this?

Problem 2:

When I call check_line('int main()'), the program doesn't match main because the parentheses are attached to it. How can I tackle this?

I'm pasting the code I've written so far, please have a look and let me know what you think.

import re

# Keywords
keywords = ['const','float','int','struct','break',
            'continue','else','for','switch','void',
            'case','enum','sizeof','typedef','char',
            'do','if','return','union','while','new',
            'public','class','friend','main']


# Regular Expression for Identifiers
re_id = '^[_]?[a-z]*[A-Z]([a-z]*[A-Z]*[0-9]+)'

# Regular Expression for Literals
re_int_lit = '^[+-]?[0-9]+'
re_float_lit = '^[+-]?([0-9]*).[0-9]+'
re_string_lit = '^"[a-zA-Z0-9_ ]+"$'

# Regular expression of Comments
re_singleline_comment = '^//[a-zA-Z0-9 ]*'
re_multiline_comment = r'^/\*(.*?)\*/'

operators = {'=':'Assignment','-':'Subtraction',
             '+':'Addition','*':'Multiplication',
            '/':'Division','++':'increment',
            '--':'Decrement','||':'OR', '&&':'AND',
            '<<':'Cout operator','>>':'Cin Operator',
            ';':'End of statement'}

io = {'cin':'User Input',
      'cout':'User Output'} 

brackets = {'[':'Open Square',']':'Close Square',
           '{':'Open Curly','}':'Close Curly',
           '(':'Open Small',')':'Close Small'}


# Function

def check_line(line):
    tokens = {}
    words = line.split(' ')
    for word in words:
        if word in operators.keys():
            tokens['Operator ' + word] = word

        if word in keywords:
            tokens['Keywords'] = word
        
        if re.match(re_singleline_comment,word):
            break
       
    return tokens


check_line('int main()')

Output:

{'Keywords': 'main'}

The output should be:

Tokens = {'Keyword':'int', 'Keyword':'main', 'Opening parenthesis':'(','Closing Parenthesis':')'}

PS: I'm not done with the conditions yet, just trying to fix this first.



1 Reply


A dictionary is a poor choice of data structure for this function, because each key in a dictionary is associated with exactly one value. Storing a second token under 'Keyword' therefore replaces the first one, which is why int disappears from your output.
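
As a small illustration, the sketch below stores the two tokens from the question first in a dictionary and then in a list of tuples. The variable names are made up for this example, and treating main as a keyword simply follows the question's own keyword list.

```python
# A dict keeps only the last value stored under a given key.
tokens_dict = {}
tokens_dict['Keyword'] = 'int'
tokens_dict['Keyword'] = 'main'   # replaces 'int'

# A list keeps every token, in input order.
tokens_list = []
tokens_list.append(('Keyword', 'int'))
tokens_list.append(('Keyword', 'main'))

print(tokens_dict)   # {'Keyword': 'main'}
print(tokens_list)   # [('Keyword', 'int'), ('Keyword', 'main')]
```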

What a tokenizer should return is quite different: an ordered stream of token objects. In a simple implementation, that might be a list of tuples, but for any non-trivial application, you'll soon find that:

  1. Tokens are not just a syntactic type and a string. There's lots of important auxiliary information, most notably the location of the token in the input stream (for error messages).

  2. Tokens are almost always consumed in sequence, and there is no particular advantage in producing more than one at a time. In Python, a generator is a much more natural way of producing a stream of tokens. If it were useful to create a list of tokens (for example, to implement a back-tracking parser), there would be no point working line by line, since line breaks are generally irrelevant in C++.
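
The two points above could be sketched roughly as follows. This is only an illustration under stated assumptions: the Token class, its field names, and the tokens() generator are hypothetical, and the single crude pattern (a run of word characters, or any one non-space character) is nowhere near a real C++ lexer. The point is the shape of the interface: each token carries its location, and tokens are produced lazily over the whole source rather than line by line.

```python
import re
from typing import Iterator, NamedTuple


class Token(NamedTuple):
    kind: str     # syntactic type: 'word' or 'punct' in this sketch
    text: str     # the matched source text
    line: int     # 1-based line number
    column: int   # 1-based column number


# Deliberately crude: a run of word characters, or any single
# non-space character. Real C++ needs far more careful patterns.
_CRUDE_TOKEN = re.compile(r'(?P<word>\w+)|(?P<punct>\S)')


def tokens(source: str) -> Iterator[Token]:
    """Lazily yield tokens for the whole source text."""
    for match in _CRUDE_TOKEN.finditer(source):
        start = match.start()
        # Count the newlines before the token to get its line number.
        line = source.count('\n', 0, start) + 1
        # rfind returns -1 when there is no earlier newline, which
        # makes the first column of the first line come out as 1.
        column = start - source.rfind('\n', 0, start)
        yield Token(match.lastgroup, match.group(), line, column)


for tok in tokens('int main()\n{ return 0; }'):
    print(tok)
```

A parser can pull tokens from this generator one at a time with next() or a for loop, and an error message can quote tok.line and tok.column directly.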

As noted in a comment, C++ tokens are not always separated by whitespace, as is evident in your example input. (main() is three tokens without containing a single space character.) The best way of splitting program text into a token stream is to repeatedly match token patterns at the current input cursor, return the longest match, and move the input cursor over the match.
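
A minimal sketch of that loop is shown below. The pattern table, the keyword set, and the tokenize() name are assumptions made for illustration and cover only a small fraction of C++. At each cursor position every pattern is tried, the longest match wins (the earlier pattern wins a tie), and the cursor advances past the matched text.

```python
import re

# Hypothetical, heavily reduced token patterns; nowhere near full C++.
TOKEN_PATTERNS = [
    ('comment',    re.compile(r'//[^\n]*|/\*.*?\*/', re.DOTALL)),
    ('float',      re.compile(r'[0-9]*\.[0-9]+')),
    ('integer',    re.compile(r'[0-9]+')),
    ('string',     re.compile(r'"(?:[^"\\\n]|\\.)*"')),
    ('identifier', re.compile(r'[A-Za-z_][A-Za-z0-9_]*')),
    ('operator',   re.compile(r'\+\+|--|\|\||&&|<<|>>|[=+\-*/;]')),
    ('bracket',    re.compile(r'[\[\]{}()]')),
]
KEYWORDS = {'int', 'float', 'return', 'if', 'else', 'while', 'for', 'void'}
WHITESPACE = re.compile(r'\s+')


def tokenize(source):
    """Yield (kind, text, offset) using the longest match at the cursor."""
    pos = 0
    while pos < len(source):
        # Whitespace separates tokens but is not a token itself.
        space = WHITESPACE.match(source, pos)
        if space:
            pos = space.end()
            continue
        best_kind, best_text = None, ''
        for kind, pattern in TOKEN_PATTERNS:
            m = pattern.match(source, pos)
            # Strict '>' means the earlier pattern wins a tie.
            if m and len(m.group()) > len(best_text):
                best_kind, best_text = kind, m.group()
        if best_kind is None:
            raise SyntaxError(
                f'unexpected character {source[pos]!r} at offset {pos}')
        # Keywords look like identifiers, so reclassify after matching.
        if best_kind == 'identifier' and best_text in KEYWORDS:
            best_kind = 'keyword'
        yield best_kind, best_text, pos
        pos += len(best_text)


print([(kind, text) for kind, text, _ in tokenize('int main()')])
# [('keyword', 'int'), ('identifier', 'main'), ('bracket', '('), ('bracket', ')')]
```

Because matching starts at the cursor rather than at whitespace boundaries, main() splits into three tokens without any spaces in the input, and x++ yields the increment operator rather than two additions.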

