I'm trying to write a very simple lexical analyzer (tokenizer) for C++ code from scratch, without using PLY or any other library.
Things I've done so far:
- Defined the keywords in a list and the operators in a dictionary.
- Defined the regular expressions for comments, literals, etc.
What I'm stuck with:
Problem 1:
Now I'm trying to write a function check_line(line) that consumes a line of code and returns its tokens in a dictionary. For example:
check_line('int main()')
The output should be:
Tokens = {'Keyword':'int', 'Keyword':'main', 'Opening parenthesis':'(','Closing Parenthesis':')'}
But the output I'm getting is:
Tokens = {'Keyword':'main', 'Keyword':'main', 'Opening parenthesis':'(','Closing Parenthesis':')'}
This happens because main is overwriting int here.
Is there a way to tackle something like this?
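The overwrite comes from how a Python dictionary works: it keeps only one value per key, so a second assignment to 'Keyword' replaces the first. The snippet below is a minimal sketch of that behaviour next to one possible alternative, a list of (type, value) pairs. The names tokens_as_dict and tokens_as_list are hypothetical and exist only for this illustration.

```python
# A dict holds exactly one value per key, so a repeated key is overwritten.
tokens_as_dict = {}
tokens_as_dict['Keyword'] = 'int'
tokens_as_dict['Keyword'] = 'main'   # replaces 'int'

# A list of (type, value) pairs keeps every token, in source order.
tokens_as_list = []
tokens_as_list.append(('Keyword', 'int'))
tokens_as_list.append(('Keyword', 'main'))

print(tokens_as_dict)   # only the last 'Keyword' survives
print(tokens_as_list)   # both keywords are kept
```

A list also preserves the order of the tokens, which a later parsing step would probably need.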
Problem 2:
When I call check_line('int main()'), the function doesn't match main because the parentheses are attached to it. How can I tackle this?
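The reason main is not matched is that splitting on spaces yields the word 'main()' rather than 'main'. One possible direction, shown here only as a rough sketch, is to pull the pieces out with a regular expression instead of splitting on spaces. The pattern LEXEME and the helper split_lexemes are hypothetical names, and the pattern assumes that a piece is either a word, a run of digits, or a single other non-space character.

```python
import re

# Hypothetical pattern: an identifier-like word, a run of digits,
# or any single non-space character (brackets, operators, ...).
LEXEME = re.compile(r'[A-Za-z_]\w*|\d+|\S')

def split_lexemes(line):
    """Split a line into words, numbers and single punctuation characters."""
    return LEXEME.findall(line)

print(split_lexemes('int main()'))   # ['int', 'main', '(', ')']
```

This sketch does not handle multi-character operators such as ++ or <<, string literals, or comments; those would need their own alternatives placed earlier in the pattern.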
Here is the code I've written so far; please have a look and let me know what you think.
import re
# Keywords
keywords = ['const','float','int','struct','break',
'continue','else','for','switch','void',
'case','enum','sizeof','typedef','char',
'do','if','return','union','while','new',
'public','class','friend','main']
# Regular Expression for Identifiers
re_id = '^[_]?[a-z]*[A-Z]([a-z]*[A-Z]*[0-9]+)'
# Regular Expression for Literals
re_int_lit = '^[+-]?[0-9]+'
re_float_lit = r'^[+-]?([0-9]*)\.[0-9]+'  # dot escaped so it matches a literal '.'
re_string_lit = '^"[a-zA-Z0-9_ ]+"$'
# Regular expression of Comments
re_singleline_comment = '^//[a-zA-Z0-9 ]*'
re_multiline_comment = r'^/\*(.*?)\*/'  # raw string so \* is not treated as a string escape
operators = {'=':'Assignment','-':'Subtraction',
'+':'Addition','*':'Multiplication',
'/':'Division','++':'increment',
'--':'Decrement','||':'OR', '&&':'AND',
'<<':'Cout operator','>>':'Cin Operator',
';':'End of statement'}
io = {'cin':'User Input',
'cout':'User Output'}
brackets = {'[':'Open Square',']':'Close Square',
'{':'Open Curly','}':'Close Curly',
'(':'Open Small',')':'Close Small'}
# Function
def check_line(line):
    tokens = {}
    words = line.split(' ')
    for word in words:
        if word in operators.keys():
            tokens['Operator ' + word] = word
        if word in keywords:
            tokens['Keywords'] = word
        if re.match(re_singleline_comment,word):
            break
    return tokens
check_line('int main()')
Output:
{'Keywords': 'main'}
The output should be:
Tokens = {'Keyword':'int', 'Keyword':'main', 'Opening parenthesis':'(','Closing Parenthesis':')'}
PS: I'm not done with the conditions yet, just trying to fix this first.
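For reference, here is a rough sketch of what check_line could look like if the line is split with a regular expression and the tokens are collected in a list of pairs rather than a dictionary. It is not a complete tokenizer: the tables are trimmed-down copies of the ones above so that the snippet runs on its own, and the LEXEME pattern and the token labels 'Integer literal', 'Identifier' and 'Unknown' are assumptions made for this example.

```python
import re

# Trimmed-down copies of the tables above, so the sketch is self-contained.
keywords = {'int', 'float', 'void', 'return', 'main'}
brackets = {'(': 'Open Small', ')': 'Close Small',
            '{': 'Open Curly', '}': 'Close Curly'}
operators = {'=': 'Assignment', '+': 'Addition', ';': 'End of statement'}

# Hypothetical pattern: a word, a run of digits, or one other non-space character.
LEXEME = re.compile(r'[A-Za-z_]\w*|\d+|\S')

def check_line(line):
    """Return the tokens of one line as a list of (type, value) pairs."""
    tokens = []
    for lexeme in LEXEME.findall(line):
        if lexeme in keywords:
            tokens.append(('Keyword', lexeme))
        elif lexeme in brackets:
            tokens.append((brackets[lexeme], lexeme))
        elif lexeme in operators:
            tokens.append((operators[lexeme], lexeme))
        elif lexeme.isdigit():
            tokens.append(('Integer literal', lexeme))
        elif lexeme[0].isalpha() or lexeme[0] == '_':
            tokens.append(('Identifier', lexeme))
        else:
            tokens.append(('Unknown', lexeme))
    return tokens

print(check_line('int main()'))
```

For 'int main()' this prints [('Keyword', 'int'), ('Keyword', 'main'), ('Open Small', '('), ('Close Small', ')')], so both keywords and both parentheses are kept.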