I am trying to write a regex to split a string into what I call 'terms' (e.g. words, numbers, and surrounding spaces) and 'logical operators' (e.g. <AND, &>, <OR, |>, <NOT,-,~>, <(,{,[,),},]>). For this question, we can ignore the alternative symbols for AND, OR, and NOT, and grouping is just with '(' and ')'.
For example:
Frank and Bob are nice AND NOT (Henry is good OR Sam is 102 years old)
should be split into this Python list:
["Frank and Bob are nice", "AND", "NOT", "(", "Henry is good", "OR", "Sam is 102 years old", ")"]
My code:
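import re

pattern = r"(NOT|-|~)?\s*(\(|\[|\{)?\s*(NOT|-|~)?\s*([\w+\s*]*)\s+(AND|&|OR|\|)?\s+(NOT|-|~)?\s*([\w+\s*]*)\s*(\)|\]|\})?"
t = re.split(pattern, text)  # text holds the input string, e.g. the example above
raw_terms = list(filter(None, t))  # drop None and empty strings left by unmatched groups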
The pattern works for the example above and for other test cases such as this one:
NOT Frank is a good boy AND Joe
raw_terms=['NOT', 'Frank is a good boy', 'AND', 'Joe']
but not these:
NOT Frank
raw_terms = ['NOT Frank']
NOT Frank is a good boy
raw_terms=['NOT Frank is a good boy']
I have tried changing the two \s+ to \s*, but then not all of the test cases pass. I am not a regex expert (this is the most complicated regex I have tried to write).
I am hoping someone can help me understand why these two test cases fail, and how to fix the regex so all the test cases pass.
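Note: one likely reading of the failures is that the middle of the pattern, \s+(AND|&|OR|\|)?\s+, needs either an operator surrounded by whitespace or two consecutive whitespace characters, and neither failing input has that, so re.split finds no match and returns the whole string. For comparison, here is a minimal sketch of a simpler approach that splits only on the operators instead of trying to match a whole expression. It assumes the simplified rules from the question (uppercase AND, OR, NOT and only '(' and ')' for grouping); the names OPERATOR and split_terms are made up for illustration.

```python
import re

# Assumption: operators are the case-sensitive words AND, OR, NOT plus "(" and ")".
# The capturing group makes re.split keep each operator in the result list,
# and the \s* on both sides swallows the whitespace around it.
OPERATOR = re.compile(r"\s*(\bAND\b|\bOR\b|\bNOT\b|[()])\s*")

def split_terms(text):
    """Split text into terms and logical operators, dropping empty pieces."""
    pieces = OPERATOR.split(text)
    # Adjacent operators (e.g. "AND NOT (") leave empty strings between them.
    return [p.strip() for p in pieces if p and p.strip()]

print(split_terms("Frank and Bob are nice AND NOT (Henry is good OR Sam is 102 years old)"))
print(split_terms("NOT Frank"))
print(split_terms("NOT Frank is a good boy"))
```

Because the match is case-sensitive and uses word boundaries, the lowercase "and" in "Frank and Bob" stays inside the term. The alternative symbols (&, |, -, ~, and the other brackets) are deliberately left out of this sketch.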
Thanks,
Mark
question from:
https://stackoverflow.com/questions/65836273/python-regex-fails-on-2-edge-cases