Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Python Regex Fails on 2 Edge Cases

I am trying to write a regex that splits a string into what I call 'terms' (words, numbers, and the spaces between them) and 'logical operators' (<AND, &>, <OR, |>, <NOT, -, ~>, and the grouping characters <(, {, [, ), }, ]>). For this question, we can ignore the alternative symbols for AND, OR, and NOT, and assume grouping uses only '(' and ')'.

For example:

Frank and Bob are nice AND NOT (Henry is good OR Sam is 102 years old)

should be split into this Python list:

["Frank and Bob are nice", "AND", "NOT", "(", "Henry is good", "OR", "Sam is 102 years old", ")"]

My code:

import re

pattern = r"(NOT|-|~)?\s*(\(|\[|\{)?\s*(NOT|-|~)?\s*([\w+\s*]*)\s+(AND|&|OR|\|)?\s+(NOT|-|~)?\s*([\w+\s*]*)\s*(\)|\]|\})?"
t = re.split(pattern, text)
raw_terms = list(filter(None, t))

The pattern works for the example above and for other test cases, such as this one,

NOT Frank is a good boy AND Joe
raw_terms=['NOT', 'Frank is a good boy', 'AND', 'Joe']

but not these:

NOT Frank
raw_terms = ['NOT Frank']
NOT Frank is a good boy
raw_terms=['NOT Frank is a good boy']

I have tried changing the two \s+ to \s*, but then not all test cases pass. I am not a regex expert (this is the most complicated regex I have tried).

I am hoping someone can help me understand why these two test cases fail, and how to fix the regex so all the test cases pass.

Thanks,

Mark

Question from: https://stackoverflow.com/questions/65836273/python-regex-fails-on-2-edge-cases

1 Reply

Use

re.split(r'\s*(\b(?:AND|OR|NOT)\b|[()])\s*', string)

Explanation

--------------------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    \b                       the boundary between a word char (\w)
                             and something that is not a word char
--------------------------------------------------------------------------------
    (?:                      group, but do not capture:
--------------------------------------------------------------------------------
      AND                      'AND'
--------------------------------------------------------------------------------
     |                        OR
--------------------------------------------------------------------------------
      OR                       'OR'
--------------------------------------------------------------------------------
     |                        OR
--------------------------------------------------------------------------------
      NOT                      'NOT'
--------------------------------------------------------------------------------
    )                        end of grouping
--------------------------------------------------------------------------------
    \b                       the boundary between a word char (\w)
                             and something that is not a word char
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    [()]                     any character of: '(', ')'
--------------------------------------------------------------------------------
  )                        end of \1
--------------------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))

Python code:

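import re

string = 'Frank and Bob are nice AND NOT (Henry is good OR Sam is 102 years old)'
# Split on operators and parentheses; the capturing group keeps them in the result.
output = re.split(r'\s*(\b(?:AND|OR|NOT)\b|[()])\s*', string)
# Drop the empty strings produced where two delimiters are adjacent.
output = list(filter(None, output))
print(output)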

Results: ['Frank and Bob are nice', 'AND', 'NOT', '(', 'Henry is good', 'OR', 'Sam is 102 years old', ')']
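As a rough check against the two edge cases from the question, the sketch below runs both patterns on those inputs. The ORIGINAL pattern here is a reconstruction of the asker's regex with its backslashes restored, so treat it as an assumption; tokenize is a hypothetical helper name. In that reconstruction, the two \s+ are separated only by an optional AND/OR group, so the pattern needs either two adjacent whitespace characters or an operator with whitespace on both sides. Neither short input has that, so re.split finds nothing to split on and returns the whole string.

```python
import re

# Asker's pattern with backslashes restored (a reconstruction, hence an assumption).
ORIGINAL = (r"(NOT|-|~)?\s*(\(|\[|\{)?\s*(NOT|-|~)?\s*([\w+\s*]*)\s+"
            r"(AND|&|OR|\|)?\s+(NOT|-|~)?\s*([\w+\s*]*)\s*(\)|\]|\})?")
# Pattern from the answer.
SIMPLE = r'\s*(\b(?:AND|OR|NOT)\b|[()])\s*'


def tokenize(text):
    """Split text into terms and operators, dropping empty strings."""
    return [part for part in re.split(SIMPLE, text) if part]


edge_cases = ['NOT Frank', 'NOT Frank is a good boy']

# Does the original pattern match anywhere in each edge case?
original_matches = [re.search(ORIGINAL, case) is not None for case in edge_cases]
tokens = [tokenize(case) for case in edge_cases]

print(original_matches)  # [False, False]
print(tokens)            # [['NOT', 'Frank'], ['NOT', 'Frank is a good boy']]
```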

