Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share

Python Regex Sentence Finder-Want to Ignore "a.m."

I am developing a regex to find sentences, and I would like to ignore abbreviations that cause the regex to terminate before the end of the sentence. For example, I want to ignore "a.m." so that it returns "At 9:00 a.m. the store opens." instead of "At 9:00 a.m."

import re

def sentence_finder(x):
    # Lazily match from a capital letter up to closing punctuation followed by whitespace or end of text
    regex_object = re.compile(r'[A-Z].+?(?!a\.m\.)\w+[.?!](?!\S)')
    return regex_object.findall(x)

I get back the following when I run pytest:

def test_pass_Ignore_am():
>       assert DuplicateSentences.sentence_finder("At 9:00 a.m. the store opens.") == ["At 9:00 a.m. the store opens."]
E       AssertionError: assert ['At 9:00 a.m.'] == ['At 9:00 a.m...store opens.']
E         At index 0 diff: 'At 9:00 a.m.' != 'At 9:00 a.m. the store opens.'

What am I doing wrong?

question from:https://stackoverflow.com/questions/65911590/python-regex-sentence-finder-want-to-ignore-a-m


1 Reply


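The negative lookahead in your pattern only rejects a match attempt that starts exactly at the "a" of a.m. The engine then moves one character further, where the text ahead is "m. the ..." rather than "a.m.", so the match still ends at the dot after "m". Instead, you could use a negative lookbehind placed after the closing punctuation, to assert that the text just matched does not end in a.m.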

[A-Z].*?\w[.?!](?<!a\.m\.)(?!\S)

Explanation

  • [A-Z] Match an uppercase character A-Z
  • .*? Match any character except a newline, 0 or more times, as few as possible
  • \w[.?!] Match a word character followed by either . ? or !
  • (?<!a\.m\.) Negative lookbehind to assert that the text directly to the left is not a.m.
  • (?!\S) Assert a whitespace boundary to the right (the next character is whitespace or the end of the string)
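
As a minimal sketch, the pattern can be dropped into the question's sentence_finder function as shown here. It assumes the dots in a.m. are escaped and that the \w and \S character classes are intended. The name SENTENCE_RE is only illustrative and does not come from the original post.

```python
import re

# Pattern from the answer, written with escaped dots and the \w / \S classes (assumed).
# SENTENCE_RE is a hypothetical name used for this sketch.
SENTENCE_RE = re.compile(r'[A-Z].*?\w[.?!](?<!a\.m\.)(?!\S)')

def sentence_finder(text):
    """Return sentence-like spans; a dot that closes "a.m." is not treated as a terminator."""
    return SENTENCE_RE.findall(text)

print(sentence_finder("At 9:00 a.m. the store opens."))
print(sentence_finder("At 9:00 a.m. the store opens. Is it busy? Yes!"))
```

Two things to keep in mind with this approach: Python's re module only accepts fixed-width lookbehinds, which a.m. satisfies, and a sentence that genuinely ends in a.m. will not be matched at all, because its final dot is rejected by the lookbehind.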


