Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

A regex for extracting sentences from a paragraph in Python

I'm trying to split a paragraph into sentences using regular expressions in Python.
The code I'm testing usually extracts the sentences correctly, but for the following paragraph it does not.

The paragraph:

"But in the case of malaria infections and sepsis, dendritic cells throughout the body are concentrated on alerting the immune system, which prevents them from detecting and responding to any new infections." A new type of vaccine?

The code:

def splitParagraphIntoSentences(paragraph):
    import re

    # Split after ., ! or ? followed by 1-2 whitespace characters and a capital letter.
    sentenceEnders = re.compile(r'[.!?][\s]{1,2}(?=[A-Z])')
    sentenceList = sentenceEnders.split(paragraph)
    return sentenceList

if __name__ == '__main__':
    f = open("bs.txt", 'r')
    text = f.read()
    mylist = []
    sentences = splitParagraphIntoSentences(text)
    for s in sentences:
        mylist.append(s.strip())
        for i in mylist:
            print(i)

When tested with the paragraph above, the output is exactly the same as the input, but it should look like this:

But in the case of malaria infections and sepsis, dendritic cells throughout the body are concentrated on alerting the immune system, which prevents them from detecting and responding to any new infections

A new type of vaccine

Is there anything wrong with the regular expression?

1 Reply

Riccardo Murri's answer is correct, but I thought I'd throw a bit more light on the subject.

A similar question was asked about PHP: "php sentence boundaries detection". My answer there handles exceptions such as "Mr.", "Mrs." and "Jr.". I've adapted that regex to work with Python, whose re module places more restrictions on lookbehinds (each lookbehind must match a fixed-width string). Here is a modified and tested version of your script which uses this new regex:

def splitParagraphIntoSentences(paragraph):
    import re
    sentenceEnders = re.compile(r"""
        # Split sentences on whitespace between them.
        (?:               # Group for two positive lookbehinds.
          (?<=[.!?])      # Either an end of sentence punct,
        | (?<=[.!?]['"])  # or end of sentence punct and quote.
        )                 # End group of two positive lookbehinds.
        (?<!  Mr\.   )    # Don't end sentence on "Mr."
        (?<!  Mrs\.  )    # Don't end sentence on "Mrs."
        (?<!  Jr\.   )    # Don't end sentence on "Jr."
        (?<!  Dr\.   )    # Don't end sentence on "Dr."
        (?<!  Prof\. )    # Don't end sentence on "Prof."
        (?<!  Sr\.   )    # Don't end sentence on "Sr."
        \s+               # Split on whitespace between sentences.
        """, 
        re.IGNORECASE | re.VERBOSE)
    sentenceList = sentenceEnders.split(paragraph)
    return sentenceList

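if __name__ == '__main__':
    # Read the whole file; the with-block closes it afterwards.
    with open("bs.txt", 'r') as f:
        text = f.read()
    sentences = splitParagraphIntoSentences(text)
    # Strip stray whitespace from each sentence before printing.
    mylist = [s.strip() for s in sentences]
    for i in mylist:
        print(i)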
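
As a side note on the lookbehind restriction mentioned earlier, here is a small sketch (the helper name compiles and the two constant names are made up for this example). Python's re module, at least in the versions I'm aware of, rejects a single lookbehind whose alternatives differ in length, which is why the regex uses an alternation of two separate fixed-width lookbehinds instead:

```python
import re

# One lookbehind whose alternatives differ in length (1 vs. 2 characters).
VARIABLE_WIDTH = r"""(?<=[.!?]|[.!?]['"])\s+"""
# The same idea as an alternation of two separate fixed-width lookbehinds.
FIXED_WIDTH = r"""(?:(?<=[.!?])|(?<=[.!?]['"]))\s+"""


def compiles(pattern):
    """Return True if the re module accepts the pattern, False otherwise."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


print(compiles(VARIABLE_WIDTH))
print(compiles(FIXED_WIDTH))
```

The first pattern should be refused with a "look-behind requires fixed-width pattern" error, while the second compiles.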

You can see how it handles the special cases and it is easy to add or remove them as required. It correctly parses your example paragraph. It also correctly parses the following test paragraph (which includes more special cases):

This is sentence one. Sentence two! Sentence three? Sentence "four". Sentence "five"! Sentence "six"? Sentence "seven." Sentence 'eight!' Dr. Jones said: "Mrs. Smith you have a lovely daughter!"
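
For anyone who wants to check this without creating bs.txt, here is a minimal sketch that runs the same pattern on both paragraphs directly. The names SENTENCE_GAP, question_text and test_text are just for this example:

```python
import re

# Same pattern as in the script, written compactly and compiled once.
SENTENCE_GAP = re.compile(r"""
    (?: (?<=[.!?]) | (?<=[.!?]['"]) )   # after end punctuation, optionally quoted
    (?<! Mr\. ) (?<! Mrs\. ) (?<! Jr\. )
    (?<! Dr\. ) (?<! Prof\. ) (?<! Sr\. )
    \s+                                 # only this whitespace is consumed
    """, re.IGNORECASE | re.VERBOSE)

question_text = ('"But in the case of malaria infections and sepsis, dendritic '
                 'cells throughout the body are concentrated on alerting the '
                 'immune system, which prevents them from detecting and '
                 'responding to any new infections." A new type of vaccine?')

test_text = ('This is sentence one. Sentence two! Sentence three? Sentence "four". '
             'Sentence "five"! Sentence "six"? Sentence "seven." Sentence \'eight!\' '
             'Dr. Jones said: "Mrs. Smith you have a lovely daughter!"')

question_parts = SENTENCE_GAP.split(question_text)
test_parts = SENTENCE_GAP.split(test_text)

for part in question_parts + test_parts:
    print(part)
```

The question's paragraph comes back as two pieces and the test paragraph as nine, with "Dr." and "Mrs." left inside the last sentence. Because the lookbehinds are zero-width, only the whitespace is consumed: the closing punctuation and quotes stay attached to each sentence, so removing them (as in the desired output in the question) would need a separate step.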

Note, however, that other cases can still fail, as Riccardo Murri has correctly pointed out.
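
To give a rough idea of what such a failure looks like, here is a sketch with a sample sentence of my own (not from the question). The helper build_splitter and the extra abbreviations "U.S" and "e.g" are assumptions for illustration; it generates one negative lookbehind per abbreviation so the list is easy to extend:

```python
import re


def build_splitter(abbreviations):
    """Compile a splitter with one negative lookbehind per abbreviation."""
    # Each guard is fixed-width on its own, so any number can be chained.
    guards = "".join(r"(?<!%s\.)" % re.escape(abbr) for abbr in abbreviations)
    pattern = r"""(?:(?<=[.!?])|(?<=[.!?]['"]))""" + guards + r"\s+"
    return re.compile(pattern, re.IGNORECASE)


TITLES = ["Mr", "Mrs", "Jr", "Dr", "Prof", "Sr"]
text = "He moved to the U.S. in 1999. She stayed, e.g. in Rome. Then Mr. Smith left."

# With only the titles guarded, "U.S." and "e.g." are treated as sentence ends.
basic = build_splitter(TITLES).split(text)
# Adding them to the list keeps those sentences in one piece.
extended = build_splitter(TITLES + ["U.S", "e.g"]).split(text)

print(len(basic), basic)
print(len(extended), extended)
```

The trade-off is that a guarded abbreviation never ends a sentence, so a sentence that genuinely ends in "U.S." would then be merged with the next one. A regex alone cannot tell the two cases apart.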
