Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

regex - Getting Everything Between Two Characters Across New Lines

This is a sample of the text I am working with.

6) Jake's Taxi Service is a new entrant to the taxi industry. It has achieved success by staking out a unique position in the industry. How did Jake's Taxi Service mostly likely achieve this position?

A) providing long-distance cab fares at a higher rate than competitors; servicing a larger area than competitors

B) providing long-distance cab fares at a lower rate than competitors; servicing a smaller area than competitors

C) providing long-distance cab fares at a higher rate than competitors; servicing the same area as competitors

D) providing long-distance cab fares at a lower rate than competitors; servicing the same area as competitors

Answer: D

I am trying to match the entire question, including the answer options: everything from the question number to the word Answer.

This is my current regex:

re.compile(rf'(?<={searchCounter}\) ).*?(?=Answer).*', re.DOTALL)

searchCounter is just a variable that corresponds to the current question number, in this case 6. I think the issue has something to do with searching across the new lines.

EDIT: Full source code

import re

searchCounter = 1
bookDict = {}

with open('StratMasterKey.txt', 'rt') as myfile:
    for line in myfile:
        question_pattern = re.compile(rf'(?<={searchCounter}\) ).*?(?=Answer).*', re.DOTALL)
        result = question_pattern.search(line)
        if result != None:
            bookDict[searchCounter] = result[0]
            searchCounter += 1

1 Reply

Your regex fails because you read the file line by line with for line in myfile:, so each search sees only a single line, while your pattern is written to match across several lines of one multiline string.

Replace for line in myfile: with contents = myfile.read(), then use result = question_pattern.search(contents) to get the first match, or results = question_pattern.findall(contents) to get all non-overlapping matches as a list of strings.
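
To make the difference concrete, here is a minimal sketch that runs the pattern from the question against the same text twice, once line by line and once as a whole. The SAMPLE text is a shortened, hypothetical stand-in for StratMasterKey.txt, and io.StringIO is used only so the example runs without a file on disk.

```python
import io
import re

# Hypothetical, shortened stand-in for the contents of StratMasterKey.txt.
SAMPLE = (
    "6) How did Jake's Taxi Service most likely achieve this position?\n"
    "\n"
    "A) higher rate; larger area\n"
    "\n"
    "D) lower rate; same area\n"
    "\n"
    "Answer: D\n"
)

# The pattern from the question, with the question number fixed at 6.
pattern = re.compile(r'(?<=6\) ).*?(?=Answer).*', re.DOTALL)

# Line by line: no single line contains both "6) " and "Answer".
line_hits = [pattern.search(line) for line in io.StringIO(SAMPLE)]

# Whole text at once: the pattern can now span the line breaks.
contents = io.StringIO(SAMPLE).read()
whole_hit = pattern.search(contents)
all_hits = pattern.findall(contents)

print(any(line_hits))         # False
print(whole_hit is not None)  # True
print(len(all_hits))          # 1
```

With a real file, io.StringIO(SAMPLE) would simply be the file object returned by open().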

A note on the regex: I am not fixing the whole pattern since, as you mentioned, it is out of scope of this question. However, since the input is now a multiline string, you need to remove re.DOTALL, use [\s\S] where the pattern must match any character including line breaks, and keep . where it should match any character except a line break. Otherwise the trailing .* runs to the end of the file instead of stopping at the end of the Answer line. Also, the lookahead construct is redundant: you may safely replace (?=Answer) with Answer. Finally, to check whether there is a match, you may simply use if result: and then get the whole match value with result.group().
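
The effect of dropping re.DOTALL can be seen in a small sketch. The two-question TEXT below is a made-up sample, not taken from the original file.

```python
import re

# Hypothetical two-question sample; the text is an assumption for illustration.
TEXT = (
    "6) First question?\n"
    "A) one\n"
    "B) two\n"
    "Answer: B\n"
    "7) Second question?\n"
    "A) three\n"
    "Answer: A\n"
)

# With re.DOTALL the trailing .* also crosses line breaks, so the match
# runs to the end of the text and swallows question 7.
greedy = re.search(r'(?<=6\) ).*?Answer.*', TEXT, re.DOTALL).group()

# Without re.DOTALL, [\s\S]*? still crosses line breaks up to "Answer",
# but the trailing .* stops at the end of the "Answer: B" line.
bounded = re.search(r'(?<=6\) )[\s\S]*?Answer.*', TEXT).group()

print(repr(greedy))
print(repr(bounded))
```

Here greedy contains both questions, while bounded ends at "Answer: B".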

Full code snippet:

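import re

searchCounter = 6  # number of the question to extract, as in the sample above

with open('StratMasterKey.txt', 'rt') as myfile:
    contents = myfile.read()  # read the whole file as one multiline string

# [\s\S]*? crosses line breaks lazily; the final .* stops at the end of the Answer line
question_pattern = re.compile(rf'(?<={searchCounter}\) )[\s\S]*?Answer.*')
result = question_pattern.search(contents)
if result:
    print(result.group())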
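
If the goal is still to fill bookDict with every question, as in the code from the question, one possible approach is to keep the counter loop but search the whole contents each time. This is only a sketch under assumptions: the contents string below is a hypothetical sample, and the loop assumes questions are numbered consecutively starting at 1.

```python
import re

# Hypothetical sample standing in for the result of myfile.read().
contents = (
    "1) First question?\n"
    "A) one\n"
    "Answer: A\n"
    "2) Second question?\n"
    "A) two\n"
    "B) three\n"
    "Answer: B\n"
)

bookDict = {}
searchCounter = 1
while True:
    # Build the pattern for the current question number.
    pattern = re.compile(rf'(?<={searchCounter}\) )[\s\S]*?Answer.*')
    result = pattern.search(contents)
    if not result:
        break  # no question with this number, so stop
    bookDict[searchCounter] = result.group()
    searchCounter += 1

print(bookDict)
```

Note that the lookbehind only checks the characters directly before the match, so in a longer file "1) " would also be found inside "11) ". Tightening the pattern for that case is left out here, as in the answer above.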
