python - Regex only working on one line of a docx file

Question

posted Oct 7, 2021 in Technique by 深蓝 (71.8m points)

I am trying to read a docx file and extract the text between certain words into a list. I use regex to find every instance where the data matches, but I only get a result when the data is on a single line. I reckon it has something to do with the text being printed as a separate str after each break (no clue why this happens). Example below:

Code below

import re
from docx import Document

document = Document('myfile.docx')
lst=[]
for para in document.paragraphs:
    orig = para.text
    orig= str(orig)
    print(type(orig))
    output= re.findall(r'sent1([^(]*)sent2',orig)
    print(re.findall(r'sent1([^(]*)sent2',orig))
    lst.append(output)

How my file looks on screen:

Heading


Some data here. sent1 this is my data xyz, hello sent2.


Heading 2

Another paragraph here with spaced below.

Output when printing the type of each paragraph. It is a string, but I have no idea why it prints like this:

<class 'str'>
My data here
<class 'str'>
sent 1 and more data this space
<class 'str'>
sent2 here
sent1 example2 sent2

Desired output (list of all the characters captured between sent1 and sent2 through the document)

output=['and more data this space', 'example2']

Current output

output=['example2']

question from:https://stackoverflow.com/questions/65909963/regex-only-working-on-one-line-of-a-docx-file

1 Reply

深蓝 · Answer 1 · 2021-10-06T19:12:30+0000

Well, I'd just merge the text of all paragraphs into one big string and run the regex on that. Something like this:

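import re
from docx import Document

document = Document('myfile.docx')

# Collect the text of every paragraph, then join it into one string
fulltext = []
for para in document.paragraphs:
    fulltext.append(para.text)
fulltext = ' '.join(fulltext)

# Non-greedy capture group: returns only the text between the two markers
output = re.findall(r'sent1(.*?)sent2', fulltext)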
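To see why joining helps, here is a minimal sketch that runs without a docx file. The list `paragraph_texts` is a hypothetical stand-in for the strings that `para.text` would yield, copied from the output shown in the question (with the first marker written as `sent1`). The names `full_text`, `pattern` and `matches` are likewise just for illustration. Each paragraph is its own string, so a per-paragraph search can never match a `sent1 ... sent2` pair that is split across two paragraphs.

```python
import re

# Hypothetical stand-in for [para.text for para in document.paragraphs]
paragraph_texts = [
    "My data here",
    "sent1 and more data this space",
    "sent2 here",
    "sent1 example2 sent2",
]

# Searching each paragraph separately only finds pairs within one paragraph
per_paragraph = []
for text in paragraph_texts:
    per_paragraph.extend(m.strip() for m in re.findall(r"sent1(.*?)sent2", text))

# Joining with a space lets a match span paragraph boundaries
full_text = " ".join(paragraph_texts)

# Non-greedy (.*?) stops at the first sent2 instead of the last one
pattern = re.compile(r"sent1(.*?)sent2", re.DOTALL)
matches = [m.strip() for m in pattern.findall(full_text)]

print(per_paragraph)  # ['example2']
print(matches)        # ['and more data this space', 'example2']
```

A greedy `.*` would instead swallow everything from the first `sent1` to the last `sent2` and return a single long match, which is why the non-greedy form is used here.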
