Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - Regex only working on one line of a docx file

I am trying to read a docx file and extract the data between certain words into a list. I use regex to find every instance where the data matches, but I only get output when both words are on the same line. I reckon it has something to do with the type str being printed after each space (no clue why this happens). Example below:

Code below

import re
from docx import Document

document = Document('myfile.docx')
lst=[]
for para in document.paragraphs:
    orig = para.text
    orig= str(orig)
    print(type(orig))
    output= re.findall(r'sent1([^(]*)sent2',orig)
    print(re.findall(r'sent1([^(]*)sent2',orig))
    lst.append(output)

Output of my file on screen:

Heading


Some data here. sent1 this is my data xyz, hello sent2.


Heading 2

Another paragraph here with spaced below.

Output of my file when showing the type. It's a string; I have no idea why it's printing like this:

<class 'str'>
My data here
<class 'str'>
sent 1 and more data this space
<class 'str'>
sent2 here
sent1 example2 sent2

Desired output (a list of all the characters captured between sent1 and sent2 throughout the document):

output=['and more data this space', 'example2']

Current output

output=['example2']
Question from: https://stackoverflow.com/questions/65909963/regex-only-working-on-one-line-of-a-docx-file

1 Reply

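Each entry in document.paragraphs is a separate string, so a pattern whose two markers sit in different paragraphs never matches. I'd just merge everything into one giant string and run the regex on that, e.g. something like this: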

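import re
from docx import Document

document = Document('myfile.docx')

# Collect the text of every paragraph, then join it into one string
# so that a match can span paragraph boundaries.
fulltext = []
for para in document.paragraphs:
    fulltext.append(para.text)
fulltext = ' '.join(fulltext)

# Non-greedy capture group: returns only the text between the two markers.
output = re.findall(r'sent1\s*(.*?)\s*sent2', fulltext, flags=re.DOTALL)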
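
To see why joining helps without needing a .docx file, here is a minimal sketch. It assumes the paragraph texts look roughly like the printed output in the question; the paragraph_texts list and the extract_between helper are hypothetical names made up for this illustration, not part of python-docx. It compares matching paragraph by paragraph with matching on the joined string.

```python
import re

# Hypothetical stand-in for [para.text for para in document.paragraphs];
# the first match is split across two paragraphs, as in the question.
paragraph_texts = [
    "My data here",
    "sent1 and more data this space",
    "sent2 here",
    "sent1 example2 sent2",
]

# Non-greedy capture between the markers; \s* trims surrounding whitespace.
PATTERN = re.compile(r"sent1\s*(.*?)\s*sent2", re.DOTALL)


def extract_between(text):
    """Return every substring found between 'sent1' and 'sent2' in text."""
    return PATTERN.findall(text)


# Matching paragraph by paragraph misses the match that spans two paragraphs.
per_paragraph = []
for text in paragraph_texts:
    per_paragraph.extend(extract_between(text))

# Joining first lets a match cross paragraph boundaries.
joined = extract_between(" ".join(paragraph_texts))

print(per_paragraph)  # ['example2']
print(joined)         # ['and more data this space', 'example2']
```

The non-greedy (.*?) matters once the text is joined: a greedy .* would run from the first sent1 to the last sent2 and return one long match instead of two.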
