I am trying to read a .docx file and extract the text between certain words into a list. I use a regex to find every instance where the text matches, but I only get output when both words are on the same line. I reckon it has something to do with the str type being printed after each line break (I have no clue why this happens). Example below.
Code below
import re
from docx import Document

document = Document('myfile.docx')
lst = []
for para in document.paragraphs:
    orig = para.text
    orig = str(orig)
    print(type(orig))
    output = re.findall(r'sent1([^(]*)sent2', orig)
    print(output)
    lst.append(output)
Contents of my file as they appear on screen:
Heading
Some data here. sent1 this is my data xyz, hello sent2.
Heading 2
Another paragraph here with spaced below.
Output when printing the type. Each value is a string, and I have no idea why it prints like this:
<class 'str'>
My data here
<class 'str'>
sent1 and more data this space
<class 'str'>
sent2 here
sent1 example2 sent2
Desired output (a list of all the text captured between sent1 and sent2 throughout the document):
output=['and more data this space', 'example2']
Current output
output=['example2']
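To make the goal concrete, here is a rough sketch of one possible approach, not a confirmed fix. It assumes the problem is that python-docx returns each line break as a separate paragraph, so a regex run on one paragraph at a time never sees sent1 and sent2 together. The helper names extract_between and extract_from_docx are made up for illustration, and the sample paragraph list is an assumption modelled on the printed output above.

```python
import re


def extract_between(paragraph_texts, start="sent1", end="sent2"):
    """Return the text between each start/end marker pair (hypothetical helper)."""
    # Join all paragraphs so a match can span paragraph boundaries.
    full_text = "\n".join(paragraph_texts)
    # Non-greedy capture; re.DOTALL lets '.' match the newlines added by the join.
    pattern = re.escape(start) + r"(.*?)" + re.escape(end)
    matches = re.findall(pattern, full_text, flags=re.DOTALL)
    # Collapse runs of whitespace (including joined newlines) and trim the ends.
    return [" ".join(m.split()) for m in matches]


def extract_from_docx(path):
    """Read paragraph text with python-docx and extract the marked spans."""
    # Imported here so extract_between works without python-docx installed.
    from docx import Document

    document = Document(path)
    return extract_between(para.text for para in document.paragraphs)


# Assumed sample data, standing in for document.paragraphs text.
paragraphs = [
    "My data here",
    "sent1 and more data this space",
    "sent2 here",
    "sent1 example2 sent2",
]
print(extract_between(paragraphs))
```

With a real file this would be called as extract_from_docx('myfile.docx'). Joining first and searching once yields a single flat list, whereas appending one findall result per paragraph produces a list of lists.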
question from:
https://stackoverflow.com/questions/65909963/regex-only-working-on-one-line-of-a-docx-file