Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

How to extract multiple instances of a word from PDF files in Python?

I'm writing a Python script to read a PDF file and record the string that appears after every instance of "time", along with the page number it appears on.

I have gotten it to recognize when a page contains the string "time" and print the page number. However, if a page contains "time" more than once, it does not tell me. I assume this is because the check is already satisfied by the first occurrence, so the script moves on to the next page.

How would I go about finding multiple instances of the word "time"?

This is my code:

import PyPDF2

def pdf_read():
    pdfFile = "recordsdocument.pdf"
    
    pdf = PyPDF2.PdfFileReader(pdfFile)
    pageCount = pdf.getNumPages()
    
    for pageNumber in range(pageCount):
        page = pdf.getPage(pageNumber)
        pageContent = page.extractText()   
        if "Time" in pageContent or "time" in pageContent:
            print(pageNumber)

As a side note, this PDF is a scanned document, so when I read the text in Python (or copy and paste it into Word), a lot of words come out with random symbols and characters even though the document is perfectly legible. Is this a limitation of ordinary programming, or would reading the file accurately require more complex techniques such as machine learning?

Question from: https://stackoverflow.com/questions/65851174/how-to-extract-multiple-instances-of-a-word-from-pdf-files-on-python

1 Reply

One solution is to split pageContent into a list of words and count how often 'time' appears in that list. Working with a list also makes it easy to get the word that follows 'time', since you can simply retrieve the next item:

import PyPDF2
import string

# Uses the legacy PyPDF2 names from the question; PyPDF2 3.x renames them
# (PdfReader, reader.pages, page.extract_text()).
pdfFile = "recordsdocument.pdf"

pdf = PyPDF2.PdfFileReader(pdfFile)
pageCount = pdf.getNumPages()

for pageNumber in range(pageCount):
    page = pdf.getPage(pageNumber)
    pageContent = page.extractText()
    # split on any whitespace (newlines included) to get a list of words
    words = pageContent.split()
    # lowercase each word and remove ASCII punctuation
    words = ["".join(c for c in w.lower() if c not in string.punctuation) for w in words]
    # drop entries that were punctuation only and are now empty
    words = [w for w in words if w]

    # zero-based page number and the number of occurrences of 'time' on that page
    print(pageNumber, words.count('time'))
    # each 'time' paired with the word that follows it ('' at the end of the page)
    print([(w, words[i+1] if i+1 < len(words) else '') for i, w in enumerate(words) if w == 'time'])

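Note that this example lowercases every word and strips ASCII punctuation from it, so "Time:" and "time" are counted together. This may clean up some of the stray symbols from the bad OCR, but it will not repair characters that were recognized incorrectly.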
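If you would rather not split the text into a list first, a regular expression can do the same job. The following is only a sketch: find_time_mentions is a hypothetical helper (not part of PyPDF2) that assumes you have already collected the extracted text of each page into a list of strings. It matches "time" as a whole word, ignoring case, and then looks for the next run of word characters after each match.

```python
import re

# Whole-word, case-insensitive match, so "sometimes" is not counted.
TIME_RE = re.compile(r"\btime\b", re.IGNORECASE)
# Any run of letters, digits, or underscores counts as a "word" here.
WORD_RE = re.compile(r"\w+")


def find_time_mentions(page_texts):
    """Return (page_index, following_word) for every 'time' in each page's text.

    page_texts is assumed to be a list with one extracted-text string per page.
    page_index is zero-based; following_word is '' when nothing follows.
    """
    mentions = []
    for page_index, text in enumerate(page_texts):
        for match in TIME_RE.finditer(text):
            # search for the next word starting right after this match
            following = WORD_RE.search(text, match.end())
            mentions.append((page_index, following.group(0) if following else ""))
    return mentions


if __name__ == "__main__":
    sample_pages = ["Time: 10:30 start. The time was late", "nothing here"]
    for page_index, word in find_time_mentions(sample_pages):
        print(page_index, word)
```

Because each match is looked up independently, two consecutive occurrences ("time time") are both reported. The page indices are zero-based, so add 1 if you want them to line up with the page numbers shown in a PDF viewer.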

