Python 3.x - How can I read a PDF file from inline raw_bytes (not from file)?

posted Oct 24, 2021 in Technique by 深蓝

I am trying to build a PDF scraper for the Australian Stock Exchange website that lets me go through all the 'Announcements' made by companies and search for keywords in the PDFs of those announcements.

So far I am using requests and PyPDF2 to download the PDF file, write it to my drive and then read it back. However, I want to skip the step of writing the PDF to my drive and go straight from the downloaded bytes to a string. What I have so far is:

import requests, PyPDF2

url = 'http://www.asx.com.au/asxpdf/20171108/pdf/43p1l61zf2yct8.pdf'
response = requests.get(url)
my_raw_data = response.content

# Write the downloaded bytes to disk (the step I would like to skip).
with open("my_pdf.pdf", 'wb') as my_data:
    my_data.write(my_raw_data)

# Re-open the saved file and extract the text page by page.
with open("my_pdf.pdf", 'rb') as open_pdf_file:
    read_pdf = PyPDF2.PdfFileReader(open_pdf_file)
    # Decrypt once, before touching any pages.
    if read_pdf.isEncrypted:
        read_pdf.decrypt("")
    num_pages = read_pdf.getNumPages()

    ann_text = []
    for page_num in range(num_pages):
        page_text = read_pdf.getPage(page_num).extractText()
        print(page_text)
        # Collect the words of every page, encrypted or not.
        ann_text.append(page_text.split())
print(ann_text)

This prints a list of strings in the PDF file from the url provided.

Just wondering if I can convert the my_raw_data variable to a readable string?

Thanks so much in advance!

1 Reply

深蓝 · Answer 1 · 2021-10-23T17:59:11+0000

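You can use io.BytesIO to wrap the downloaded bytes in an in-memory, file-like object and pass that to PyPDF2 instead of a file on disk: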

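import requests, PyPDF2, io

url = 'http://www.asx.com.au/asxpdf/20171108/pdf/43p1l61zf2yct8.pdf'
response = requests.get(url)

# Wrap the raw bytes in an in-memory stream; nothing is written to disk.
with io.BytesIO(response.content) as open_pdf_file:
    # PdfFileReader is the legacy PyPDF2 name; PyPDF2 3.x and pypdf call it PdfReader.
    read_pdf = PyPDF2.PdfFileReader(open_pdf_file)
    num_pages = read_pdf.getNumPages()
    print(num_pages)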

PS. To open files, always use a context manager (a with-statement), so the file is closed even if an error occurs.
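
To see why this works, here is a small standard-library sketch (no network, no PyPDF2 needed). It shows that an io.BytesIO object supports the same read() and seek() calls as a file opened in 'rb' mode, which is what a PDF reader relies on. The helper name looks_like_pdf and the stand-in bytes are made up for illustration; in the real script the bytes would come from response.content.

```python
import io


def looks_like_pdf(stream):
    """Return True if a seekable binary stream starts with the PDF magic bytes.

    Hypothetical helper for illustration; works on real files opened in 'rb'
    mode and on io.BytesIO objects alike.
    """
    stream.seek(0)            # jump to the start, exactly as with a file on disk
    header = stream.read(5)   # PDF files begin with b"%PDF-"
    stream.seek(0)            # rewind so a later reader starts at the beginning
    return header == b"%PDF-"


# Stand-in for response.content (assumption: not a real, parseable PDF).
raw_bytes = b"%PDF-1.4\n% stand-in bytes, not a real document\n"

with io.BytesIO(raw_bytes) as buf:
    print(looks_like_pdf(buf))  # prints True
```

A check like this can also be handy before parsing, since a failed download may return an HTML error page rather than a PDF.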
