Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
353 views
in Technique[技术] by (71.8m points)

python - pyPdf unable to extract text from some pages in my PDF

I'm trying to use pyPdf to extract and print pages from a multipage PDF. Problem is, text is not extracted from some pages. I've put an example file here:

http://www.4shared.com/document/kmJF67E4/forms.html

If you run the following, the first 81 pages return no text, while the final 11 extract properly. Can anyone help?

from pyPdf import PdfFileReader  
input = PdfFileReader(file("forms.pdf", "rb"))  
for page in input1.pages:  
    print page.extractText()  
See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Note that extractText() still has problems extracting the text properly. From the documentation for extractText():

This works well for some PDF files, but poorly for others, depending on the generator used. This will be refined in the future. Do not rely on the order of text coming out of this function, as it will change if this function is made more sophisticated.

Since it is the text you want, you can use the Linux command pdftotext.

To invoke that using Python, you can do this:

>>> import subprocess
>>> subprocess.call(['pdftotext', 'forms.pdf', 'output'])

The text is extracted from forms.pdf and saved to output.

This works in the case of your PDF file and extracts the text you want.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...