Note that extractText()
still has problems extracting the text properly. From the documentation for extractText()
:
This works well for some PDF files,
but poorly for others, depending on
the generator used. This will be
refined in the future. Do not rely on
the order of text coming out of this
function, as it will change if this
function is made more sophisticated.
Since it is the text you want, you can use the Linux command pdftotext
.
To invoke that using Python, you can do this:
>>> import subprocess
>>> subprocess.call(['pdftotext', 'forms.pdf', 'output'])
The text is extracted from forms.pdf
and saved to output
.
This works in the case of your PDF file and extracts the text you want.
与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…