python - Parsing a PDF with no /Root object using PDFMiner

posted Oct 24, 2021 in Technique by 深蓝 (71.8m points)

I'm trying to extract text from a large number of PDFs using the PDFMiner Python bindings. The module I wrote works for many PDFs, but I get this somewhat cryptic error for a subset of them:

IPython stack trace:

/usr/lib/python2.7/dist-packages/pdfminer/pdfparser.pyc in set_parser(self, parser)
    331                 break
    332         else:
--> 333             raise PDFSyntaxError('No /Root object! - Is this really a PDF?')
    334         if self.catalog.get('Type') is not LITERAL_CATALOG:
    335             if STRICT:

PDFSyntaxError: No /Root object! - Is this really a PDF?

Of course, I immediately checked to see whether or not these PDFs were corrupted, but they can be read just fine.

Is there any way to read these PDFs despite the absence of a root object? I'm not too sure where to go from here.

Many thanks!

Edit:

I tried using PyPDF in an attempt to get some differential diagnostics. The stack trace is below:

In [50]: pdf = pyPdf.PdfFileReader(file(fail, "rb"))
---------------------------------------------------------------------------
PdfReadError                              Traceback (most recent call last)
/home/louist/Desktop/pdfs/indir/<ipython-input-50-b7171105c81f> in <module>()
----> 1 pdf = pyPdf.PdfFileReader(file(fail, "rb"))

/usr/lib/pymodules/python2.7/pyPdf/pdf.pyc in __init__(self, stream)
    372         self.flattenedPages = None
    373         self.resolvedObjects = {}
--> 374         self.read(stream)
    375         self.stream = stream
    376         self._override_encryption = False

/usr/lib/pymodules/python2.7/pyPdf/pdf.pyc in read(self, stream)
    708             line = self.readNextEndLine(stream)
    709         if line[:5] != "%%EOF":
--> 710             raise utils.PdfReadError, "EOF marker not found"
    711 
    712         # find startxref entry - the location of the xref table


PdfReadError: EOF marker not found

Quonux suggested that perhaps PDFMiner stopped parsing after reaching the first EOF character. The PyPDF error above would seem to suggest otherwise, but I'm very much clueless. Any thoughts?
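
One way to narrow this down without relying on either library is to look at the raw bytes directly. The following is a minimal sketch, not part of PDFMiner or PyPDF: the function name `pdf_marker_report` and the `tail_size` parameter are hypothetical, and it assumes that a well-formed file has `%PDF-` near the start and `%%EOF` near the end. It only reports what it finds; it does not repair anything.

```python
import os


def pdf_marker_report(path, tail_size=1024):
    """Report whether the raw bytes carry the usual PDF start and end markers.

    Hypothetical helper for diagnosis only; tail_size is an arbitrary choice.
    """
    with open(path, "rb") as fp:  # binary mode, so bytes are read untouched
        head = fp.read(1024)
        fp.seek(0, os.SEEK_END)
        size = fp.tell()
        fp.seek(max(0, size - tail_size))
        tail = fp.read()

    eof_at = tail.rfind(b"%%EOF")
    return {
        "size": size,
        "has_header": b"%PDF-" in head,
        "has_eof_marker": eof_at != -1,
        # Number of bytes that follow the last %%EOF, or None if it is absent.
        "bytes_after_eof": None if eof_at == -1 else len(tail) - (eof_at + 5),
    }
```

Running this over the failing files and comparing the result with a file that parses correctly may show whether the end of the file is truncated or has extra data appended after the marker.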

1 Reply

深蓝 · Answer 1 · 2021-10-23T19:16:33+0000

With slate, the solution is to open the file in 'rb' (read binary) mode.

slate depends on PDFMiner, and I had the same problem, so this should solve yours too.

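import slate

# 'rb' opens the file in binary mode so the PDF bytes are read unchanged.
with open(r'C:\Users\USER\workspace\slate_minner\document1.pdf', 'rb') as fp:
    doc = slate.PDF(fp)
print(doc)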
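
A likely reason binary mode matters is that a PDF locates its objects by byte offset, so any newline translation applied in text mode shifts everything that follows. The sketch below illustrates the effect under stated assumptions: it uses Python 3, where text mode translates `\r\n` to `\n` on every platform (under Python 2 this happens on Windows), and the `payload` bytes are a made-up stand-in rather than a real PDF.

```python
import os
import tempfile

# Hypothetical stand-in bytes, not a valid PDF; what matters is the CRLF pairs.
payload = b"%PDF-1.4\r\nstream data\r\n%%EOF"

fd, path = tempfile.mkstemp(suffix=".pdf")
with os.fdopen(fd, "wb") as fp:
    fp.write(payload)

# Binary mode returns the bytes exactly as stored on disk.
with open(path, "rb") as fp:
    raw = fp.read()

# Text mode decodes the bytes and rewrites each "\r\n" as "\n".
with open(path, "r", encoding="latin-1") as fp:
    text = fp.read()

os.remove(path)

print("binary length:", len(raw))
print("text length:  ", len(text))
```

The two lengths differ by one for each `\r\n` pair in the file, which is enough to invalidate any byte offset a parser tries to follow.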
