python - Converting PDF to images automatically

Question

Welcome To Ask or Share your Answers For Others

python - Converting PDF to images automatically

posted Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

python - Converting PDF to images automatically

So the state I'm in released a bunch of data in PDF form, but to make matters worse, most (all?) of the PDFs appear to be letters typed in Office, printed/fax, and then scanned (our government at its best eh?). At first I thought I was crazy, but then I started seeing numerous pdfs that are 'tilted', like someone didn't get them on the scanner properly. So, I figured the next best thing to getting the actual text out of them, would be to turn each page into an image.

Obviously this needs to be automated, and I'd prefer to stick with Python if possible. If Ruby or Perl have some form of implementation that's just too awesome to pass up, I can go that route. I've tried pyPDF for text extraction, that obviously didn't do me much good. I've tried swftools, but the images I'm getting from that are just shy of completely unusable. It just seems like the fonts get ruined in the conversion. I also don't even really care about the image format on the way out, just as long as they're relatively lightweight, and readable.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-23T17:45:34+0000

If the PDFs are truly scanned images, then you shouldn't convert the PDF to an image, you should extract the image from the PDF. Most likely, all of the data in the PDF is essentially one giant image, wrapped in PDF verbosity to make it readable in Acrobat.

You should try the simple expedient of simply finding the image in the PDF, and copying the bytes out: Extracting JPGs from PDFs. The code there is dead simple, and there are probably dozens of reasons it won't work on your PDF files. But if it does, you'll have a quick and painless way to get the image data out of the PDF files.

Categories

python - Converting PDF to images automatically

python - Converting PDF to images automatically

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags