I am trying to extract text from a foreign-language PDF file using PDFMiner, but I am being foiled by a ToUnicode mapping. The file behaves strangely even in ordinary PDF viewers.
For example, here is a screenshot from some text in the file:
But if I select and copy the text, it looks like this:
?????
You can see several characters have changed, in particular the second-to-last character.
Not surprisingly, PDFMiner extracts the same incorrect text, even though every PDF viewer manages to display the text correctly. I suspect the issue is either the ToUnicode map or something to do with conjunct characters. The desired letter should be the sequence 0x915, 0x94D, 0x937 (Devanagari KA, VIRAMA, SSA). PDFMiner reports only 0x915 (KA alone), which is a different character.
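To make the mismatch concrete, here is a minimal sketch in plain Python (standard library only, no PDFMiner calls) that prints the code points of the sequence I expect next to the single code point I actually get. The names `describe`, `expected`, and `extracted` are my own, chosen for illustration.

```python
import unicodedata


def describe(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


# The conjunct I expect: KA + VIRAMA + SSA
expected = "\u0915\u094D\u0937"
# What PDFMiner (and copy-paste) gives me for the same glyph: KA only
extracted = "\u0915"

for label, text in (("expected", expected), ("extracted", extracted)):
    print(label)
    for code_point, name in describe(text):
        print(f"  {code_point}  {name}")
```

Running the same `describe` helper over PDFMiner's output should make it easy to spot every glyph where a multi-code-point cluster has collapsed to its first character.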
What do I need to do to get PDFMiner to extract text correctly, i.e. as in the image rather than the copy-pasted text?
Here is a link to the PDF in question.