Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share

java - PDF find out if text is underlined or a table cell

I have been experimenting with PDFBox and its PDFTextStripperByArea class.

I was able to extract information if the text is bold or italic, but I'm unable to get the underline information.

As far as I understand it, underlining in a PDF is done by drawing lines. So in theory I should be able to get some sort of information about lines located near the text. Given this information, I could then find out whether the text is underlined or sits inside a table.
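To make that idea concrete, here is a rough sketch of the proximity test, assuming the horizontal lines on the page have already been extracted somehow. Everything in it is hypothetical: `UnderlineHeuristic`, `TextBox`, `HorizontalLine` and the two tolerance factors are made up for illustration and are not part of PDFBox. Coordinates are assumed to share one system with y growing upward.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper; none of these types come from PDFBox. */
final class UnderlineHeuristic {

    /** Horizontal extent and baseline of a run of text (y grows upward). */
    static final class TextBox {
        final double left, right, baseline, fontSize;
        TextBox(double left, double right, double baseline, double fontSize) {
            this.left = left; this.right = right;
            this.baseline = baseline; this.fontSize = fontSize;
        }
    }

    /** A horizontal line that was drawn somewhere on the same page. */
    static final class HorizontalLine {
        final double x1, x2, y;
        HorizontalLine(double x1, double x2, double y) {
            this.x1 = x1; this.x2 = x2; this.y = y;
        }
    }

    /**
     * Returns true if some line lies just below the baseline and spans the
     * whole text run. Both tolerances are guesses that would need tuning.
     */
    static boolean isUnderlined(TextBox text, List<HorizontalLine> lines) {
        double maxDrop = 0.25 * text.fontSize;  // assumed: how far below the baseline
        double slack = 0.10 * text.fontSize;    // assumed: horizontal tolerance
        for (HorizontalLine line : lines) {
            double drop = text.baseline - line.y;
            boolean justBelow = drop >= 0 && drop <= maxDrop;
            boolean spansText = Math.min(line.x1, line.x2) <= text.left + slack
                    && Math.max(line.x1, line.x2) >= text.right - slack;
            if (justBelow && spansText) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        TextBox word = new TextBox(100, 200, 500, 12);
        List<HorizontalLine> lines = new ArrayList<>();
        lines.add(new HorizontalLine(100, 200, 498.5));  // 1.5 units below the baseline
        System.out.println(isUnderlined(word, lines));
    }
}
```

A bottom border of a table cell can pass this test as well, so telling the two apart would probably also need the vertical lines around the text.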

Here is my code so far:

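// Package names below follow the PDFBox 1.x layout.
import java.util.List;
import org.apache.pdfbox.pdmodel.font.PDFontDescriptor;
import org.apache.pdfbox.util.TextPosition;

List<TextPosition> textPos = charactersByArticle.get(index);

for (TextPosition t : textPos)
{
    // Look the descriptor up once per character; it may be null.
    PDFontDescriptor descriptor = t.getFont().getFontDescriptor();
    if (descriptor != null)
    {
        // BOLD_WEIGHT is my own threshold constant.
        if (descriptor.getFontWeight() > BOLD_WEIGHT || descriptor.isForceBold())
        {
            isBold = true;
        }

        if (descriptor.isItalic())
        {
            isItalic = true;
        }
    }
}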

I have also looked at the PDGraphicsState object, which is processed in the processEncodedText method of the PDFStreamEngine class, but I found no information about lines there.

Any suggestions on where this information could be retrieved from?

question from:https://stackoverflow.com/questions/13948853/pdf-find-out-if-text-is-underlined-or-a-table-cell


1 Reply


Here is what I have found out so far:

PDFBox uses a resource file to bind PDF operators/instructions to certain classes, which then process the information.

If we take a look at the PDFTextStripper.properties resource file under:

pdfbox\src\main\resources\org\apache\pdfbox\resources

we can see that for instance the BT operator is bound to the org.apache.pdfbox.util.operator.BeginText class and so on.
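The general mechanism can be illustrated with a small, self-contained sketch. This is not PDFBox's actual loader: `OperatorRegistrySketch`, `OperatorHandler` and the two handler classes are hypothetical stand-ins, and the mapping text is built in code instead of being read from a real resource file. It only shows how an operator-to-class mapping in properties format can be turned into handler objects, and how an operator without an entry ends up ignored.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

/** Hypothetical illustration of an operator-to-class mapping; not PDFBox code. */
final class OperatorRegistrySketch {

    interface OperatorHandler {
        String handle(String operator);
    }

    static final class BeginTextHandler implements OperatorHandler {
        public String handle(String operator) { return "begin text object"; }
    }

    static final class EndTextHandler implements OperatorHandler {
        public String handle(String operator) { return "end text object"; }
    }

    /** Parses "operator=className" lines and instantiates one handler per entry. */
    static Map<String, OperatorHandler> load(String propertiesText)
            throws IOException, ReflectiveOperationException {
        Properties props = new Properties();
        props.load(new StringReader(propertiesText));
        Map<String, OperatorHandler> handlers = new HashMap<>();
        for (String operator : props.stringPropertyNames()) {
            Class<?> type = Class.forName(props.getProperty(operator));
            handlers.put(operator,
                    (OperatorHandler) type.getDeclaredConstructor().newInstance());
        }
        return handlers;
    }

    /** Stand-in for the content of a mapping file: only text operators are listed. */
    static String sampleMapping() {
        return "BT=" + BeginTextHandler.class.getName() + "\n"
             + "ET=" + EndTextHandler.class.getName() + "\n";
    }

    public static void main(String[] args) throws Exception {
        Map<String, OperatorHandler> handlers = load(sampleMapping());
        // "re" (rectangle) has no entry in the mapping, so it is skipped.
        for (String op : new String[] {"BT", "re", "ET"}) {
            OperatorHandler handler = handlers.get(op);
            System.out.println(op + " -> "
                    + (handler == null ? "ignored" : handler.handle(op)));
        }
    }
}
```

With a mapping like this, which operators are processed depends entirely on which lines the file contains.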

The PDFTextStripper class under

pdfbox\src\main\java\org\apache\pdfbox\util

takes this mapping into account and processes the PDF with these classes.

BUT all graphical objects are ignored, therefore no information of underline or table structure!

Now if we take a look at the PageDrawer.properties resource file, we can see that it binds almost all available operators. This mapping is used by the PageDrawer class under

pdfbox\src\main\java\org\apache\pdfbox\pdfviewer

The "trick" is now to find out which graphical operators draw underlines and table lines, and to use them in combination with PDFTextStripper.

That would mean reading the PDF specification, which is currently way too much work.

If someone knows which operators are responsible for which actions to draw underlines and table lines please let me know.
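As a starting point: in the PDF specification, straight lines are built with the path construction operators m (move to), l (line to) and re (rectangle), and only become visible through a painting operator such as S or s (stroke), f, F or f* (fill) and B, B*, b or b* (fill and stroke); n ends a path without painting it. Underlines and table rules typically appear either as stroked lines or as thin filled rectangles. The sketch below is a toy scanner, not PDFBox code: `PathOperatorSketch` and `Segment` are hypothetical, it assumes a stream of whitespace-separated numbers and operators only (no strings, arrays or inline images), and it ignores curves, the closing segment added by s and b, and any coordinate transformation set with cm.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical toy scanner for straight-line path operators; not PDFBox code. */
final class PathOperatorSketch {

    /** A straight segment in the coordinates given in the stream. */
    static final class Segment {
        final double x1, y1, x2, y2;
        Segment(double x1, double y1, double x2, double y2) {
            this.x1 = x1; this.y1 = y1; this.x2 = x2; this.y2 = y2;
        }
    }

    /** Collects the segments of paths that are actually stroked or filled. */
    static List<Segment> paintedSegments(String contentStream) {
        List<Segment> painted = new ArrayList<>();
        List<Segment> pending = new ArrayList<>();  // current path, not painted yet
        List<Double> operands = new ArrayList<>();
        double curX = 0, curY = 0;

        for (String token : contentStream.trim().split("\\s+")) {
            try {
                operands.add(Double.parseDouble(token));
                continue;                               // numeric operand
            } catch (NumberFormatException e) {
                // not a number, so the token is treated as an operator
            }
            int n = operands.size();
            if (token.equals("m") && n >= 2) {          // move to
                curX = operands.get(n - 2);
                curY = operands.get(n - 1);
            } else if (token.equals("l") && n >= 2) {   // line to
                double x = operands.get(n - 2), y = operands.get(n - 1);
                pending.add(new Segment(curX, curY, x, y));
                curX = x;
                curY = y;
            } else if (token.equals("re") && n >= 4) {  // rectangle: x y width height
                double x = operands.get(n - 4), y = operands.get(n - 3);
                double w = operands.get(n - 2), h = operands.get(n - 1);
                pending.add(new Segment(x, y, x + w, y));
                pending.add(new Segment(x + w, y, x + w, y + h));
                pending.add(new Segment(x + w, y + h, x, y + h));
                pending.add(new Segment(x, y + h, x, y));
                curX = x;
                curY = y;
            } else if (isPaintOperator(token)) {        // path becomes visible
                painted.addAll(pending);
                pending.clear();
            } else if (token.equals("n")) {             // path discarded (e.g. clipping)
                pending.clear();
            }
            operands.clear();                           // operands belong to one operator
        }
        return painted;
    }

    static boolean isPaintOperator(String op) {
        switch (op) {
            case "S": case "s":
            case "f": case "F": case "f*":
            case "B": case "B*": case "b": case "b*":
                return true;
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        String stream = "100 498.5 m 200 498.5 l S 50 50 10 20 re n";
        for (Segment s : paintedSegments(stream)) {
            System.out.println(s.x1 + "," + s.y1 + " -> " + s.x2 + "," + s.y2);
        }
    }
}
```

In a real document the same operators would have to be handled inside the PDF library's stream processing, with the current transformation matrix applied, before the segments can be compared with text positions.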

