Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

pdf - How to detect artificial bold, artificial italic, and artificial outline text styles using PDFBox

I am using PDFBox to validate a PDF document. One requirement is to check whether the following kinds of text are present in a PDF:

  • Artificial bold text
  • Artificial italic text
  • Artificial outline text

I searched the PDFBox API but could not find anything that does this.

Can anyone tell me how to detect these artificial font/text styles in a PDF using PDFBox?


1 Reply


The general procedure and a PDFBox issue

In theory one should start this by deriving a class from PDFTextStripper and overriding its method:

/**
 * Write a Java string to the output stream. The default implementation will ignore the <code>textPositions</code>
 * and just calls {@link #writeString(String)}.
 *
 * @param text The text to write to the stream.
 * @param textPositions The TextPositions belonging to the text.
 * @throws IOException If there is an error when writing the text.
 */
protected void writeString(String text, List<TextPosition> textPositions) throws IOException
{
    writeString(text);
}

Your override should then use List<TextPosition> textPositions instead of the String text; each TextPosition essentially represents a single letter together with information on the graphics state that was active when that letter was drawn.

Unfortunately, in the current version 1.8.3 the textPositions list does not contain the correct contents. E.g. for the line "This is normal text." from your PDF, the method writeString is called four times, once each for the strings "This", " is", " normal", and " text." Each time, however, the textPositions list contains the TextPosition instances for the letters of the last string, " text."

This turned out to be already known as PDFBox issue PDFBOX-1804, which has meanwhile been resolved as fixed for versions 1.8.4 and 2.0.0.

That being said, as soon as you have a PDFBox version with this fix, you can check for some artificial styles as follows:

Artificial italic text

This text style is created like this in the page content:

BT
/F0 1 Tf
24 0 5.10137 24 66 695.5877 Tm
0 Tr
[<03>]TJ
...

The relevant part is the text matrix set by Tm. Its third operand, 5.10137, is the shear component: relative to the vertical scale of 24 it slants the glyphs by roughly 12 degrees.

When you check a TextPosition textPosition as indicated above, you can query this value using

textPosition.getTextPos().getValue(1, 0)

If this value is significantly greater than 0.0, you have artificial italics. If it is significantly less than 0.0, you have artificial backwards italics.
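
A minimal sketch of that test, assuming the shear value and the vertical scale have already been read from the text matrix. The class ShearCheck, its method names, and the tolerance of 0.01 are illustrative assumptions and not part of PDFBox. Dividing the shear by the vertical scale gives a value that does not depend on the font size.

```java
/** Hypothetical helper for illustration; not part of PDFBox. */
final class ShearCheck {
    enum Slant { NONE, ITALIC, BACKWARD_ITALIC }

    /** Assumed tolerance: relative shears below this are treated as rounding noise. */
    static final double EPSILON = 0.01;

    /**
     * @param shear  matrix entry (1, 0), e.g. 5.10137 in the example above
     * @param yScale matrix entry (1, 1), e.g. 24 in the example above
     */
    static Slant classify(double shear, double yScale) {
        if (yScale == 0.0) {
            return Slant.NONE; // degenerate matrix, nothing sensible to report
        }
        double relative = shear / yScale;
        if (relative > EPSILON) {
            return Slant.ITALIC;
        }
        if (relative < -EPSILON) {
            return Slant.BACKWARD_ITALIC;
        }
        return Slant.NONE;
    }

    /** Slant angle in degrees, measured from the vertical. */
    static double slantDegrees(double shear, double yScale) {
        return Math.toDegrees(Math.atan2(shear, yScale));
    }

    public static void main(String[] args) {
        System.out.println(classify(5.10137, 24)); // values from the Tm operator above
        System.out.println(classify(0.0, 24));     // upright text
        System.out.println(slantDegrees(5.10137, 24));
    }
}
```

The sketch assumes upright, unrotated text; for rotated or mirrored text the matrix entries would have to be normalized first.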

Artificial bold or outline text

These artificial styles print each letter twice using different rendering modes; e.g. the capital 'T', in the case of bold:

0 0 0 1 k
...
BT
/F0 1 Tf 
24 0 0 24 66.36 729.86 Tm 
<03>Tj 
4 M 0.72 w 
0 0 Td 
1 Tr 
0 0 0 1 K
<03>Tj
ET

(i.e. first drawing the letter in regular mode, filling the letter area, and then drawing it in outline mode, drawing a line along the letter border, both in black, CMYK 0, 0, 0, 1; this leaves the impression of a thicker letter.)

and in case of outline:

BT
/F0 1 Tf
24 0 0 24 66 661.75 Tm
0 0 0 0 k
<03>Tj
/GS1 gs
4 M 0.288 w 
0 0 Td
1 Tr
0 0 0 1 K
<03>Tj
ET

(i.e. first drawing the letter in regular mode white, CMYK 0, 0, 0, 0, filling the letter area, and then drawing it in outline mode, drawing a line along the letter border, in black, CMYK 0, 0, 0, 1; this leaves the impression of an outlined black on white letter.)
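
The two sequences above differ only in the fill color of the first pass, so once the fill color of the first pass and the stroke color of the second pass are known, the two styles can be told apart by comparing them. The following is a minimal sketch, assuming the colors are available as arrays of CMYK components. The class DoublePrintCheck and its names are hypothetical, and the rule "same color means bold, different color means outline" is a heuristic derived from these two examples only.

```java
/** Hypothetical helper for illustration; not part of PDFBox. */
final class DoublePrintCheck {
    enum Style { BOLD, OUTLINE }

    /** Assumed tolerance for comparing color components. */
    static final float COLOR_TOLERANCE = 0.001f;

    /** Compares two color component arrays with a small tolerance. */
    static boolean sameColor(float[] a, float[] b) {
        if (a.length != b.length) {
            return false;
        }
        for (int i = 0; i < a.length; i++) {
            if (Math.abs(a[i] - b[i]) > COLOR_TOLERANCE) {
                return false;
            }
        }
        return true;
    }

    /**
     * Classifies a letter that was drawn twice at the same position:
     * first filled (rendering mode 0), then stroked (rendering mode 1).
     */
    static Style classify(float[] fillOfFirstPass, float[] strokeOfSecondPass) {
        return sameColor(fillOfFirstPass, strokeOfSecondPass) ? Style.BOLD : Style.OUTLINE;
    }

    public static void main(String[] args) {
        float[] black = {0f, 0f, 0f, 1f}; // CMYK 0, 0, 0, 1
        float[] white = {0f, 0f, 0f, 0f}; // CMYK 0, 0, 0, 0
        System.out.println(classify(black, black)); // bold example above
        System.out.println(classify(white, black)); // outline example above
    }
}
```

An outline drawn on a non-white background, or a bold effect using a slightly different stroke color, would need a more careful rule than this.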

Unfortunately the PDFBox PDFTextStripper does not keep track of the text rendering mode. Furthermore it explicitly drops duplicate character occurrences in approximately the same position. Thus, it is not up to the task of recognizing these artificial styles.

If you really need to do so, you'd have to change TextPosition to also contain the rendering mode, PDFStreamEngine to add it to the generated TextPosition instances, and PDFTextStripper to not drop duplicate glyphs in processTextPosition.

Corrections

I wrote

Unfortunately the PDFBox PDFTextStripper does not keep track of the text rendering mode.

This is not entirely true: you can find the current rendering mode using getGraphicsState().getTextState().getRenderingMode(). This means that during processTextPosition you do have the rendering mode available and can store rendering mode (and color!) information for the given TextPosition somewhere, e.g. in some Map<TextPosition, ...>, for later use.

Furthermore it explicitly drops duplicate character occurrences in approximately the same position.

You can disable this by calling setSuppressDuplicateOverlappingText(false).

With these two changes you should be able to make the required tests for checking for artificial bold and outline, too.

The latter change might not even be necessary if you store and check for the styles early in processTextPosition.

How to retrieve rendering mode and color

As mentioned in Corrections it indeed is possible to retrieve rendering mode and color information by collecting that information in a processTextPosition override.

To this the OP commented that

Always the stroking and non-stroking color is coming as Black

This was a bit surprising at first but after looking at the PDFTextStripper.properties (from which the operators supported during text extraction are initialized), the reason became clear:

# The following operators are not relevant to text extraction,
# so we can silently ignore them.
...
K
k

Thus color setting operators (especially those for CMYK colors as in the present document) are ignored in this context! Fortunately the implementations of these operators for the PageDrawer can be used in this context, too.

So the following proof-of-concept shows how all required information can be retrieved.

import java.io.IOException;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.pdfbox.pdmodel.graphics.color.PDColorState;
import org.apache.pdfbox.util.PDFTextStripper;
import org.apache.pdfbox.util.TextPosition;

public class TextWithStateStripperSimple extends PDFTextStripper
{
    public TextWithStateStripperSimple() throws IOException {
        super();
        // keep both passes of double-printed letters
        setSuppressDuplicateOverlappingText(false);
        // re-enable the CMYK color operators that text extraction ignores by default
        registerOperatorProcessor("K", new org.apache.pdfbox.util.operator.SetStrokingCMYKColor());
        registerOperatorProcessor("k", new org.apache.pdfbox.util.operator.SetNonStrokingCMYKColor());
    }

    @Override
    protected void processTextPosition(TextPosition text)
    {
        // remember the graphics state that is active while this letter is drawn
        renderingMode.put(text, getGraphicsState().getTextState().getRenderingMode());
        strokingColor.put(text, getGraphicsState().getStrokingColor());
        nonStrokingColor.put(text, getGraphicsState().getNonStrokingColor());

        super.processTextPosition(text);
    }

    Map<TextPosition, Integer> renderingMode = new HashMap<TextPosition, Integer>();
    Map<TextPosition, PDColorState> strokingColor = new HashMap<TextPosition, PDColorState>();
    Map<TextPosition, PDColorState> nonStrokingColor = new HashMap<TextPosition, PDColorState>();

    @Override
    protected void writeString(String text, List<TextPosition> textPositions) throws IOException
    {
        writeString(text + '\n');

        // one output line per letter: shear, position, rendering mode, stroking and non-stroking color
        for (TextPosition textPosition: textPositions)
        {
            StringBuilder textBuilder = new StringBuilder();
            textBuilder.append(textPosition.getCharacter())
                       .append(" - shear by ")
                       .append(textPosition.getTextPos().getValue(1, 0))
                       .append(" - ")
                       .append(textPosition.getX())
                       .append(" ")
                       .append(textPosition.getY())
                       .append(" - ")
                       .append(renderingMode.get(textPosition))
                       .append(" - ")
                       .append(toString(strokingColor.get(textPosition)))
                       .append(" - ")
                       .append(toString(nonStrokingColor.get(textPosition)))
                       .append('\n');
            writeString(textBuilder.toString());
        }
    }

    String toString(PDColorState colorState)
    {
        if (colorState == null)
            return "null";
        StringBuilder builder = new StringBuilder();
        for (float f: colorState.getColorSpaceValue())
        {
            builder.append(' ')
                   .append(f);
        }

        return builder.toString();
    }
}

Using this you get the period '.' in normal text as:

. - shear by 0.0 - 256.5701 88.6875 - 0 -  0.0 0.0 0.0 1.0 -  0.0 0.0 0.0 1.0

In artificial bold text you get:

. - shear by 0.0 - 378.86 122.140015 - 0 -  0.0 0.0 0.0 1.0 -  0.0 0.0 0.0 1.0
. - shear by 0.0 - 378.86002 122.140015 - 1 -  0.0 0.0 0.0 1.0 -  0.0 0.0 0.0 1.0

In artificial italics:

. - shear by 5.10137 - 327.121 156.4123 - 0 -  0.0 0.0 0.0 1.0 -  0.0 0.0 0.0 1.0

And in artificial outline:

. - shear by 0.0 - 357.25 190.25 - 0 -  0.0 0.0 0.0 1.0 -  0.0 0.0 0.0 0.0
. - shear by 0.0 - 357.25 190.25 - 1 -  0.0 0.0 0.0 1.0 -  0.0 0.0 0.0 0.0

So, there you are, all information required for recognition of those artificial styles. Now you merely have to analyze the data.

BTW, have a look at the artificial bold case: The coordinates might not always be identical but instead merely very similar. Thus, some leniency is required for the test whether two text position objects describe the same position.
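
A sketch of such a lenient comparison, using plain values rather than PDFBox objects. The Glyph holder, the method names, and the tolerance of 0.01 units are assumptions chosen to cover the 378.86 versus 378.86002 case above; real documents may call for a tolerance relative to the font size. The sketch also assumes that the two passes of a letter appear next to each other in the list, which may not hold in general.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper for illustration; not part of PDFBox. */
final class DuplicateGlyphFinder {
    /** Plain stand-in for the data collected per text position. */
    static final class Glyph {
        final String character;
        final float x;
        final float y;
        final int renderingMode;

        Glyph(String character, float x, float y, int renderingMode) {
            this.character = character;
            this.x = x;
            this.y = y;
            this.renderingMode = renderingMode;
        }
    }

    /** Assumed tolerance in the units of the reported coordinates. */
    static final float TOLERANCE = 0.01f;

    static boolean samePosition(Glyph a, Glyph b) {
        return Math.abs(a.x - b.x) <= TOLERANCE && Math.abs(a.y - b.y) <= TOLERANCE;
    }

    /**
     * Returns each index i where glyph i is a fill pass (mode 0) and glyph i + 1
     * is a stroke pass (mode 1) of the same character at about the same position.
     */
    static List<Integer> findDoublePrinted(List<Glyph> glyphs) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i + 1 < glyphs.size(); i++) {
            Glyph first = glyphs.get(i);
            Glyph second = glyphs.get(i + 1);
            if (first.character.equals(second.character)
                    && first.renderingMode == 0
                    && second.renderingMode == 1
                    && samePosition(first, second)) {
                hits.add(i);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = Arrays.asList(
                new Glyph(".", 256.5701f, 88.6875f, 0),       // normal text
                new Glyph(".", 378.86f, 122.140015f, 0),      // bold, fill pass
                new Glyph(".", 378.86002f, 122.140015f, 1));  // bold, stroke pass
        System.out.println(findDoublePrinted(glyphs));
    }
}
```

Each index found this way marks a candidate for artificial bold or outline; the colors of the two passes then decide which of the two it is.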

