Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
622 views
in Technique[技术] by (71.8m points)

pdf - iText - Cleaning Up Text in Rectangle without cleaning full row

I'm trying to clean up text inside rectangle in pdf document using iText.

Following is the piece of code I’m using:

PdfReader pdfReader = null;
PdfStamper stamper = null;
try 
{
    int pageNo = 1;

    List<Float> linkBounds = new ArrayList<Float>();
    linkBounds.add(0, (float) 202.3);
    linkBounds.add(1, (float) 588.6);
    linkBounds.add(2, (float) 265.8);
    linkBounds.add(3, (float) 599.7);

    pdfReader = new PdfReader("Test1.pdf");
    stamper = new PdfStamper(pdfReader, new FileOutputStream("Test2.pdf"));

    Rectangle linkLocation = new Rectangle(linkBounds.get(0), linkBounds.get(1), linkBounds.get(2), linkBounds.get(3));

    List<PdfCleanUpLocation> cleanUpLocations = new ArrayList<PdfCleanUpLocation>();
    cleanUpLocations.add(new PdfCleanUpLocation(pageNo, linkLocation, BaseColor.GRAY));
    PdfCleanUpProcessor cleaner = new PdfCleanUpProcessor(cleanUpLocations, stamper);
    cleaner.cleanUp();
}
catch (Exception e)
{
    e.printStackTrace();
}
finally
{
    try {
        stamper.close();
    }
    catch (Exception e) {
        e.printStackTrace();
    }
    pdfReader.close();
}

After executing this piece of code, it’s clearing up entire line of text instead of cleaning up text only inside given rectangle.

To explain things in a better way I have attached pdf documents.

In the input pdf, I have highlighted the text to show the rectangle I’m specifying for cleaning up.

And, in the output pdf as you can clearly see that there is grey rectangle but if you notice it cleaned up the whole line of text.

Any help will be appreciated.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

The files input.pdf and output.pdf the OP originally presented did not allow to reproduce the issue but instead seemed not at all to match. Thus, there was an original answer essentially demonstrating that the issue could not be reproduced.

The second set of files Test1.pdf and Test2.pdf, though, did allow to reproduce the issue, giving rise to the updated answer...

Updated answer referring to the OP's second set of sample files

There indeed is an issue in the current (up to 5.5.8) iText clean-up code: In case of tagged files some methods of PdfContentByte used here introduced extra instructions into the content stream which actually damaged it and relocated some text in the eyes of PDF viewers which ignored the damage.

In more detail:

PdfCleanUpContentOperator.writeTextChunks used canvas.setCharacterSpacing(0) and canvas.setWordSpacing(0) to initially set the character and word spacing to 0. Unfortunately these methods in case of tagged files check whether the canvas under construction currently is in a text object and (if not) start a text object. This check depends on a local flag set by beginText; but during clean-up text objects are not started using that method. Thus, writeTextChunks here inserts an extra "BT 1 0 0 1 0 0 Tm" sequence damaging the stream and relocating the following text.

private void writeTextChunks(Map<Integer, Float> structuredTJoperands, List<PdfCleanUpContentChunk> chunks, PdfContentByte canvas,
                             float characterSpacing, float wordSpacing, float fontSize, float horizontalScaling) throws IOException {
    canvas.setCharacterSpacing(0);
    canvas.setWordSpacing(0);
    ...

PdfCleanUpContentOperator.writeTextChunks instead should use hand-crafted Tc and Tw instructions to not trigger this side effect.

private void writeTextChunks(Map<Integer, Float> structuredTJoperands, List<PdfCleanUpContentChunk> chunks, PdfContentByte canvas,
                             float characterSpacing, float wordSpacing, float fontSize, float horizontalScaling) throws IOException {
    if (Float.compare(characterSpacing, 0.0f) != 0 && Float.compare(characterSpacing, -0.0f) != 0) {
        new PdfNumber(0).toPdf(canvas.getPdfWriter(), canvas.getInternalBuffer());
        canvas.getInternalBuffer().append(Tc);
    }
    if (Float.compare(wordSpacing, 0.0f) != 0 && Float.compare(wordSpacing, -0.0f) != 0) {
        new PdfNumber(0).toPdf(canvas.getPdfWriter(), canvas.getInternalBuffer());
        canvas.getInternalBuffer().append(Tw);
    }
    canvas.getInternalBuffer().append((byte) '[');

With this change in place the OP's new sample file "Test1.pdf" is properly redacted by the sample code

@Test
public void testRedactJavishsTest1() throws IOException, DocumentException
{
    try (   InputStream resource = getClass().getResourceAsStream("Test1.pdf");
            OutputStream result = new FileOutputStream(new File(OUTPUTDIR, "Test1-redactedJavish.pdf")) )
    {
        PdfReader reader = new PdfReader(resource);
        PdfStamper stamper = new PdfStamper(reader, result);

        List<Float> linkBounds = new ArrayList<Float>();
        linkBounds.add(0, (float) 202.3);
        linkBounds.add(1, (float) 588.6);
        linkBounds.add(2, (float) 265.8);
        linkBounds.add(3, (float) 599.7);

        Rectangle linkLocation1 = new Rectangle(linkBounds.get(0), linkBounds.get(1), linkBounds.get(2), linkBounds.get(3));
        List<PdfCleanUpLocation> cleanUpLocations = new ArrayList<PdfCleanUpLocation>();
        cleanUpLocations.add(new PdfCleanUpLocation(1, linkLocation1, BaseColor.GRAY));

        PdfCleanUpProcessor cleaner = new PdfCleanUpProcessor(cleanUpLocations, stamper);
        cleaner.cleanUp();

        stamper.close();
        reader.close();
    }
}

(RedactText.java)

Original answer referring to the OP's original sample files

I just tried to reproduce your issue using this test method

@Test
public void testRedactJavishsText() throws IOException, DocumentException
{
    try (   InputStream resource = getClass().getResourceAsStream("input.pdf");
            OutputStream result = new FileOutputStream(new File(OUTPUTDIR, "input-redactedJavish.pdf")) )
    {
        PdfReader reader = new PdfReader(resource);
        PdfStamper stamper = new PdfStamper(reader, result);

        List<Float> linkBounds = new ArrayList<Float>();
        linkBounds.add(0, (float) 200.7);
        linkBounds.add(1, (float) 547.3);
        linkBounds.add(2, (float) 263.3);
        linkBounds.add(3, (float) 558.4);

        Rectangle linkLocation1 = new Rectangle(linkBounds.get(0), linkBounds.get(1), linkBounds.get(2), linkBounds.get(3));
        List<PdfCleanUpLocation> cleanUpLocations = new ArrayList<PdfCleanUpLocation>();
        cleanUpLocations.add(new PdfCleanUpLocation(1, linkLocation1, BaseColor.GRAY));

        PdfCleanUpProcessor cleaner = new PdfCleanUpProcessor(cleanUpLocations, stamper);
        cleaner.cleanUp();

        stamper.close();
        reader.close();
    }
}

(RedactText.java)

For your source PDF looking like this

input.pdf

the result was

input-redactedJavish.pdf

and not your

output.pdf

I even re-tested using the iText versions 5.5.5 you mention in a comment and also 5.5.4, but in all cases I got the correct result.

Thus, I cannot reproduce your issue.


I had a closer look at your output.pdf. It is a bit peculiar, in particular it does not contain certain blocks typical for PDFs created or manipulated by current iText versions. Furthermore the content streams look extremely different.

Thus, I assume that after iText redacted your file some other tool post-processed and in doing so damaged it.

In particular the page content instructions preparing the insertion of the redacted line look like this in your input.pdf:

q
0.24 0 0 0.24 113.7055 548.04 cm
BT
0.0116 Tc
45 0 0 45 0 0 Tm
/TT5 1 Tf
[...] TJ

and like this in the version I received directly from iText:

q
0.24 0 0 0.24 113.7055 548.04 cm
BT
0.0116 Tc
45 0 0 45 0 0 Tm
/TT5 1 Tf
0 Tc
0 Tw 
[...] TJ

but the corresponding lines in your output.pdf look like this

BT
1 0 0 1 113.3 548.5 Tm
0 Tc
BT
1 0 0 1 0 0 Tm
0 Tc 
[...] TJ

Here the instructions in your output.pdf are

  • invalid as inside a text object BT ... ET there may be no other text object but you have two BT operations following each other without an ET inbetween;
  • effectively positioning the text at 0, 0 if a PDF viewer ignores the error mentioned above.

And indeed, if you look at the bottom of your output.pdf page you'll see:

output.pdf, page bottom

So if my assumption that there is some other program post-processing the iText result, is correct, you should repair that post-processor.

If there is no such post-processor, you seem not to have the officially published iText version but something altogether different.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...