c# - How to read text of appearance stream?

Question

posted Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

I have a PDF where the text shown in an annotation (as rendered in Adobe Reader) is different than what is given by its /Contents and /RC entries. This is related to the problem that I was dealing with in this question:

Can't change /Contents of annotation

In this case, instead of changing the appearance to match the annotation's contents, I want to do the opposite: get the appearance text and change the /Contents and /RC values to match. E.g., if the annotation displays "appearance" and /Contents is set to "content", I want to do something like:

void setContent(PdfDictionary dict)
{
 PdfString str = dict.GetAsString(new PdfName("KeyForAppearanceText"));
 dict.Put(PdfName.CONTENTS,str);
}

But I can't find where the appearance text is stored. I got the dictionary referenced by /AP with this code:

private PdfDictionary getAPAnnot(PdfArray annotArray,PdfDictionary annot)
        {
            PdfDictionary apDict = annot.GetAsDict(PdfName.AP);
            if (apDict!=null)
            {
                PdfIndirectReference ap = (PdfIndirectReference)apDict.Get(PdfName.N);
                PdfDictionary apRefDict = (PdfDictionary)pdfController.pdfReader.GetPdfObject(ap.Number);
                return apRefDict;
            }
            else
            {
                return null;
            }
        }

This dictionary has the following entries:

{[/BBox, [-38.7578, -144.058, 62.0222, 1]]} 
{[/Filter, /FlateDecode]}   
{[/Length, 172]}    
{[/Matrix, [1, 0, 0, 1, 0, 0]]} 
{[/Resources, Dictionary]}

/Resources has indirect references to the fonts, but no contents. So it seems that the appearance stream doesn't include content data.

Other than /Contents and /RC, there doesn't seem to be anywhere in the annotation's data structure that stores content data. Where should I be looking for the appearance contents?


1 Reply

深蓝 · Answer 1 · 2021-10-23T21:33:52+0000

Unfortunately the OP has not provided a sample PDF. Considering his previous question, though, he is most likely interested in free text annotations, so I use an example PDF with a single page containing one typewriter free text annotation.

The OP asked:

Other than /Contents and /RC, there doesn't seem to be anywhere in the annotation's data structure that stores content data. Where should I be looking for the appearance contents?

The major shortcoming of the OP's code is that it treats the normal appearance merely as a PdfDictionary:

PdfIndirectReference ap = (PdfIndirectReference)apDict.Get(PdfName.N);
PdfDictionary apRefDict = (PdfDictionary)pdfController.pdfReader.GetPdfObject(ap.Number);

It actually is a PdfStream, i.e. a dictionary with a data stream, and this data stream is where the appearance drawing instructions are located.
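A minimal sketch of getting at that data stream, assuming iTextSharp 5.x as in the OP's code. The class and method names are made up for illustration, and decoding the bytes as ASCII is only meant for inspecting the drawing instructions, not for extracting text:

```csharp
using System.Text;
using iTextSharp.text.pdf;

public static class AppearanceStreamDump
{
    // Hypothetical helper: returns the decoded drawing instructions of the
    // normal appearance (/AP -> /N), or null if there is no such stream.
    public static string GetNormalAppearanceContent(PdfDictionary annot)
    {
        PdfDictionary apDict = annot.GetAsDict(PdfName.AP);
        if (apDict == null)
            return null;

        // GetAsStream resolves the indirect reference and yields null
        // if /N is missing or is not a stream.
        PRStream normal = apDict.GetAsStream(PdfName.N) as PRStream;
        if (normal == null)
            return null; // only streams read from a file are PRStream instances

        // GetStreamBytes applies the stream filters (e.g. /FlateDecode).
        byte[] bytes = PdfReader.GetStreamBytes(normal);

        // For inspection only: non-ASCII bytes show up as '?'.
        return Encoding.ASCII.GetString(bytes);
    }
}
```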

But even with this data stream at hand, it is not as simple as imagined by the OP:

PdfString str = dict.GetAsString(new PdfName("KeyForAppearanceText"));

Actually the text in the appearance stream can be drawn in pieces, e.g. in my sample file the stream data look like this:

0 w
131.2646 564.8243 180.008 30.984 re
n
q
1 0 0 1 0 0 cm
131.2646 564.8243 180.008 30.984 re
W
n
0 g
1 w
BT
/Cour 12 Tf
0 g
131.265 587.96 Td
(This ) Tj
35.999 0 Td
(is ) Tj
21.6 0 Td
(written ) Tj
57.599 0 Td
(using ) Tj
43.2 0 Td
(the ) Tj
-158.398 -16.3 Td
(typewriter ) Tj
79.199 0 Td
(tool.) Tj
ET
Q
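To illustrate the "drawn in pieces" point, here is a deliberately naive sketch that merely joins the operands of the Tj operators in an already decoded stream like the one above. The class and method names are hypothetical. It assumes literal strings in a standard single-byte encoding and ignores TJ arrays, hex strings, nested parentheses, and positioning, so it is not a substitute for real text extraction:

```csharp
using System.Text;
using System.Text.RegularExpressions;

public static class NaiveTjReader
{
    // Matches "(...) Tj" where the string contains no unescaped parentheses.
    private static readonly Regex TjPattern =
        new Regex(@"\(((?:\\.|[^\\()])*)\)\s*Tj");

    // Concatenates all Tj string operands in the order they appear.
    public static string JoinTjOperands(string decodedContent)
    {
        var result = new StringBuilder();
        foreach (Match match in TjPattern.Matches(decodedContent))
        {
            // Unescape only \( \) and \\; other escape sequences are left as-is.
            string piece = Regex.Replace(match.Groups[1].Value, @"\\([()\\])", "$1");
            result.Append(piece);
        }
        return result.ToString();
    }
}
```

For the stream above this joins the seven pieces into one line. It cannot tell that the negative vertical Td offset starts a new line, which is one more reason to rely on a proper extraction strategy.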

Furthermore, the encoding does not need to be some standard encoding like here but can instead be defined for an embedded font on-the-fly.

Thus, one has to apply full-fledged text extraction.

This all can be implemented like this:

for (int page = 1; page <= pdfReader.NumberOfPages; page++)
{
    Console.Write("\nPage {0}\n", page);
    PdfDictionary pageDictionary = pdfReader.GetPageNRelease(page);
    PdfArray annotsArray = pageDictionary.GetAsArray(PdfName.ANNOTS);
    if (annotsArray == null || annotsArray.IsEmpty())
    {
        Console.Write("  No annotations.\n");
        continue;
    }
    foreach (PdfObject pdfObject in annotsArray)
    {
        // Resolve indirect references to the actual annotation dictionary
        PdfObject direct = PdfReader.GetPdfObject(pdfObject);
        if (direct != null && direct.IsDictionary())
        {
            PdfDictionary annotDictionary = (PdfDictionary)direct;
            Console.Write("  SubType: {0}\n", annotDictionary.GetAsName(PdfName.SUBTYPE));
            PdfDictionary appearancesDictionary = annotDictionary.GetAsDict(PdfName.AP);
            if (appearancesDictionary == null)
            {
                Console.Write("    No appearances.\n");
                continue;
            }
            // /AP may contain /N (normal), /R (rollover), and /D (down) appearances
            foreach (PdfName key in appearancesDictionary.Keys)
            {
                Console.Write("    Appearance: {0}\n", key);
                PdfStream value = appearancesDictionary.GetAsStream(key);
                if (value != null)
                {
                    String text = ExtractAnnotationText(value);
                    Console.Write("    Text:\n---\n{0}\n---\n", text);
                }
            }
        }
    }
}

with this helper method

public String ExtractAnnotationText(PdfStream xObject)
{
    PdfDictionary resources = xObject.GetAsDict(PdfName.RESOURCES);
    ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();

    PdfContentStreamProcessor processor = new PdfContentStreamProcessor(strategy);
    processor.ProcessContent(ContentByteUtils.GetContentBytesFromContentObject(xObject), resources);
    return strategy.GetResultantText();
}

In case of the sample file above, the output of the code is

Page 1
  SubType: /FreeText
    Appearance: /N
    Text:
---
This is written using the 
typewriter tool.
---
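To come back to the OP's actual goal, the extracted text can then be written into the annotation's /Contents. A minimal sketch, assuming iTextSharp 5.x; the class and method names are hypothetical, and removing /RC rather than generating matching rich text is a simplifying assumption, not something the OP asked for:

```csharp
using iTextSharp.text.pdf;

public static class AnnotationContentSync
{
    // Hypothetical helper: makes /Contents match the text extracted
    // from the appearance stream.
    public static void SetContentsFromAppearance(PdfDictionary annot, string appearanceText)
    {
        // Store as a Unicode text string so non-Latin characters survive.
        annot.Put(PdfName.CONTENTS, new PdfString(appearanceText, PdfObject.TEXT_UNICODE));

        // Assumption: drop the rich text entry instead of leaving a stale /RC
        // that contradicts the new /Contents.
        annot.Remove(new PdfName("RC"));
    }
}
```

To persist the change, the PdfReader would have to be saved through a PdfStamper, for example by calling the helper with the result of ExtractAnnotationText for the /N appearance before closing the stamper.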

Beware, there are some annotations, in particular widget annotations of checkboxes and radio buttons, which have a slightly deeper structure than expected by the code here.
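For such widgets an /AP entry like /N is usually not a stream itself but a dictionary mapping appearance state names (e.g. /Yes and /Off) to streams. A sketch of handling both cases, again assuming iTextSharp 5.x, with hypothetical class and method names:

```csharp
using System.Collections.Generic;
using iTextSharp.text.pdf;

public static class AppearanceStreams
{
    // Hypothetical helper: collects the appearance streams behind one /AP
    // entry (/N, /R or /D), whether it is a single stream or a dictionary
    // of per-state streams.
    public static List<PdfStream> Collect(PdfDictionary appearancesDictionary, PdfName key)
    {
        var streams = new List<PdfStream>();

        // Resolve an indirect reference; yields null if the key is absent.
        PdfObject entry = PdfReader.GetPdfObject(appearancesDictionary.Get(key));
        if (entry == null)
            return streams;

        if (entry.IsStream())
        {
            // Simple case, e.g. free text annotations
            streams.Add((PdfStream)entry);
        }
        else if (entry.IsDictionary())
        {
            // Checkbox / radio button case: one stream per appearance state
            PdfDictionary states = (PdfDictionary)entry;
            foreach (PdfName state in states.Keys)
            {
                PdfStream stream = states.GetAsStream(state);
                if (stream != null)
                    streams.Add(stream);
            }
        }
        return streams;
    }
}
```

Each collected stream can then be passed to ExtractAnnotationText in place of the single GetAsStream result used in the loop above.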
