Unfortunately, the OP has not provided a sample PDF. Considering his previous question, though, he is most likely interested in free text annotations, so I use this example PDF here. It has one page with a single typewriter free text annotation.
The OP asked:
Other than /Contents and /RC, there doesn't seem to be anywhere in the annotation's data structure that stores content data. Where should I be looking for the appearance contents?
The major shortcoming of the OP's code is that he only treats the normal appearance as a PdfDictionary:
PdfIndirectReference ap = (PdfIndirectReference)apDict.Get(PdfName.N);
PdfDictionary apRefDict = (PdfDictionary)pdfController.pdfReader.GetPdfObject(ap.Number);
It actually is a PdfStream, i.e. a dictionary with a data stream, and this data stream is where the appearance drawing instructions are located.
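To look at those drawing instructions, the stream can be fetched and decoded directly. The following is only a minimal sketch, assuming iTextSharp 5.x; the class and method names (AppearanceDump, DumpNormalAppearance) are made up for illustration, and apDict stands for the annotation's /AP dictionary as in the OP's code:

```csharp
using System;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class AppearanceDump
{
    // Hypothetical helper: prints the decoded drawing instructions of the
    // normal appearance (/N) found in an annotation's /AP dictionary.
    public static void DumpNormalAppearance(PdfDictionary apDict)
    {
        // GetAsStream resolves the indirect reference and returns null
        // if /N is missing or is not a stream.
        PdfStream normal = apDict.GetAsStream(PdfName.N);
        if (normal == null)
        {
            Console.WriteLine("No normal appearance stream.");
            return;
        }

        // Applies the stream's filters and returns the raw content bytes.
        byte[] content = ContentByteUtils.GetContentBytesFromContentObject(normal);

        // The operators are ASCII, but string operands may use a font-specific
        // encoding, so this dump is meant for inspection only.
        Console.WriteLine(Encoding.ASCII.GetString(content));
    }
}
```

This is useful for inspecting what an annotation actually draws; it is not a way to extract the text.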
But even with this data stream at hand, it is not as simple as imagined by the OP:
PdfString str = dict.GetAsString(new PdfName("KeyForAppearanceText"));
Actually the text in the appearance stream can be drawn in pieces, e.g. in my sample file the stream data look like this:
0 w
131.2646 564.8243 180.008 30.984 re
n
q
1 0 0 1 0 0 cm
131.2646 564.8243 180.008 30.984 re
W
n
0 g
1 w
BT
/Cour 12 Tf
0 g
131.265 587.96 Td
(This ) Tj
35.999 0 Td
(is ) Tj
21.6 0 Td
(written ) Tj
57.599 0 Td
(using ) Tj
43.2 0 Td
(the ) Tj
-158.398 -16.3 Td
(typewriter ) Tj
79.199 0 Td
(tool.) Tj
ET
Q
Furthermore, the encoding does not need to be some standard encoding like here but can instead be defined for an embedded font on-the-fly.
Thus, one has to apply full-fledged text extraction.
All of this can be implemented as follows:
for (int page = 1; page <= pdfReader.NumberOfPages; page++)
{
    Console.Write("\nPage {0}\n", page);

    PdfDictionary pageDictionary = pdfReader.GetPageNRelease(page);
    PdfArray annotsArray = pageDictionary.GetAsArray(PdfName.ANNOTS);
    if (annotsArray == null || annotsArray.IsEmpty())
    {
        Console.Write(" No annotations.\n");
        continue;
    }

    foreach (PdfObject pdfObject in annotsArray)
    {
        // Annotations are usually stored as indirect references; resolve them first.
        PdfObject direct = PdfReader.GetPdfObject(pdfObject);
        if (direct != null && direct.IsDictionary())
        {
            PdfDictionary annotDictionary = (PdfDictionary)direct;
            Console.Write(" SubType: {0}\n", annotDictionary.GetAsName(PdfName.SUBTYPE));

            PdfDictionary appearancesDictionary = annotDictionary.GetAsDict(PdfName.AP);
            if (appearancesDictionary == null)
            {
                Console.Write(" No appearances.\n");
                continue;
            }

            // Typical keys are /N (normal), /R (rollover) and /D (down).
            foreach (PdfName key in appearancesDictionary.Keys)
            {
                Console.Write(" Appearance: {0}\n", key);

                // Returns null if the entry is not a stream.
                PdfStream value = appearancesDictionary.GetAsStream(key);
                if (value != null)
                {
                    String text = ExtractAnnotationText(value);
                    Console.Write(" Text:\n---\n{0}\n---\n", text);
                }
            }
        }
    }
}
with this helper method:
public String ExtractAnnotationText(PdfStream xObject)
{
    PdfDictionary resources = xObject.GetAsDict(PdfName.RESOURCES);
    ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();

    PdfContentStreamProcessor processor = new PdfContentStreamProcessor(strategy);
    processor.ProcessContent(ContentByteUtils.GetContentBytesFromContentObject(xObject), resources);

    return strategy.GetResultantText();
}
For the sample file above, the code outputs:
Page 1
SubType: /FreeText
Appearance: /N
Text:
---
This is written using the
typewriter tool.
---
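Beware: some annotations, in particular the widget annotations of checkboxes and radio buttons, have a slightly deeper structure than the code here expects. Their appearance entries (e.g. /N) are not streams but dictionaries that map each appearance state to its own stream, so GetAsStream returns null for them and the code above skips them silently.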
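One possible way to cover that deeper structure is to fall back to a dictionary lookup whenever an appearance entry is not a stream. This is only a sketch under the assumption of iTextSharp 5.x; the class and method names (AppearanceText, PrintAppearanceTexts, Extract) are invented for illustration, and I have not run it against real form files:

```csharp
using System;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class AppearanceText
{
    // Hypothetical variant of the inner loop above: handles an appearance
    // entry that is either a stream or a dictionary of per-state streams.
    public static void PrintAppearanceTexts(PdfDictionary appearancesDictionary)
    {
        foreach (PdfName key in appearancesDictionary.Keys)
        {
            PdfStream stream = appearancesDictionary.GetAsStream(key);
            if (stream != null)
            {
                Console.WriteLine("Appearance {0}: {1}", key, Extract(stream));
                continue;
            }

            // Not a stream: it may be a dictionary of appearance states,
            // as used by checkbox and radio button widgets.
            PdfDictionary states = appearancesDictionary.GetAsDict(key);
            if (states == null)
                continue;

            foreach (PdfName state in states.Keys)
            {
                PdfStream stateStream = states.GetAsStream(state);
                if (stateStream != null)
                    Console.WriteLine("Appearance {0}, state {1}: {2}",
                        key, state, Extract(stateStream));
            }
        }
    }

    // Same approach as the ExtractAnnotationText helper above.
    private static string Extract(PdfStream xObject)
    {
        // Assumption: an empty dictionary is an acceptable stand-in
        // when the appearance stream has no /Resources entry.
        PdfDictionary resources = xObject.GetAsDict(PdfName.RESOURCES) ?? new PdfDictionary();
        ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
        PdfContentStreamProcessor processor = new PdfContentStreamProcessor(strategy);
        processor.ProcessContent(
            ContentByteUtils.GetContentBytesFromContentObject(xObject), resources);
        return strategy.GetResultantText();
    }
}
```

Keep in mind that checkbox and radio button appearances often draw their mark with a symbol font or plain vector graphics, so the extracted text may be empty or not meaningful.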