Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
159 views
in Technique[技术] by (71.8m points)

c# - Itextsharp text extraction

I'm using itextsharp on vb.net to get the text content from a pdf file. The solution works fine for some files but not for other even quite simple ones. The problem is that the token stringvalue is set to null (a set of empty square boxes)

token = New iTextSharp.text.pdf.PRTokeniser(pageBytes)
    While token.NextToken()
        tknType = token.TokenType()
        tknValue = token.StringValue

I can meassure the length of the content but I cannot get the actual string content.

I realized that this happens depending on the font of the pdf. If I create a pdf using either Acrobat or PdfCreator with Courier (that by the way is the default font in my visual studio editor) I can get all the text content. If the same pdf is built using a different font I got the empty square boxes.

Now the question is, How can I extract text regardless of the font setting?

Thanks

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

complementary for Mark's answer that helps me a lot .iTextSharp implementation namespaces and classes are a bit different from java version

 public static string GetTextFromAllPages(String pdfPath)
    {
        PdfReader reader = new PdfReader(pdfPath); 

        StringWriter output = new StringWriter();  

        for (int i = 1; i <= reader.NumberOfPages; i++) 
            output.WriteLine(PdfTextExtractor.GetTextFromPage(reader, i, new SimpleTextExtractionStrategy()));

        return output.ToString();
    }

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...