Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
320 views
in Technique[技术] by (71.8m points)

string - How to search for Particular Line Contents in PDF and Make that Line Marked In Color using Itext in c#

I have a pdf file which i need to read and validate for its Correctness and if any wrong data Comes it should mark that Line with Red Color.Till Now i am able to read and Validate the Contents of the Pdf file by taking that into string but i am not getting how to make that line Colored,suppose Mark Red color in case any wrong data line comes.So my question is this that "How to search for Particular Line Contents in PDF and Make that Line Marked In Color". Here is my code in c#..

                ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
                string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
                currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));

                if (currentText.Contains("1 . 1 To Airtel Mobile") && currentText.Contains("Total"))
                {
                    int startPosition = currentText.IndexOf("1 . 1 To Airtel Mobile");
                    int endPosition = currentText.IndexOf("Total");

                    string result = currentText.Substring(startPosition, endPosition - startPosition);
                    // result will contain everything from and up to the Total line

                    using (StringReader reader = new StringReader(result))
                    {
                        // Loop over the lines in the string.
                                string[] split = line.Split(new Char[] { ' ' });

                    }
                }

If the line Contents gets Validated Correct its Ok else Mark that Line with Red Color in PDF file

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Please read the documentation before posting semi-duplicate questions, such as:

You have received some very good feedback, such as the answer from Nenotlep that was initially deleted (I asked the moderators to have it restored). Especially the comment by mkl should have been very useful to you. It refers to Retrieve the respective coordinates of all words on the page with itextsharp and that's exactly what you're asking now, making your question a duplicate (a possible reason to have it removed from StackOverflow).

In his answer, mkl explains that you're taking your assignment too lightly. Instead of extracting pure text, you should extract TextRenderInfo objects. These objects contain information about the content (the actual text) as well as the position on the page. See for instance the ParsingHelloWorld example from chapter 15 of my book.

The method you're using returns the content of the PDF as a string. Similar to result1.txt which is the output of said example:

Hello World

In the same example, we parse a different PDF that has the exact same content when looked at by the human eye. However, when you parse the document, the content looks like this (see result2.txt):

ld Wor llo He

The reason for this difference is inherent to the nature of PDF: the concept of lines doesn't really exist: you can add characters to a page in any which order you want. You don't even need to add complete words!

When you use the GetTextFromPage() method, you tell iText you don't want to get any info about the position of the text. Mlk has tried explaining this to you, but I'll try explaining it once more. In the example from my book, I have extended the RenderListener in a class named MyTextRenderListener. Now the output looks like this (see result3.txt).

<>
<<ld><Wor><llo><He>>
<<Hello People>>

This is the output of the same PDF we parsed when getting result2.txt. As you can see, we missed the words Hello People in the previous attempt.

The example is really simple: it just shows you have to text snippets are stored in the PDF. We get all the TextRenderInfo objects and we use the GetText() method to get the text. The order in which we get the text is the order that is used in the PDF's content stream.

When using a specific strategy, such as the LocationTextExtractionStrategy, iText retrieves all these objects and it used the GetBaseline() method to sort all the text snippets.

<<ld><Wor><llo><He>>

results in:

<<He><llo><Wor><ld>>

Then iText looks at the distance between the different snippets. In this case, iText adds a space between the <llo> and <Wor> snippet.

You are now looking to do the same thing: you are going to write a system that is going to retrieve all the text snippets, that is going to order them, examine them, and based on the composed content, you are going to add a background at those locations.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...