Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


How to read a PDF file line by line with blank spaces preserved (as is) in C#/.NET using iTextSharp

I am using iText (for .NET) to read PDF files. It reads the document, but wherever there is a run of whitespace it returns only a single space.

That makes it impossible to extract data by taking substrings. I want to read the data line by line with the whitespace intact, so that I know the actual position of each piece of text, because I want to write the data into a database.

The file is a bank statement. I want to load it into a database as part of designing a reconciliation system.

Here is a screenshot of the file.

The following is the code I am using:

            For page As Integer = 1 To pdfReader.NumberOfPages
                ' Dim strategy As ITextExtractionStrategy = New SimpleTextExtractionStrategy()

                Dim strategy As ITextExtractionStrategy = New iTextSharp.text.pdf.parser.LocationTextExtractionStrategy()
                Dim currentText As String = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy)
                ' Round-trips the text through the system default encoding;
                ' characters outside that code page may be lost here
                currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.[Default], Encoding.UTF8, Encoding.[Default].GetBytes(currentText)))

                ' Split the page text into lines
                Dim delimiterChars As Char() = {ControlChars.Lf}
                Dim lines As String() = currentText.Split(delimiterChars)

                ' Flags and fields for the statement header
                Dim Bnk_Name As Boolean = True
                Dim Br_Name As Boolean = False
                Dim Name_acc As Boolean = False
                Dim statment As Boolean = False
                Dim Curr As Boolean = False
                Dim Open As Boolean = False
                Dim BankName = ""
                Dim Branch = ""
                Dim AccountNo = ""
                Dim CompName = ""
                Dim Currency = ""
                Dim Statement_from = ""
                Dim Statement_to = ""
                Dim Opening_Balance = ""
                Dim Closing_Balance = ""
                Dim Narration As String = ""
                For Each line As String In lines

                    line.Trim() ' the result is discarded, so line itself is unchanged

                    'BANK NAME
                    If Bnk_Name Then
                        If line.Trim() <> "" Then
                            ' Fixed-position read: throws if the line is shorter than 21 characters
                            BankName = line.Substring(0, 21)
                            Bnk_Name = False
                        Else
                            Bnk_Name = False
                        End If
                    End If

This picture shows a sample of what the code reads from the file.

But I want the text exactly as it is, with the whitespace preserved, so that I can read values by position.


1 Reply


(Without seeing your PDF, this explanation is the best I can come up with.)

Your document probably does not contain any spaces. That is to say, the content streams of your document do not contain space characters. Instead, the instructions that draw the characters position each piece of text so that a gap appears where a space should be.

In that case, iText has to "guess" where the spaces are. It inserts a single space whenever two characters are further apart than the width of the space character of the font being used, however wide the gap actually is.
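
As a rough illustration of that guessing step, here is a minimal VB.NET sketch. It is not iText's actual code: the `TextChunk` type, the `JoinLine` function, the coordinates, and the space width are all made up for the example. It only shows why a gap many spaces wide still comes out as a single space.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text

Module SpaceGuessSketch
    ' Hypothetical type: one run of text and its horizontal extent on the page.
    Structure TextChunk
        Public Text As String
        Public StartX As Single
        Public EndX As Single

        Public Sub New(chunkText As String, x0 As Single, x1 As Single)
            Text = chunkText
            StartX = x0
            EndX = x1
        End Sub
    End Structure

    ' Joins the chunks of one line (assumed sorted left to right).
    ' A single space is added when the gap to the previous chunk
    ' is larger than spaceWidth; the size of the gap is otherwise ignored.
    Function JoinLine(chunks As IEnumerable(Of TextChunk), spaceWidth As Single) As String
        Dim sb As New StringBuilder()
        Dim isFirst As Boolean = True
        Dim lastEnd As Single = 0
        For Each c As TextChunk In chunks
            If Not isFirst AndAlso c.StartX - lastEnd > spaceWidth Then
                sb.Append(" "c)
            End If
            sb.Append(c.Text)
            lastEnd = c.EndX
            isFirst = False
        Next
        Return sb.ToString()
    End Function

    Sub Main()
        ' Made-up coordinates: three columns separated by wide gaps.
        Dim line As New List(Of TextChunk) From {
            New TextChunk("01/02", 50.0F, 80.0F),
            New TextChunk("CHEQUE", 200.0F, 240.0F),
            New TextChunk("1,500.00", 400.0F, 450.0F)
        }
        ' Prints: 01/02 CHEQUE 1,500.00
        Console.WriteLine(JoinLine(line, 3.0F))
    End Sub
End Module
```

The gaps of 120 and 160 units each collapse to one space, which is the behaviour described in the question.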

Possibly that's where this is going wrong.

Equally important, however: you should never rely on text positions within the extracted string to extract data. This approach is simply too error-prone.

Try using regular expressions combined with a better ITextExtractionStrategy. There is an implementation of ITextExtractionStrategy that allows you to specify a Rectangle. If you do it that way, you can get the content from your document in a much more precise way.
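
A sketch of what the rectangle-based approach could look like with iTextSharp 5.x, assuming the `RegionTextRenderFilter` and `FilteredTextRenderListener` classes from `iTextSharp.text.pdf.parser`. The `ReadRegion` helper and the coordinates in the usage comment are hypothetical; you would need to measure the real areas on your own statement (PDF coordinates are in points, with the origin normally at the bottom-left of the page).

```vbnet
Imports iTextSharp.text.pdf
Imports iTextSharp.text.pdf.parser

Module RegionExtractSketch
    ' Hypothetical helper: returns only the text whose position falls
    ' inside the given area of the given page.
    Function ReadRegion(reader As PdfReader,
                        pageNumber As Integer,
                        area As iTextSharp.text.Rectangle) As String
        ' Lets through only text located inside the area
        Dim regionFilter As New RegionTextRenderFilter(area)
        ' Wraps the usual location-based strategy with that filter
        Dim strategy As ITextExtractionStrategy =
            New FilteredTextRenderListener(New LocationTextExtractionStrategy(), regionFilter)
        Return PdfTextExtractor.GetTextFromPage(reader, pageNumber, strategy)
    End Function

    ' Usage (made-up coordinates: lower-left x/y, upper-right x/y):
    '   Dim header As String =
    '       ReadRegion(pdfReader, 1, New iTextSharp.text.Rectangle(40, 700, 300, 760))
End Module
```

Reading one column or field per rectangle means the number of spaces in the extracted text no longer matters.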

Since you're dealing with bank statements, it should be easy to extract content by using a combination of rectangle-based search and regular expressions (e.g. looking for things that match bank account numbers).
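
For the regular-expression part, a minimal sketch is shown here. The line layout (date, narration, amount) and the sample text are assumptions, since the actual statement isn't available; adjust the pattern to the real format. Because the fields are separated with `\s+`, it does not matter whether the extracted text has one space or many between columns.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module RegexSketch
    ' Assumed layout: dd/MM/yyyy, free-text narration, amount with two decimals.
    Private ReadOnly LinePattern As New Regex(
        "^(?<date>\d{2}/\d{2}/\d{4})\s+(?<narration>.+?)\s+(?<amount>-?[\d,]+\.\d{2})$")

    Sub Main()
        ' Made-up sample line, as it might come back from the text extractor
        Dim sample As String = "01/02/2015 CHEQUE DEPOSIT 1,500.00"
        Dim m As Match = LinePattern.Match(sample.Trim())
        If m.Success Then
            Console.WriteLine(m.Groups("date").Value)       ' 01/02/2015
            Console.WriteLine(m.Groups("narration").Value)  ' CHEQUE DEPOSIT
            Console.WriteLine(m.Groups("amount").Value)     ' 1,500.00
        End If
    End Sub
End Module
```

Lines that don't match (headers, page footers) are simply skipped, which is more robust than `Substring` with fixed offsets.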

