Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


ocr - tesseract fails at simple number detection

I want to perform OCR on images like this one:

6x6 matrix with numerical values

It is a table of numerical data with commas as decimal separators. It is not noisy, the contrast is good, and the text is black on a white background. As an additional preprocessing step, in order to get around issues with the frame borders, I cut out every cell, binarize it, pad it with a white border (to prevent edge issues) and pass only that single cell image to tesseract. I also looked at the individual cell images to make sure the cutting process works as expected and does not produce artifacts. These are two examples of the input images for tesseract:
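The per-cell preprocessing described above (binarize, then pad with white) can be sketched as follows. This is only a dependency-free illustration on nested lists, not the asker's actual script; the function names, the threshold of 128 and the padding width are assumptions. With OpenCV the same two steps would normally be done with `cv2.threshold` and `cv2.copyMakeBorder`.

```python
def binarize(cell, thresh=128):
    """Map each grayscale pixel to 0 (black) or 255 (white).

    `thresh` is an assumed cut-off, not a value from the question.
    """
    return [[255 if px >= thresh else 0 for px in row] for row in cell]


def pad_white(cell, pad=10):
    """Surround the cell with a white border `pad` pixels wide."""
    width = len(cell[0]) + 2 * pad
    top = [[255] * width for _ in range(pad)]
    bottom = [[255] * width for _ in range(pad)]
    middle = [[255] * pad + list(row) + [255] * pad for row in cell]
    return top + middle + bottom


# Tiny 2x3 grayscale "cell" with one dark column in the middle.
cell = [[250, 30, 240],
        [245, 20, 235]]
prepared = pad_white(binarize(cell), pad=2)
print(len(prepared), len(prepared[0]))  # 6 7
```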

Single cell from above table. Content: 1,7

Single cell from above table. Content: 57

Unfortunately, tesseract is unable to parse these consistently. I have found no configuration where all 36 values are recognized correctly.

There are a couple of similar questions here on SO, and the usual answer is a suggestion for a specific combination of the --oem and --psm parameters. So I wrote a Python script with pytesseract that loops over all combinations of --oem from 0 to 3 and --psm from 0 to 13, as well as lang=eng and lang=deu. I ignored the combinations that throw errors.
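A minimal sketch of such a parameter sweep might look like this. It is not the asker's script: `build_configs` and `run_grid` are hypothetical names, and the OCR function is passed in as an argument so the loop itself does not depend on a Tesseract installation.

```python
from itertools import product


def build_configs(oems=range(4), psms=range(14), langs=("eng", "deu")):
    """Return every (lang, config) pair of the search grid described above."""
    return [(lang, f"--oem {oem} --psm {psm}")
            for lang, oem, psm in product(langs, oems, psms)]


def run_grid(image, ocr):
    """Call `ocr` once per combination and skip the ones that raise."""
    results = {}
    for lang, config in build_configs():
        try:
            text = ocr(image, lang=lang, config=config)
        except Exception:
            # Some --oem/--psm pairs are rejected by Tesseract; ignore them.
            continue
        results[(lang, config)] = text.strip()
    return results


print(len(build_configs()))  # 112
```

With pytesseract installed, the sweep would be started as `run_grid(cell_image, pytesseract.image_to_string)`, where `cell_image` is one of the prepared cell images. Comparing the returned dictionary against the known cell values shows which combinations, if any, read every cell correctly.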

Example 1: With --psm 13 --oem 3 the above "1,7" image is misidentified as "4,7", but the "57" image is correctly recognized as "57".

Example 2: With --psm 6 --oem 3 the above "1,7" image is correctly recognized as "1,7", but the "57" image is misidentified as "o/".

Any suggestions on what else might help improve tesseract's output quality here?

My tesseract version:

tesseract v4.0.0.20190314
 leptonica-1.78.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.2.0
 Found AVX2
 Found AVX
 Found SSE
question from:https://stackoverflow.com/questions/65845004/tesseract-fails-at-simple-number-detection


1 Reply


Solution


From the original image, we can see there are 5 different rows.

In each iteration, we take one row, normalize it, and read it.

We need to understand how to set image indexes carefully.

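import cv2
from pytesseract import image_to_string

# Load the table image and convert it to grayscale.
img = cv2.imread("0VXIY.jpg")
gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
(h, w) = gry.shape[:2]

# Boundaries of the first row: each of the 5 rows is h/5 pixels tall.
start_index = 0
end_index = int(h/5)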

Question: Why do we declare start and end indexes?

We want to read a single row in each iteration. Let's look at an example:

The current image height and width are 645 and 1597 pixels.

Divide the image into rows based on these indexes:

start-index    end-index
0              129
129            258 (129 + 129)
258            387 (258 + 129)
387            516 (387 + 129)
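The index arithmetic in the table can be reproduced with a short helper. This is a sketch rather than the answer's own loop, which is not shown in this excerpt: `row_bounds` and `read_rows` are hypothetical names, and the normalization step is left out because its details are not given here.

```python
def row_bounds(height, n_rows):
    """Split `height` pixels into `n_rows` bands of equal height."""
    step = height // n_rows
    # The last band ends at `height` so leftover pixels are not dropped.
    return [(i * step, height if i == n_rows - 1 else (i + 1) * step)
            for i in range(n_rows)]


def read_rows(gry, n_rows, ocr):
    """Crop each band from `gry` and pass it to the `ocr` callable."""
    texts = []
    for start, end in row_bounds(len(gry), n_rows):
        row_img = gry[start:end]  # rows start..end-1, all columns
        texts.append(ocr(row_img))
    return texts


print(row_bounds(645, 5))
# [(0, 129), (129, 258), (258, 387), (387, 516), (516, 645)]
```

With the grayscale image `gry` from the code above, `read_rows(gry, 5, image_to_string)` would read the five bands one after another. Slicing with `gry[start:end]` keeps the full width, so it is equivalent to `gry[start:end, 0:w]`.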


...