ocr - tesseract fails at simple number detection

posted Oct 7, 2021 in Technique[技术] by 深蓝 (71.8m points)

I want to perform OCR on images like this one:

6x6 matrix with numerical values

It is a table of numerical data with commas as decimal separators. It is not noisy and the contrast is good: black text on a white background. As an additional preprocessing step, to get around issues with the frame borders, I cut out every cell, binarize it, pad it with a white border (to prevent edge issues) and pass only that single-cell image to tesseract. I also looked at the individual cell images to make sure the cutting process works as expected and does not produce artifacts. These are two examples of the input images for tesseract:

Single cell from above table. Content: 1,7

Single cell from above table. Content: 57
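
The per-cell preprocessing described above (cut, binarize, pad with white) could be sketched as follows. This is only an illustration, not the asker's actual script: the function name `prepare_cell`, the fixed threshold of 128, the padding width and the cell coordinates are all assumptions, since the question does not say how the binarization or cutting was implemented.

```python
import numpy as np


def prepare_cell(gray, top, bottom, left, right, threshold=128, pad=10):
    """Cut one cell out of a grayscale image, binarize it, pad it with white.

    `threshold` and `pad` are assumed values, not taken from the question.
    """
    # Cut the cell out of the full table image (rows first, then columns).
    cell = gray[top:bottom, left:right]
    # Simple global threshold: bright pixels become white, the rest black.
    binary = np.where(cell > threshold, 255, 0).astype(np.uint8)
    # Surround the cell with a white border so glyphs never touch the edge.
    return np.pad(binary, pad, mode="constant", constant_values=255)
```

The returned array can then be passed to the OCR call in place of the full table image.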

Unfortunately, tesseract is unable to parse these consistently. I have found no configuration where all 36 values are recognized correctly.

There are a couple of similar questions here on SO, and the usual answer is a suggestion for a specific combination of the --oem and --psm parameters. So I wrote a Python script with pytesseract that loops over all combinations of --oem from 0 to 3 and --psm from 0 to 13, as well as lang=eng and lang=deu. I ignored the combinations that throw errors.

Example 1: With --psm 13 --oem 3 the above "1,7" image is misidentified as "4,7", but the "57" image is correctly recognized as "57".

Example 2: With --psm 6 --oem 3 the above "1,7" image is correctly recognized as "1,7", but the "57" image is misidentified as "o/".
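
The parameter sweep described above might look roughly like the sketch below. The asker's script is not shown, so the names `build_configs` and `sweep` are hypothetical. The `ocr` argument is assumed to be a callable that accepts `lang` and `config` keyword arguments, as `pytesseract.image_to_string` does; passing it in keeps the sketch independent of a local tesseract install.

```python
from itertools import product


def build_configs(oems=range(4), psms=range(14)):
    """Return every '--oem X --psm Y' string for the given ranges."""
    return [f"--oem {oem} --psm {psm}" for oem, psm in product(oems, psms)]


def sweep(image, ocr, langs=("eng", "deu")):
    """Call ocr(image, lang=..., config=...) for every combination.

    Combinations that raise an error are skipped, as in the question.
    """
    results = {}
    for lang in langs:
        for config in build_configs():
            try:
                text = ocr(image, lang=lang, config=config)
            except Exception:
                continue  # unsupported combination, ignore it
            results[(lang, config)] = text.strip()
    return results
```

With pytesseract installed, `sweep(cell_image, pytesseract.image_to_string)` would return a dictionary keyed by `(lang, config)` that can be compared against the expected cell value.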

Any suggestions on what else might help improve tesseract's output quality here?

My tesseract version:

tesseract v4.0.0.20190314
 leptonica-1.78.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.2.0
 Found AVX2
 Found AVX
 Found SSE

question from:https://stackoverflow.com/questions/65845004/tesseract-fails-at-simple-number-detection

1 Reply

深蓝 · Answer 1 · 2021-10-06T19:30:45+0000

Solution

1. Divide the image into 5 rows
2. Apply division-normalization to each row
3. Set psm to 6 (Assume a single uniform block of text.)
4. Read

From the original image, we can see there are 5 different rows.

In each iteration, we take one row, apply normalization and read it.

To do that, we need to set the image indexes carefully.

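import cv2
from pytesseract import image_to_string

# Load the table image and convert it to grayscale
img = cv2.imread("0VXIY.jpg")
gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
(h, w) = gry.shape[:2]

# Pixel range of the first row: one fifth of the image height
start_index = 0
end_index = h // 5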

Question: Why do we declare start and end indexes?

We want to read a single row in each iteration. Let's look at an example:

The current image height and width are 645 and 1597 pixels.

Divide the images based on indexes:

start-index	end-index
0	129
129	258 (129 + 129)
258	387 (258 + 129)
387	516 (387 + 129)
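
The index pairs in the table follow from integer division of the image height by the number of rows. A minimal sketch, assuming the 645-pixel height and 5 rows stated above (the helper name `row_slices` is made up for illustration):

```python
def row_slices(height, rows):
    """Return (start_index, end_index) pixel pairs for equally tall rows."""
    step = height // rows  # 645 // 5 == 129
    return [(i * step, (i + 1) * step) for i in range(rows)]


for start_index, end_index in row_slices(645, 5):
    # Each pair would select one row of the grayscale image,
    # e.g. gry[start_index:end_index, :]
    print(start_index, end_index)
```

The first four pairs match the table; the fifth, 516 to 645, follows the same pattern and ends at the image height.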
