Description
Environment
- Tesseract Version: 4.1.1 / 5.0.0 α
- Commit Number: 5.0.0-alpha-781-gb19e3ee
- Platform: Mac OS X 10.9.5 (not one of the 3 most recent versions, but I have no reason to believe that the issue is related to my OS)
Output of tesseract --version for both builds:
tesseract 4.1.1
leptonica-1.80.0
libjpeg 9d : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 1.1.0 : libopenjp2 2.3.1
Found AVX2
Found AVX
Found FMA
Found SSE
Found libarchive 3.4.3 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8 liblz4/1.9.2 libzstd/1.4.4
tesseract 5.0.0-alpha-773-gd33ed
leptonica-1.80.0
libjpeg 9d : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 1.1.0 : libopenjp2 2.3.1
Found AVX2
Found AVX
Found FMA
Found SSE
Found OpenMP 201307
Found libarchive 3.4.3 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8 liblz4/1.9.2 libzstd/1.4.4
Found libcurl/7.72.0 OpenSSL/1.1.1g zlib/1.2.11 libidn2/2.3.0 libpsl/0.21.1 (+libidn2/2.3.0)
Current Behavior:
Character bounding boxes are unreliable: a box sometimes captures (parts of) the previous character(s) or even misses its associated character completely.
Expected Behavior:
The bounding box should at least contain its associated character and overlap only in cases where the characters themselves overlap.
Suggested Fix:
Adjust how much bounding boxes can overlap. Maybe implement an option to force the x_min value of a character box to be no less than the x_max value of the previous bbox.
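The suggested constraint can also be approximated outside Tesseract, as a post-processing step on the box coordinates. The following is a minimal sketch, not an existing Tesseract option: the function name `clamp_left_edges` and the `(char, left, bottom, right, top)` tuple layout are assumptions made for illustration. The sample values are the first four boxes of the 4.1.1 output listed under Details.

```python
from typing import List, Tuple

# Hypothetical layout for one character box: (char, left, bottom, right, top)
Box = Tuple[str, int, int, int, int]


def clamp_left_edges(boxes: List[Box]) -> List[Box]:
    """Force each box's left edge to be no less than the previous box's right edge."""
    clamped: List[Box] = []
    prev_right = None
    for char, left, bottom, right, top in boxes:
        if prev_right is not None:
            # Push the left edge right, but never past the box's own right edge.
            left = min(max(left, prev_right), right)
        clamped.append((char, left, bottom, right, top))
        prev_right = right
    return clamped


word = [
    ("n", 50, 151, 65, 167),
    ("ø", 68, 150, 84, 167),
    ("y", 70, 144, 103, 171),
    ("t", 87, 144, 113, 171),
]
for box in clamp_left_edges(word):
    print(box)
```

Clamping only removes the overlap. It cannot repair a box that lies entirely within the previous box's horizontal extent; in that case the clamped box collapses to zero width, which at least makes the failure easy to detect.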
Details
I ran Tesseract 4.1.1 (on Mac OS X 10.9.5, installed through MacPorts) on a scanned page (grayscale JPEG) using the following command:
tesseract INPUT.jpg OUTPUT -c hocr_char_boxes=1 -c tessedit_create_hocr=1 -l nor --oem 1 makebox
My plan was to write a script that extracts individual characters and sorts them by symbol (for individual processing). I therefore hoped to make use of the new hOCR character bounding box support introduced in 4.1.0, but quickly ran into problems: while the OCR result itself was near perfect, Tesseract sometimes produced unexpected character bounding boxes.
To investigate the issue, I wrote a quick Python script that uses .box files produced by Tesseract to extract the individual characters and assemble an image strip with the OCRed character printed below the character bounding box.
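The script itself is not reproduced here. As a rough sketch of its first step, the code below parses one .box line and converts it to a crop rectangle. It relies on the box file convention that each line is `symbol left bottom right top page`, with the origin at the bottom-left corner of the image. The names `CharBox`, `parse_box_line` and `to_crop_rect`, and the image height of 200 px, are assumptions for illustration only.

```python
from typing import NamedTuple, Tuple


class CharBox(NamedTuple):
    char: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> CharBox:
    """Parse one line of a Tesseract .box file."""
    # Split from the right so that the symbol field is kept intact.
    char, left, bottom, right, top, page = line.rstrip("\n").rsplit(" ", 5)
    return CharBox(char, int(left), int(bottom), int(right), int(top), int(page))


def to_crop_rect(box: CharBox, image_height: int) -> Tuple[int, int, int, int]:
    """Convert bottom-left-origin box coordinates to (left, upper, right, lower)
    with a top-left origin, the layout most image libraries expect for cropping."""
    return (box.left, image_height - box.top, box.right, image_height - box.bottom)


box = parse_box_line("n 50 151 65 167 0")
print(box)
print(to_crop_rect(box, image_height=200))  # image height is a made-up value
```

The resulting tuple could be passed to a cropping routine such as Pillow's `Image.crop`, which takes a `(left, upper, right, lower)` box.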
Consider the following sample (A):
Tesseract 4.1.1 produces the following .box file (truncated to the first three words):
n 50 151 65 167 0
ø 68 150 84 167 0
y 70 144 103 171 0
t 87 144 113 171 0
r 116 151 127 167 0
u 129 151 145 167 0
m 150 151 175 167 0
- 191 151 200 172 0
e 191 158 199 161 0
t 201 151 226 172 0
e 239 151 251 175 0
l 239 151 253 167 0
. 256 151 270 175 0
n 288 151 304 167 0
ø 306 150 323 168 0
y 309 145 337 172 0
t 325 145 341 167 0
r 342 151 352 172 0
e 355 151 365 167 0
t 366 151 391 172 0
From this my script produced the following image:
There are several overlapping bboxes, some including (parts of) other characters and a few even missing their associated character completely.
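The overlaps can also be read directly from the box data above, without looking at the image. A minimal sketch, assuming a hypothetical helper `horizontal_overlaps` that compares each box's left edge with the right edge of the box before it (the input is the first seven boxes of the listing):

```python
from typing import List, Tuple

# (char, left, bottom, right, top), as in the .box listing
Box = Tuple[str, int, int, int, int]


def horizontal_overlaps(boxes: List[Box]) -> List[Tuple[str, str, int]]:
    """Return (previous char, current char, overlap in px) for each pair of
    consecutive boxes whose horizontal extents overlap."""
    overlaps = []
    for prev, cur in zip(boxes, boxes[1:]):
        overlap = prev[3] - cur[1]  # previous right edge minus current left edge
        if overlap > 0:
            overlaps.append((prev[0], cur[0], overlap))
    return overlaps


first_boxes = [
    ("n", 50, 151, 65, 167),
    ("ø", 68, 150, 84, 167),
    ("y", 70, 144, 103, 171),
    ("t", 87, 144, 113, 171),
    ("r", 116, 151, 127, 167),
    ("u", 129, 151, 145, 167),
    ("m", 150, 151, 175, 167),
]
print(horizontal_overlaps(first_boxes))
# [('ø', 'y', 14), ('y', 't', 16)]
```

Here the box for "y" starts 14 px before the box for "ø" ends, and the box for "t" starts 16 px before the box for "y" ends.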
Reading through the similar issue reports that I could find, I learned that the LSTM engine does not actually output bounding boxes, but rather a simple x coordinate per character and that Tesseract then tries to create a bounding box from it.
I assume that this explains why the bboxes sometimes extend past the character they belong to and in some cases even overlap with other bboxes. However, I do not see how that can make a bbox completely miss its associated character, even though it was correctly OCRed, like in these five cases:
When I extracted these three words and added some white background to produce this image … (B):
… the results also changed slightly (once again I have drawn red rectangles around the cases where a bbox captures the wrong character):
I have no idea why – all images in this test are saved in .png format, so compression artifacts should not be an issue here.
When I tried the legacy engine, the bboxes were correct, but the accuracy dropped. From what I read, that is expected due to how the legacy engine works (I assume that it is based on matching individual characters).
Since the release of 4.1.1, some improvements seem to have been made, but I was unable to find anything specific in the commit history, so it might be random. I compiled the latest revision (version 5.0.0 alpha) and ran the same commands as above. This time the following two images were produced:
The bboxes are more accurate than with version 4.1.1, but there are still problems (and they are the same as with 4.1.1). Summary of the problems:
- When a character is affected, the error “accumulates” and subsequent characters are usually affected too. The algorithm will almost never “recover” before the word ends once it has started producing incorrect bboxes (exception: compare the last 4-5 characters in the two last images).
- The worst case is that a bbox captures the previous character, often perfectly. So far I have not seen any cases where the bbox of character n contains parts of character n-2. If it completely misses character n, it will capture all of character n-1 and only that.
- A chain of errors always ends on the (detected) word boundary.
- The last character bbox in a word always captures its associated character plus any leftovers from the previous character(s) if they were affected by the problems.
Based on this I assume that the engine is identifying words, not characters, and subsequently attempts to split each identified word into separate characters. It looks like Tesseract does not check if a calculated character bbox is overlapping with other bboxes, but perhaps it should (or at least have an option to)?