Incorrect character bounding boxes #3105

### Environment

* **Tesseract Version**: 4.1.1 / 5.0.0 α
* **Commit Number**: 5.0.0-alpha-781-gb19e3ee
* **Platform**: Mac OS X 10.9.5 (not one of the 3 most recent versions, but I have no reason to believe that the issue is related to my OS)

`tesseract --version` for both builds:

```
tesseract 4.1.1
 leptonica-1.80.0
  libjpeg 9d : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 1.1.0 : libopenjp2 2.3.1
 Found AVX2
 Found AVX
 Found FMA
 Found SSE
 Found libarchive 3.4.3 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8 liblz4/1.9.2 libzstd/1.4.4
```

```
tesseract 5.0.0-alpha-773-gd33ed
 leptonica-1.80.0
  libjpeg 9d : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 1.1.0 : libopenjp2 2.3.1
 Found AVX2
 Found AVX
 Found FMA
 Found SSE
 Found OpenMP 201307
 Found libarchive 3.4.3 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8 liblz4/1.9.2 libzstd/1.4.4
 Found libcurl/7.72.0 OpenSSL/1.1.1g zlib/1.2.11 libidn2/2.3.0 libpsl/0.21.1 (+libidn2/2.3.0)
```

### Current Behavior:

Character bounding boxes are unreliable: a box sometimes captures (parts of) the previous character(s), or even misses its associated character completely.

### Expected Behavior:

The bounding box should at least contain its associated character and overlap only in cases where the characters themselves overlap.

### Suggested Fix:

Adjust how much bounding boxes can overlap. Maybe implement an option to force the x_min value of a character box to be no less than the x_max value of the previous bbox.
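
To illustrate the proposed option, here is a minimal post-processing sketch in Python. It is not Tesseract code; the function name `clamp_left_edges` and the tuple layout are assumptions made for this example, and it only handles boxes from a single text line in reading order. It removes horizontal overlap but cannot repair a box that has missed its character entirely.

```python
def clamp_left_edges(boxes):
    """Force each box's left edge to be no less than the previous box's right edge.

    boxes: list of (symbol, left, bottom, right, top) tuples, in reading order,
    all from the same text line (hypothetical layout for this sketch).
    """
    result = []
    prev_right = None
    for symbol, left, bottom, right, top in boxes:
        if prev_right is not None and left < prev_right:
            # Never move the left edge past the box's own right edge.
            left = min(prev_right, right)
        result.append((symbol, left, bottom, right, top))
        # Compare against the original right edge of the previous box.
        prev_right = right
    return result


# Three consecutive boxes taken from the 4.1.1 output for sample A
boxes = [
    ("ø", 68, 150, 84, 167),
    ("y", 70, 144, 103, 171),
    ("t", 87, 144, 113, 171),
]
for box in clamp_left_edges(boxes):
    print(box)
```

A box that lies entirely inside its predecessor collapses to zero width under this rule, which is one reason a real implementation would need more than a simple clamp.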

------------------------

### Details

I ran Tesseract 4.1.1 (on Mac OS X 10.9.5, installed through MacPorts) on a scanned page (grayscale JPEG) using the following command:

`tesseract INPUT.jpg OUTPUT -c hocr_char_boxes=1 -c tessedit_create_hocr=1 -l nor --oem 1 makebox`

My plan was to write a script that extracts individual characters and sorts them by symbol (for individual processing). I therefore hoped to make use of the new hOCR character bounding box support introduced in 4.1.0, but quickly ran into problems: while the OCR result itself was near perfect, Tesseract sometimes produced unexpected character bounding boxes. 

To investigate the issue, I wrote a quick Python script that uses .box files produced by Tesseract to extract the individual characters and assemble an image strip with the OCRed character printed below the character bounding box.
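
A minimal sketch of the parsing step for such a script, assuming the usual box file layout `symbol left bottom right top page` with the origin at the bottom-left corner of the image. The names `CharBox`, `parse_box_line` and `to_crop_rect` are illustrative only; this is not the exact script used to produce the images in this report.

```python
from typing import NamedTuple, Tuple


class CharBox(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> CharBox:
    """Parse one line of a .box file into a CharBox."""
    # Split from the right so the five numeric fields are always separated,
    # whatever the symbol field contains.
    symbol, left, bottom, right, top, page = line.rstrip("\n").rsplit(" ", 5)
    return CharBox(symbol, int(left), int(bottom), int(right), int(top), int(page))


def to_crop_rect(box: CharBox, image_height: int) -> Tuple[int, int, int, int]:
    """Convert bottom-left-origin box coordinates to a top-left-origin
    (left, upper, right, lower) rectangle, as most image libraries expect."""
    return (box.left, image_height - box.top, box.right, image_height - box.bottom)


box = parse_box_line("n 50 151 65 167 0")
# image_height=200 is a made-up value for this example
print(box, to_crop_rect(box, image_height=200))
```

The y-axis flip matters: forgetting it yields crops from the wrong part of the page, which would look like a bounding box problem but is not one.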

Consider the following sample (`A`):

<img width="446" alt="A" src="https://user-images.githubusercontent.com/5731506/94082809-1261a400-fe02-11ea-81c5-8b8cb7993e73.png">

Tesseract 4.1.1 produces the following .box file (truncated to the first three words):

```
n 50 151 65 167 0
ø 68 150 84 167 0
y 70 144 103 171 0
t 87 144 113 171 0
r 116 151 127 167 0
u 129 151 145 167 0
m 150 151 175 167 0
- 191 151 200 172 0
e 191 158 199 161 0
t 201 151 226 172 0
e 239 151 251 175 0
l 239 151 253 167 0
. 256 151 270 175 0
n 288 151 304 167 0
ø 306 150 323 168 0
y 309 145 337 172 0
t 325 145 341 167 0
r 342 151 352 172 0
e 355 151 365 167 0
t 366 151 391 172 0
```

From this my script produced the following image:

![strip_A](https://user-images.githubusercontent.com/5731506/94082825-1988b200-fe02-11ea-8ca8-706251e7a7e6.png)

There are several overlapping bboxes, some including (parts of) other characters and a few even missing their associated character completely.
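
The overlaps can also be found without looking at the image. The following is a small sketch that reports consecutive boxes overlapping on the x axis; the function name `find_horizontal_overlaps` and the tuple layout are assumptions made for this example.

```python
def find_horizontal_overlaps(boxes):
    """Return (previous_symbol, symbol, overlap_px) for each pair of
    consecutive boxes that overlap on the x axis.

    boxes: list of (symbol, left, bottom, right, top) tuples in reading order.
    """
    overlaps = []
    for (p_sym, p_left, _, p_right, _), (sym, left, _, right, _) in zip(boxes, boxes[1:]):
        overlap = min(p_right, right) - max(p_left, left)
        if overlap > 0:
            overlaps.append((p_sym, sym, overlap))
    return overlaps


# First four boxes of the 4.1.1 output for sample A
sample = [
    ("n", 50, 151, 65, 167),
    ("ø", 68, 150, 84, 167),
    ("y", 70, 144, 103, 171),
    ("t", 87, 144, 113, 171),
]
print(find_horizontal_overlaps(sample))
# [('ø', 'y', 14), ('y', 't', 16)]
```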

Reading through the similar issue reports that I could find, I learned that the LSTM engine does not actually output bounding boxes, but rather a simple x coordinate per character and that Tesseract then tries to [create a bounding box from it](https://github.com/tesseract-ocr/tesseract/issues/2825#issuecomment-579220987).

I assume that this explains why the bboxes sometimes extend past the character they belong to and in some cases even overlap with other bboxes. However, I do not see how that can make a bbox completely miss its associated character, even though it was correctly OCRed, as in these five cases:

![strip_A+prob](https://user-images.githubusercontent.com/5731506/94082835-1e4d6600-fe02-11ea-9a53-2ed91583c5c7.png)

When I extracted these three words and added some white background to produce this image … (`B`):

<img width="300" alt="B" src="https://user-images.githubusercontent.com/5731506/94082851-22798380-fe02-11ea-8da8-457e8ed3f516.png">

… the results also changed slightly (once again I have drawn red rectangles around the cases where a bbox captures the wrong character):

![strip_B+prob](https://user-images.githubusercontent.com/5731506/94082862-26a5a100-fe02-11ea-9b99-cb200cf1ab59.png)

I have no idea why – all images in this test are saved in .png format, so compression artifacts should not be an issue here.

When I tried the legacy engine, the bboxes were correct, but recognition accuracy dropped. From what I have read, that is expected given how the legacy engine works (I assume that it is based on matching individual characters).

Since the release of 4.1.1, some improvements seem to have been made, but I was unable to find anything specific in the commit history, so the difference might be coincidental. I compiled the latest revision (version 5.0.0 alpha) and ran the same commands as above, which produced the following two images:

For `A`:
![strip_A_n](https://user-images.githubusercontent.com/5731506/94082882-2ad1be80-fe02-11ea-9a9e-66aed00a3af2.png)

For `B`:
![strip_B_n](https://user-images.githubusercontent.com/5731506/94082884-2d341880-fe02-11ea-8e7d-e33ade02a0a2.png)

The bboxes are more accurate than with version 4.1.1, but the same kinds of problems remain. Summary of the problems:
* When a character is affected, the error “accumulates” and subsequent characters are usually affected too. The algorithm will almost never “recover” before the word ends once it has started producing incorrect bboxes (exception: compare the last 4-5 characters in the two last images).
* The worst case is that a bbox captures the previous character, often perfectly – so far I have not seen any cases where the bbox of character `n` contains parts of character `n-2`. If it completely misses character `n`, it will capture all of character `n-1` and only that.
* A chain of errors always ends on the (detected) word boundary.
* The last character bbox in a word always captures its associated character plus any leftovers from the previous character(s) if they were affected by the problems.

Based on this I assume that the engine is identifying words, not characters, and subsequently attempts to split each identified word into separate characters. It looks like Tesseract does not check if a calculated character bbox is overlapping with other bboxes, but perhaps it should (or at least have an option to)?
