Overlapping Character Bounding Boxes

Hi all, I hope you are having a joyful Christmas time.

Tesseract version: 4.* and 5.alpha.*
Platform: Windows
Command line: `tesseract .\billion.png out -l eng -c hocr_char_boxes=1 makebox hocr pdf`
On this image:
![billion](https://user-images.githubusercontent.com/58804365/71066544-1a096380-2173-11ea-9db9-c5f0620e0703.png)

Result in box file:
B **210** 18 218 48 0
i **210** 18 234 47 0
l 237 18 258 48 0
l 259 18 269 48 0
i 270 18 280 48 0
o 282 18 303 41 0
n 305 18 327 41 0
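
To make the overlap easier to spot in larger box files, here is a minimal Python sketch that flags neighbouring characters whose boxes overlap horizontally. It assumes the usual box-file column order (`char left bottom right top page`) and that the bold markers around `210` above are only emphasis, so the sample omits them. The names `CharBox`, `parse_box_lines`, and `find_horizontal_overlaps` are hypothetical helpers, not part of Tesseract.

```python
from typing import List, NamedTuple, Tuple


class CharBox(NamedTuple):
    """One line of a box file: char left bottom right top page (assumed order)."""
    char: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_lines(text: str) -> List[CharBox]:
    """Parse box-file lines, skipping any line that does not have 6 fields."""
    boxes = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 6:
            continue  # blank or unexpected line; ignored in this sketch
        char, *nums = parts
        boxes.append(CharBox(char, *(int(n) for n in nums)))
    return boxes


def find_horizontal_overlaps(boxes: List[CharBox]) -> List[Tuple[CharBox, CharBox]]:
    """Return consecutive pairs where the next box starts before the previous one ends."""
    return [
        (prev, cur)
        for prev, cur in zip(boxes, boxes[1:])
        if cur.left < prev.right
    ]


# The box-file output reported above, without the bold markers.
SAMPLE = """\
B 210 18 218 48 0
i 210 18 234 47 0
l 237 18 258 48 0
l 259 18 269 48 0
i 270 18 280 48 0
o 282 18 303 41 0
n 305 18 327 41 0
"""

for prev, cur in find_horizontal_overlaps(parse_box_lines(SAMPLE)):
    # Prints: 'B' [210, 218] overlaps 'i' [210, 234]
    print(f"{prev.char!r} [{prev.left}, {prev.right}] overlaps "
          f"{cur.char!r} [{cur.left}, {cur.right}]")
```

On the data above, only the `B`/`i` pair is flagged. The check compares consecutive characters only and assumes they sit on the same text line, so it would need refinement for multi-line box files.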

Same in hocr:
`<span class='ocrx_cinfo' title='x_bboxes 210 22 218 52; x_conf 99.543304'>B</span>
<span class='ocrx_cinfo' title='x_bboxes 210 23 234 52; x_conf 99.536743'>i</span>`
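
The same check can be sketched for the hOCR output. The snippet below is an illustration only: it uses a regular expression that matches exactly the `ocrx_cinfo` span layout shown above (single quotes, `x_bboxes` followed by `x_conf`), which is an assumption and may not hold for other hOCR producers. The helper names `parse_cinfo` and `overlap_width` are hypothetical.

```python
import re
from typing import List, Tuple

# Matches spans in exactly the form shown in this report (an assumption).
CINFO_RE = re.compile(
    r"<span class='ocrx_cinfo' title='x_bboxes (\d+) (\d+) (\d+) (\d+); "
    r"x_conf ([\d.]+)'>(.*?)</span>"
)

Box = Tuple[int, int, int, int]  # x0, y0, x1, y1


def parse_cinfo(hocr: str) -> List[Tuple[str, Box, float]]:
    """Extract (character, bounding box, confidence) from ocrx_cinfo spans."""
    result = []
    for m in CINFO_RE.finditer(hocr):
        x0, y0, x1, y1 = (int(m.group(i)) for i in range(1, 5))
        result.append((m.group(6), (x0, y0, x1, y1), float(m.group(5))))
    return result


def overlap_width(a: Box, b: Box) -> int:
    """Width in pixels of the horizontal overlap between two boxes (0 if none)."""
    return max(0, min(a[2], b[2]) - max(a[0], b[0]))


HOCR = (
    "<span class='ocrx_cinfo' title='x_bboxes 210 22 218 52; x_conf 99.543304'>B</span>\n"
    "<span class='ocrx_cinfo' title='x_bboxes 210 23 234 52; x_conf 99.536743'>i</span>"
)

chars = parse_cinfo(HOCR)
for (c1, b1, _), (c2, b2, _) in zip(chars, chars[1:]):
    width = overlap_width(b1, b2)
    if width > 0:
        # Prints: 'B' and 'i' overlap horizontally by 8 px
        print(f"{c1!r} and {c2!r} overlap horizontally by {width} px")
```

For the two spans above, the box for `B` lies entirely inside the horizontal extent of the box for `i`.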

In PDF:
Open it in a PDF viewer such as Acrobat and select "thousan".
![marked](https://user-images.githubusercontent.com/58804365/71067058-3063ef00-2174-11ea-9014-99da838221a8.JPG)

Then press Ctrl-C and paste into an editor with Ctrl-V. The result is "thousand", one character more than was selected.

Comment: We see this in German as well. We always use the "best" or "fast" traineddata. Larger files contain many cases like the one above, but I wanted to keep the example as simple as possible.
We can also reproduce this when using the API directly, so I think the cause might lie deep in the system.
Versions 4.* and 5.* differ in the outcome: in 4.*, the overlapping letters are sometimes different from those in 5.*.

Thank you for all your work. Have a great Christmas time and a happy new year!

Overlapping Character Bounding Boxes #2825
