Overlapping Character Bounding Boxes #2825

@RicketyRick

Hi all, I hope you are having a joyful Christmas season.

Tesseract version: 4.* and 5.alpha.*
Platform: Windows
Command line: tesseract .\billion.png out -l eng -c hocr_char_boxes=1 makebox hocr pdf
On this image:
billion

Result in box file:
B 210 18 218 48 0
i 210 18 234 47 0
l 237 18 258 48 0
l 259 18 269 48 0
i 270 18 280 48 0
o 282 18 303 41 0
n 305 18 327 41 0
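
To make the overlap easy to spot, here is a minimal sketch that parses the box lines above and reports neighbouring characters whose x-ranges intersect. It assumes the usual box-file layout `char left bottom right top page`; the helper names (`parse_box_lines`, `horizontal_overlaps`) are made up for illustration and are not part of Tesseract.

```python
from typing import List, NamedTuple, Tuple


class CharBox(NamedTuple):
    char: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_lines(text: str) -> List[CharBox]:
    """Parse lines of the form 'char left bottom right top page'."""
    boxes = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        left, bottom, right, top, page = map(int, parts[1:])
        boxes.append(CharBox(parts[0], left, bottom, right, top, page))
    return boxes


def horizontal_overlaps(boxes: List[CharBox]) -> List[Tuple[str, str, int]]:
    """Return (previous char, current char, overlap width) for adjacent boxes."""
    found = []
    for prev, cur in zip(boxes, boxes[1:]):
        width = min(prev.right, cur.right) - max(prev.left, cur.left)
        if prev.page == cur.page and width > 0:
            found.append((prev.char, cur.char, width))
    return found


# Box output copied from the report above.
BOX_TEXT = """\
B 210 18 218 48 0
i 210 18 234 47 0
l 237 18 258 48 0
l 259 18 269 48 0
i 270 18 280 48 0
o 282 18 303 41 0
n 305 18 327 41 0
"""

if __name__ == "__main__":
    for a, b, width in horizontal_overlaps(parse_box_lines(BOX_TEXT)):
        print(f"{a!r} and {b!r} overlap horizontally by {width} px")
```

On the data above, only the first pair is flagged: the "B" box (x 210 to 218) lies entirely inside the x-range of the "i" box (x 210 to 234).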

The same overlap appears in the hOCR output:
<span class='ocrx_cinfo' title='x_bboxes 210 22 218 52; x_conf 99.543304'>B</span> <span class='ocrx_cinfo' title='x_bboxes 210 23 234 52; x_conf 99.536743'>i</span>
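
The hOCR and box outputs can be cross-checked with a short script. The sketch below pulls the `ocrx_cinfo` spans out with a regular expression and flips the y-axis, since hOCR measures from the top-left corner and box files from the bottom-left. The image height of 70 px is an assumption inferred from comparing the two outputs; it is not stated in the report. The helper names are hypothetical.

```python
import html
import re

# Matches the single-quoted ocrx_cinfo spans exactly as they appear above.
CINFO_RE = re.compile(
    r"<span class='ocrx_cinfo' title='x_bboxes (\d+) (\d+) (\d+) (\d+); "
    r"x_conf ([0-9.]+)'>(.*?)</span>"
)


def parse_cinfo(hocr):
    """Return (char, x0, y0, x1, y1, conf) for each ocrx_cinfo span."""
    entries = []
    for m in CINFO_RE.finditer(hocr):
        x0, y0, x1, y1 = (int(m.group(i)) for i in range(1, 5))
        char = html.unescape(m.group(6))
        entries.append((char, x0, y0, x1, y1, float(m.group(5))))
    return entries


def to_box_coords(entry, image_height):
    """Convert top-left-origin hOCR coordinates to (char, left, bottom, right, top)."""
    char, x0, y0, x1, y1, _conf = entry
    return (char, x0, image_height - y1, x1, image_height - y0)


# hOCR fragment copied from the report above.
HOCR = (
    "<span class='ocrx_cinfo' title='x_bboxes 210 22 218 52; x_conf 99.543304'>B</span> "
    "<span class='ocrx_cinfo' title='x_bboxes 210 23 234 52; x_conf 99.536743'>i</span>"
)

ASSUMED_IMAGE_HEIGHT = 70  # assumption: inferred, not given in the report

if __name__ == "__main__":
    for entry in parse_cinfo(HOCR):
        print(to_box_coords(entry, ASSUMED_IMAGE_HEIGHT))
```

Under that assumed height, the converted values match the first two box-file lines (`B 210 18 218 48` and `i 210 18 234 47`), which suggests both outputs are reporting the same underlying boxes rather than one writer introducing the overlap.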

In PDF:
Open it in a PDF viewer such as Acrobat and select "thousan".
marked

Then copy with Ctrl-C and paste into an editor with Ctrl-V. The result is "thousand".

Comment: We see this in German as well, always using the "best" or "fast" traineddata. Larger files contain many cases like the one above, but I wanted to keep the example as simple as possible.
We could also reproduce this when using the API directly, so I think the cause might lie deep in the system.
Versions 4.* and 5.* differ in the outcome: in 4.*, the overlapping letters are sometimes different from those in 5.*.

Thank you for all your work. Have a great Christmas and a happy new year!
