Description
Hi all, I hope you are having a joyful Christmas.
Tesseract version: 4.* and 5.alpha.*
Platform: Windows
Command line: tesseract .\billion.png out -l eng -c hocr_char_boxes=1 makebox hocr pdf
On this image:
Result in box file:
B 210 18 218 48 0
i 210 18 234 47 0
l 237 18 258 48 0
l 259 18 269 48 0
i 270 18 280 48 0
o 282 18 303 41 0
n 305 18 327 41 0
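To make the overlap easier to spot, here is a minimal Python sketch that parses box file lines and reports neighbouring characters whose boxes overlap horizontally. It assumes the usual box file layout of `char left bottom right top page`; the names `CharBox`, `parse_box_lines` and `overlapping_neighbours` are hypothetical helpers written for this report, not part of Tesseract.

```python
from typing import List, NamedTuple, Tuple


class CharBox(NamedTuple):
    """One line of a box file (assumed layout: char left bottom right top page)."""
    char: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_lines(text: str) -> List[CharBox]:
    """Parse box file text into CharBox records, skipping malformed lines."""
    boxes = []
    for line in text.splitlines():
        parts = line.rsplit(None, 5)  # split from the right: last 5 fields are numbers
        if len(parts) != 6:
            continue
        left, bottom, right, top, page = map(int, parts[1:])
        boxes.append(CharBox(parts[0], left, bottom, right, top, page))
    return boxes


def overlapping_neighbours(boxes: List[CharBox]) -> List[Tuple[CharBox, CharBox]]:
    """Return consecutive pairs on the same page where the second box
    starts to the left of the right edge of the first box."""
    return [
        (prev, cur)
        for prev, cur in zip(boxes, boxes[1:])
        if prev.page == cur.page and cur.left < prev.right
    ]


# The box file output quoted above.
SAMPLE = """\
B 210 18 218 48 0
i 210 18 234 47 0
l 237 18 258 48 0
l 259 18 269 48 0
i 270 18 280 48 0
o 282 18 303 41 0
n 305 18 327 41 0
"""

for prev, cur in overlapping_neighbours(parse_box_lines(SAMPLE)):
    print(f"{prev.char!r} ({prev.left}-{prev.right}) overlaps {cur.char!r} ({cur.left}-{cur.right})")
```

On the sample above this flags only the first pair: "B" and "i" both start at x = 210.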
The same overlap appears in the hOCR output:
<span class='ocrx_cinfo' title='x_bboxes 210 22 218 52; x_conf 99.543304'>B</span> <span class='ocrx_cinfo' title='x_bboxes 210 23 234 52; x_conf 99.536743'>i</span>
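For cross-checking, the following Python sketch pulls the character boxes out of the hOCR snippet and compares them with the box file values. hOCR measures y from the top of the image while box files measure from the bottom, so the y values have to be flipped. The image height of 70 px is an assumption inferred from the numbers in this report (18 + 52 = 48 + 22 = 70), and `parse_cinfo` and `box_to_hocr_y` are hypothetical helper names. The regex only targets the exact `ocrx_cinfo` span format shown above and is not a general hOCR parser.

```python
import re
from typing import List, Tuple

# Matches spans exactly as quoted in this report (single-quoted attributes).
CINFO_RE = re.compile(
    r"<span class='ocrx_cinfo' "
    r"title='x_bboxes (\d+) (\d+) (\d+) (\d+); x_conf ([\d.]+)'>(.*?)</span>"
)


def parse_cinfo(hocr: str) -> List[Tuple[str, int, int, int, int, float]]:
    """Return (char, x0, y0, x1, y1, confidence) for every ocrx_cinfo span."""
    result = []
    for m in CINFO_RE.finditer(hocr):
        x0, y0, x1, y1 = map(int, m.group(1, 2, 3, 4))
        result.append((m.group(6), x0, y0, x1, y1, float(m.group(5))))
    return result


def box_to_hocr_y(bottom: int, top: int, image_height: int) -> Tuple[int, int]:
    """Convert box file y values (origin bottom-left) to hOCR (origin top-left)."""
    return image_height - top, image_height - bottom


HOCR = (
    "<span class='ocrx_cinfo' title='x_bboxes 210 22 218 52; x_conf 99.543304'>B</span> "
    "<span class='ocrx_cinfo' title='x_bboxes 210 23 234 52; x_conf 99.536743'>i</span>"
)

IMAGE_HEIGHT = 70  # assumption: inferred from the coordinates, not measured

chars = parse_cinfo(HOCR)
(first, second) = chars[0], chars[1]

# Horizontal overlap: the second character starts before the first one ends.
print("overlap:", second[1] < first[3])

# The flipped box file values for "B" (bottom=18, top=48) match the hOCR y range.
print("B y range from box file:", box_to_hocr_y(18, 48, IMAGE_HEIGHT))
```

If the height assumption holds, both outputs describe the same boxes, which suggests the overlap is produced before the renderers rather than by one output format.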
In the PDF:
Open it in a PDF viewer such as Acrobat and select "thousan".
Then press Ctrl-C and paste into an editor with Ctrl-V: the result is "thousand".
Comment: We see this in German as well. We always use the "best" or "fast" traineddata. In larger files there are many cases like the one above, but I wanted to keep the example as simple as possible.
We can also reproduce this when using the API directly, so I think the cause might lie deep in the system.
Versions 4.* and 5.* differ in the outcome: in version 4.*, the overlapping letters are sometimes different from those in 5.*.
Thank you for all your work. Have a great Christmas and a happy new year!