I am using the Mannheim Windows build:
tesseract v5.0.0-alpha.20210506
leptonica-1.78.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
Found AVX2
Found AVX
Found FMA
Found SSE4.1
Found libarchive 3.5.0 zlib/1.2.11 liblzma/5.2.3 bz2lib/1.0.6 liblz4/1.7.5 libzstd/1.4.5
Found libcurl/7.77.0-DEV Schannel zlib/1.2.11 zstd/1.4.5 libidn2/2.0.4 nghttp2/1.31.0
When I run this command:
tesseract test.png test -c tessedit_create_hocr=1 -c hocr_char_boxes=1
the resulting hOCR contains this:
<span class='ocrx_word' id='word_1_7' title='bbox 1547 347 1683 384; x_wconf 84'>
<span class='ocrx_cinfo' title='x_bboxes 1547 347 1567 376; x_conf 98.908447'>P</span>
<span class='ocrx_cinfo' title='x_bboxes 1571 354 1589 376; x_conf 99.026512'>a</span>
<span class='ocrx_cinfo' title='x_bboxes 1594 354 1607 376; x_conf 98.80246'>r</span>
<span class='ocrx_cinfo' title='x_bboxes 1609 349 1645 384; x_conf 98.968414'>t</span>
<span class='ocrx_cinfo' title='x_bboxes 1637 347 1661 384; x_conf 98.820137'>y</span>
<span class='ocrx_cinfo' title='x_bboxes 1657 347 1683 376; x_conf 97.777733'>A</span>
</span>
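This output incorrectly indicates that the "y" overlaps both the "t" and the "A": the "t" box extends to x=1645 while the "y" box starts at x=1637, and the "y" box extends to x=1661 while the "A" box starts at x=1657. I presume this is why "Party A" comes out as one word instead of two.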
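To make the overlap easy to reproduce, here is a minimal Python sketch that reads the character boxes out of the hOCR fragment above and reports adjacent pairs whose boxes overlap horizontally. The helper names `char_boxes` and `horizontal_overlaps` are my own and not part of Tesseract, and the regular expression assumes the exact `x_bboxes ...; x_conf ...` layout shown in this output.

```python
import re

# hOCR fragment copied from the output above.
HOCR = """
<span class='ocrx_word' id='word_1_7' title='bbox 1547 347 1683 384; x_wconf 84'>
<span class='ocrx_cinfo' title='x_bboxes 1547 347 1567 376; x_conf 98.908447'>P</span>
<span class='ocrx_cinfo' title='x_bboxes 1571 354 1589 376; x_conf 99.026512'>a</span>
<span class='ocrx_cinfo' title='x_bboxes 1594 354 1607 376; x_conf 98.80246'>r</span>
<span class='ocrx_cinfo' title='x_bboxes 1609 349 1645 384; x_conf 98.968414'>t</span>
<span class='ocrx_cinfo' title='x_bboxes 1637 347 1661 384; x_conf 98.820137'>y</span>
<span class='ocrx_cinfo' title='x_bboxes 1657 347 1683 376; x_conf 97.777733'>A</span>
</span>
"""

# Assumption: every character span carries "x_bboxes x0 y0 x1 y1; x_conf N".
CINFO = re.compile(
    r"x_bboxes (\d+) (\d+) (\d+) (\d+); x_conf [\d.]+'>([^<]*)</span>"
)


def char_boxes(hocr):
    """Return [(char, (x0, y0, x1, y1)), ...] in document order."""
    return [
        (m.group(5), tuple(int(m.group(i)) for i in range(1, 5)))
        for m in CINFO.finditer(hocr)
    ]


def horizontal_overlaps(boxes):
    """Return (left_char, right_char, overlap_px) for overlapping neighbours."""
    found = []
    for (left_char, left), (right_char, right) in zip(boxes, boxes[1:]):
        overlap = left[2] - right[0]  # left box's x1 minus right box's x0
        if overlap > 0:
            found.append((left_char, right_char, overlap))
    return found


if __name__ == "__main__":
    for left_char, right_char, px in horizontal_overlaps(char_boxes(HOCR)):
        print(f"'{left_char}' overlaps '{right_char}' by {px} px")
```

On this fragment the check flags only the "t"/"y" and "y"/"A" pairs; the other neighbouring boxes are separated by a few pixels.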