Tesseract character spacing issue #3449

@youngsys

I am using the Mannheim Windows build:
tesseract v5.0.0-alpha.20210506
leptonica-1.78.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
Found AVX2
Found AVX
Found FMA
Found SSE4.1
Found libarchive 3.5.0 zlib/1.2.11 liblzma/5.2.3 bz2lib/1.0.6 liblz4/1.7.5 libzstd/1.4.5
Found libcurl/7.77.0-DEV Schannel zlib/1.2.11 zstd/1.4.5 libidn2/2.0.4 nghttp2/1.31.0

[Attached image: test]

When I run this command:
tesseract test.png test -c tessedit_create_hocr=1 -c hocr_char_boxes=1

the resulting hOCR contains this:

  <span class='ocrx_word' id='word_1_7' title='bbox 1547 347 1683 384; x_wconf 84'>
   <span class='ocrx_cinfo' title='x_bboxes 1547 347 1567 376; x_conf 98.908447'>P</span>
   <span class='ocrx_cinfo' title='x_bboxes 1571 354 1589 376; x_conf 99.026512'>a</span>
   <span class='ocrx_cinfo' title='x_bboxes 1594 354 1607 376; x_conf 98.80246'>r</span>
   <span class='ocrx_cinfo' title='x_bboxes 1609 349 1645 384; x_conf 98.968414'>t</span>
   <span class='ocrx_cinfo' title='x_bboxes 1637 347 1661 384; x_conf 98.820137'>y</span>
   <span class='ocrx_cinfo' title='x_bboxes 1657 347 1683 376; x_conf 97.777733'>A</span>
  </span>

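which incorrectly indicates that the "y" overlaps both the "t" and the "A": the "y" box starts at x=1637, before the "t" box ends at x=1645, and the "A" box starts at x=1657, before the "y" box ends at x=1661. I presume this is why "Party A" becomes one word instead of two.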
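
To make the overlap easy to check, here is a minimal Python sketch that parses the character boxes from the hOCR fragment above and reports neighbouring characters whose boxes overlap horizontally. The helper names `char_boxes` and `horizontal_overlaps` are my own, not part of Tesseract, and the regular expression assumes the exact attribute layout shown in this output.

```python
import re

# hOCR fragment copied from the output above.
HOCR = """
<span class='ocrx_word' id='word_1_7' title='bbox 1547 347 1683 384; x_wconf 84'>
 <span class='ocrx_cinfo' title='x_bboxes 1547 347 1567 376; x_conf 98.908447'>P</span>
 <span class='ocrx_cinfo' title='x_bboxes 1571 354 1589 376; x_conf 99.026512'>a</span>
 <span class='ocrx_cinfo' title='x_bboxes 1594 354 1607 376; x_conf 98.80246'>r</span>
 <span class='ocrx_cinfo' title='x_bboxes 1609 349 1645 384; x_conf 98.968414'>t</span>
 <span class='ocrx_cinfo' title='x_bboxes 1637 347 1661 384; x_conf 98.820137'>y</span>
 <span class='ocrx_cinfo' title='x_bboxes 1657 347 1683 376; x_conf 97.777733'>A</span>
</span>
"""

# Assumes the attribute order and quoting seen in the fragment above.
CINFO = re.compile(
    r"<span class='ocrx_cinfo' title='x_bboxes (\d+) (\d+) (\d+) (\d+); "
    r"x_conf [\d.]+'>(.*?)</span>"
)


def char_boxes(hocr):
    """Return [(char, (x0, y0, x1, y1)), ...] in document order."""
    return [
        (m.group(5), tuple(int(m.group(i)) for i in range(1, 5)))
        for m in CINFO.finditer(hocr)
    ]


def horizontal_overlaps(boxes):
    """Return (left_char, right_char, overlap_px) for adjacent boxes
    where the left box ends after the right box starts."""
    result = []
    for (left, lbox), (right, rbox) in zip(boxes, boxes[1:]):
        overlap = lbox[2] - rbox[0]  # left x1 minus right x0
        if overlap > 0:
            result.append((left, right, overlap))
    return result


if __name__ == "__main__":
    for left, right, px in horizontal_overlaps(char_boxes(HOCR)):
        print(f"{left!r} overlaps {right!r} by {px} px")
```

On this fragment it flags only the "t"/"y" and "y"/"A" pairs; the other neighbouring boxes are separated by a gap of a few pixels.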
