
Tesseract Empty Page #3021

@M3ssman

Description

Environment

  • Tesseract Version: tesseract 4.1.1-rc2-21-gf4ef
    leptonica-1.78.0
    libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
    Found AVX2
    Found AVX
    Found FMA
    Found SSE
    Found libarchive 3.2.2 zlib/1.2.11 liblzma/5.2.2 bz2lib/1.0.6 liblz4/1.7.1
  • Platform: Ubuntu 18.04 LTS
  • Model Configs tested: frk, Fraktur (from tessdata_best), gt4hist_5000k (gt4hist-Model with 5000k Iterations)

Current Behavior:

When processing rather large uncompressed TIF files (ca. 80 MB) from the project "Digitalisierung historischer deutscher Zeitschriften", about 5 pages (or even fewer) out of 1000 images yield ALTO files without valid OCR data.

When run with tesseract 0046.tif 0046 -l frk alto, it only reports "Empty page!!" and exits in under 20 seconds.
0046-alto.zip
0046-tif.zip

Generated ALTO-File and TIF-Image included.
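
To find the affected pages in a larger batch, a check along the following lines may help. It is only a minimal sketch, not part of Tesseract: it assumes the ALTO output files end in .xml and that recognized words appear as String elements with a CONTENT attribute. The function names count_alto_strings and find_empty_alto are made up for illustration.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def count_alto_strings(alto_path):
    """Count String elements with non-empty CONTENT in one ALTO file."""
    tree = ET.parse(str(alto_path))
    count = 0
    for elem in tree.iter():
        # Strip any XML namespace so the check does not depend on the ALTO version.
        local_name = elem.tag.rsplit("}", 1)[-1]
        if local_name == "String" and elem.get("CONTENT", "").strip():
            count += 1
    return count


def find_empty_alto(directory):
    """Return the sorted names of ALTO files that contain no recognized words."""
    # Assumption: every *.xml file in the directory is an ALTO file.
    return sorted(
        path.name
        for path in Path(directory).glob("*.xml")
        if count_alto_strings(path) == 0
    )
```

Calling find_empty_alto on the output directory (for example a hypothetical "ocr_output" folder) should list files such as the attached 0046.xml, which can then be re-run or inspected individually.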

Expected Behavior:

Produce ALTO-XML with contents.

Suggested Fix:

No idea.
