Environment
- Tesseract Version: tesseract 4.1.1-rc2-21-gf4ef
leptonica-1.78.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
Found AVX2
Found AVX
Found FMA
Found SSE
Found libarchive 3.2.2 zlib/1.2.11 liblzma/5.2.2 bz2lib/1.0.6 liblz4/1.7.1
- Platform: Ubuntu 18.04 LTS
- Model Configs tested: frk, Fraktur (from tessdata_best), gt4hist_5000k (gt4hist model with 5000k iterations)
Current Behavior:
When processing rather large uncompressed TIF files (ca. 80 MB each) from the project "Digitalisierung historischer deutscher Zeitschriften", about 5 pages (or even fewer) out of 1000 images produce ALTO files that contain no valid OCR data.
When run with tesseract 0046.tif 0046 -l frk alto, it only reports "Empty page!!" and exits in under 20 seconds.
0046-alto.zip
0046-tif.zip
Generated ALTO-File and TIF-Image included.
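In case it helps with triage, here is a minimal sketch for finding the affected pages in a larger batch. It assumes the ALTO output files end in .xml and that recognized words appear as String elements with a CONTENT attribute, as in the ALTO schema. The script name and directory argument are hypothetical.

```python
import sys
import xml.etree.ElementTree as ET
from pathlib import Path


def local_name(tag):
    """Strip an XML namespace prefix such as '{http://...}String'."""
    return tag.rsplit("}", 1)[-1]


def count_alto_words(xml_data):
    """Count <String> elements that carry a non-empty CONTENT attribute.

    Accepts str or bytes. Matching is namespace-agnostic, so the ALTO
    schema version used by the writer does not matter.
    """
    root = ET.fromstring(xml_data)
    return sum(
        1
        for elem in root.iter()
        if isinstance(elem.tag, str)
        and local_name(elem.tag) == "String"
        and elem.get("CONTENT", "").strip()
    )


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python find_empty_alto.py /path/to/alto/output
    out_dir = Path(sys.argv[1])
    for path in sorted(out_dir.glob("*.xml")):
        if count_alto_words(path.read_bytes()) == 0:
            print(f"no OCR text: {path}")
```

Any file it lists has a page structure but no recognized words, which matches what we see for 0046.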
Expected Behavior:
Produce ALTO-XML with contents.
Suggested Fix:
No idea.