Short description
I am trying to train Tesseract on the Akkadian language. The language-specific.sh script was modified accordingly. When converting the training text to TIFF images, the text2image program crashes.
Environment
- Tesseract Version: 3.04.01
- Commit Number: the standard package in Ubuntu, package version 3.04.01-4, commit unknown
- Platform: Linux ubuntu-xenial 4.4.0-130-generic #156-Ubuntu SMP Thu Jun 14 08:53:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
The environment was created using Vagrant. The commands are run from the command line, without a GUI environment.
Running tesseract -v produces the following output:
tesseract 3.04.01
leptonica-1.73
libgif 5.1.2 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.1.0
Current Behavior:
When running tesstrain.sh with this command:
./tesstrain.sh --lang akk --training_text corpus-12pt.txt --tessdata_dir /usr/share/tesseract-ocr/tessdata --langdata_dir ../langdata --fonts_dir /usr/share/fonts --fontlist "CuneiformNAOutline Medium" "CuneiformOB" --output_dir .
text2image crashes on every font with this message:
cluster_text.size() == start_byte_to_box.size():Error:Assert failed:in file stringrenderer.cpp, line 541
As a result, no box files are generated, so tesstrain.sh exits with these messages:
ERROR: /tmp/tmp.XSb02nt10d/akk/akk.CuneiformOB.exp0.box does not exist or is not readable
ERROR: /tmp/tmp.XSb02nt10d/akk/akk.CuneiformNAOutline_Medium.exp0.box does not exist or is not readable
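To check which box files are missing without rerunning the whole training script, a small helper like the one below may be useful. It is only a sketch: the function names `box_name` and `missing_boxes` are hypothetical, and the naming rule (spaces in the font name replaced by underscores, followed by `.exp0.box`) is inferred from the two error messages above, not taken from the tesstrain.sh source.

```shell
# Hypothetical helper: derive the box file name for a language and font.
# Assumption (inferred from the error messages above): spaces in the font
# name become underscores and the suffix is ".exp0.box".
box_name() {
    lang=$1
    font=$2
    printf '%s.%s.exp0.box\n' "$lang" "$(printf '%s' "$font" | tr ' ' '_')"
}

# Print every expected box file under <dir>/<lang>/ that is missing or unreadable.
# Usage: missing_boxes <dir> <lang> <font>...
missing_boxes() {
    dir=$1
    lang=$2
    shift 2
    for font in "$@"; do
        path="$dir/$lang/$(box_name "$lang" "$font")"
        [ -r "$path" ] || printf '%s\n' "$path"
    done
}
```

For the run above, this would be called as `missing_boxes /tmp/tmp.XSb02nt10d akk "CuneiformNAOutline Medium" "CuneiformOB"`, provided the temporary directory still exists.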
Expected Behavior:
tesstrain.sh should create the box files and proceed with training.
Attachments:
I attached all files used: akktrain.zip.
The fonts are hosted here, but for the sake of completeness the .ttf files are included in the archive; they should be moved to /usr/share/fonts.