tesstrain.sh script exits with error #1781

@wincentbalin

Short description

I am trying to train Tesseract on the Akkadian language. I modified the language-specific.sh script accordingly. When converting the training text to TIFF images, the text2image program crashes.

Environment

  • Tesseract Version: 3.04.01
  • Commit Number: the standard package in Ubuntu, package version 3.04.01-4, commit unknown
  • Platform: Linux ubuntu-xenial 4.4.0-130-generic #156-Ubuntu SMP Thu Jun 14 08:53:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

The environment was created using Vagrant. The commands are run from the command line, without a GUI environment.

Running tesseract -v produces the following output:

tesseract 3.04.01
 leptonica-1.73
  libgif 5.1.2 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.1.0

Current Behavior:

When running tesstrain.sh with this command

./tesstrain.sh --lang akk --training_text corpus-12pt.txt --tessdata_dir /usr/share/tesseract-ocr/tessdata --langdata_dir ../langdata --fonts_dir /usr/share/fonts --fontlist "CuneiformNAOutline Medium" "CuneiformOB" --output_dir .

text2image crashes on every font with this message:

cluster_text.size() == start_byte_to_box.size():Error:Assert failed:in file stringrenderer.cpp, line 541

As a result, no box files are generated, so tesstrain.sh exits with these messages:

ERROR: /tmp/tmp.XSb02nt10d/akk/akk.CuneiformOB.exp0.box does not exist or is not readable
ERROR: /tmp/tmp.XSb02nt10d/akk/akk.CuneiformNAOutline_Medium.exp0.box does not exist or is not readable
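To confirm which fonts failed without reading the whole log, the missing box files can be checked directly. This is a minimal sketch, not part of tesstrain.sh: the helper names expected_box_path and check_box_files are hypothetical, and the naming scheme (spaces in the font name replaced by underscores, suffix .exp0.box) is assumed only from the two error messages above.

```shell
#!/bin/sh
# Hypothetical helper: build the box-file path that tesstrain.sh appears to
# expect, inferred from the error messages above (assumption, not taken from
# the tesstrain.sh source): <dir>/<lang>.<font with '_' for ' '>.exp0.box
expected_box_path() {
    dir="$1"
    lang="$2"
    # Replace spaces in the font name with underscores.
    font="$(printf '%s' "$3" | tr ' ' '_')"
    printf '%s/%s.%s.exp0.box\n' "$dir" "$lang" "$font"
}

# Hypothetical helper: report, for each font name given, whether the expected
# box file exists and is readable. Returns non-zero if any file is missing.
check_box_files() {
    cb_dir="$1"
    cb_lang="$2"
    shift 2
    cb_missing=0
    for cb_font in "$@"; do
        cb_path="$(expected_box_path "$cb_dir" "$cb_lang" "$cb_font")"
        if [ -r "$cb_path" ]; then
            echo "OK: $cb_path"
        else
            echo "MISSING: $cb_path"
            cb_missing=1
        fi
    done
    return "$cb_missing"
}
```

With the temporary directory from the messages above, a call would look like: check_box_files /tmp/tmp.XSb02nt10d/akk akk "CuneiformNAOutline Medium" "CuneiformOB"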

Expected Behavior:

tesstrain.sh should create the box files and proceed with training.

Attachments:

I attached all files used: akktrain.zip.

The fonts are hosted here, but for the sake of completeness the .ttf files are included in the archive; they should be moved to /usr/share/fonts.
