Description
Current Behavior
I cloned Tesseract git HEAD (currently d3e50cf), then configured and built it like this:
CXXFLAGS='-g -O0 -fno-omit-frame-pointer' ./configure --prefix=$PWD/installed --enable-debug --disable-legacy && make install
The next step is copying the attached gsta.traineddata.gz into the build directory, decompressing it with gzip -vvd gsta.traineddata.gz, and also placing the attached lorem-ipsum.png into the build directory.

Next, I run Tesseract like this:
TESSDATA_PREFIX=$PWD ./installed/bin/tesseract -l gsta lorem-ipsum.png out.txt && cat ~/src/mupdf/out.txt.txt
This produces the error messages:
Error: LSTM requested, but not present!! Loading tesseract.
Estimating resolution as 513
no best words!!
no best words!!
...
Segmentation fault
Re-running with valgrind:
TESSDATA_PREFIX=$PWD valgrind ./installed/bin/tesseract -l gsta lorem-ipsum.png out.txt && cat ~/src/mupdf/out.txt.txt
This produces a stack trace showing the issue:
at 0x4B6785C: tesseract::WERD_CHOICE::permuter() const (ratngs.h:332)
by 0x4B63F8E: tesseract::Tesseract::recog_all_words(tesseract::PAGE_RES*, tesseract::ETEXT_DESC*, tesseract::TBOX const*, char const*, int) (control.cpp:356)
by 0x4B0B7E9: tesseract::TessBaseAPI::Recognize(tesseract::ETEXT_DESC*) (baseapi.cpp:832)
by 0x4B0CD0C: tesseract::TessBaseAPI::ProcessPage(Pix*, int, char const*, char const*, int, tesseract::TessResultRenderer*) (baseapi.cpp:1217)
by 0x4B0CA95: tesseract::TessBaseAPI::ProcessPagesInternal(char const*, char const*, int, tesseract::TessResultRenderer*) (baseapi.cpp:1180)
by 0x4B0BFF6: tesseract::TessBaseAPI::ProcessPages(char const*, char const*, int, tesseract::TessResultRenderer*) (baseapi.cpp:997)
by 0x114687: main1(int, char**) (tesseract.cpp:846)
by 0x1147DC: main (tesseract.cpp:858)
Address 0x8c is not stack'd, malloc'd or (recently) free'd
Re-running again in gdb and investigating the state at the segmentation fault:
TESSDATA_PREFIX=$PWD gdb --args ./installed/bin/tesseract -l gsta lorem-ipsum.png out.txt && cat ~/src/mupdf/out.txt.txt
... reveals that if (page_res_it.word()->best_choice->permuter() == USER_DAWG_PERM) { is the culprit; specifically, page_res_it.word()->best_choice is NULL (incidentally, the same is true for page_res_it.word()->raw_choice, but I don't know if that matters).
The problem is the crash, which appears to be caused by dereferencing a NULL pointer.
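To illustrate the failing pattern, here is a minimal sketch of one possible guard. It uses hypothetical stand-in types: WordChoice, WordResult and IsUserDawgWord are illustrative names, not Tesseract's real classes, and this is not necessarily how the actual fix looks. The idea is simply to check best_choice before calling permuter() on it. The valgrind address 0x8c is consistent with reading a member at a small offset from a NULL object pointer.

```cpp
#include <cassert>

// Hypothetical stand-ins for illustration only; NOT Tesseract's real classes.
enum PermuterType { NO_PERM, USER_DAWG_PERM };

struct WordChoice {
  PermuterType permuter_type = NO_PERM;
  // Reads a member through `this`; crashes if called on a null pointer.
  PermuterType permuter() const { return permuter_type; }
};

struct WordResult {
  // May stay null when recognition produced no best word
  // (assumption: this is the "no best words!!" case).
  const WordChoice *best_choice = nullptr;
};

// Returns true only if a best choice exists and came from the user dawg.
// The null checks run before any member access, so a word without a
// best choice is skipped instead of being dereferenced.
bool IsUserDawgWord(const WordResult *word) {
  if (word == nullptr || word->best_choice == nullptr) {
    return false;
  }
  return word->best_choice->permuter() == USER_DAWG_PERM;
}
```

Whether skipping such words is the right behavior in recog_all_words, or whether best_choice should never be NULL at that point in the first place, is a question for the maintainers.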
The gsta.traineddata originated from [email protected], who reported a bug to us at MuPDF. As you can see, I have managed to reproduce the problem without involving MuPDF, so I believe that this issue is best fixed in Tesseract.
Expected Behavior
No segmentation fault.
Suggested Fix
I will shortly provide a pull request in which I attempt to fix this segmentation fault. I hope that the fix is correct and will prove useful to you. Yes, I have run ./configure && make check before and after my proposed fix, and the results look identical.
tesseract -v
tesseract 5.5.1-9-gd3e5
leptonica-1.82.0
libgif 5.2.1 : libjpeg 6b (libjpeg-turbo 2.1.2) : libpng 1.6.43 : libtiff 4.5.1 : zlib 1.2.13 : libwebp 1.3.2 : libopenjp2 2.5.0
Found AVX2
Found AVX
Found FMA
Found SSE4.1
Found OpenMP 201511
Found libarchive 3.7.2 zlib/1.2.13 liblzma/5.4.4 bz2lib/1.0.8 liblz4/1.9.4 libzstd/1.5.5
Found libcurl/8.8.0 OpenSSL/3.1.5 zlib/1.3.1 brotli/1.1.0 zstd/1.5.5 libidn2/2.3.7 libpsl/0.21.2 libssh2/1.11.0 nghttp2/1.59.0 librtmp/2.3 OpenLDAP/2.5.13
Operating System
Debian 12 Bookworm
Other Operating System
No response
uname -a
Linux hal4001 6.6.15-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.6.15-2 (2024-02-04) x86_64 GNU/Linux
Compiler
gcc (Debian 13.2.0-25) 13.2.0
CPU
Intel Core i7-5600U CPU @ 2.60GHz
Virtualization / Containers
N/A
Other Information
I did my best to follow the contribution guidelines and fill in the bug report template. If I missed something, please help me understand what I have missed. :)