OCR - Digitalize Scans

Reads PDFs and images (JPG and PNG by default) and writes the recognized text to text files.

Dependencies

Install Tesseract OCR and its German language data (Debian/Ubuntu):

sudo apt install tesseract-ocr tesseract-ocr-deu

Install the Python environment:

poetry install

Convert PDFs and images in the current directory to text files:

poetry run ./digitize.py .

Run poetry run ./digitize.py -h to see more options.

Example:

poetry run ./digitize.py --exclude DSC IMAG foto picture photo book -r -- ~/sync/private/
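The example above appears to walk a directory tree and skip files whose names contain one of the listed words. The following is a minimal sketch of that kind of selection, not the code in digitize.py. It assumes that --exclude is a case-sensitive substring match on the file name and that -r means recursive descent; both are guesses, and digitize.py -h is the authority. The helper find_scans and the DEFAULT_SUFFIXES set are hypothetical names.

```python
from pathlib import Path

# Assumed defaults: this README names pdf, jpg and png as the default types.
DEFAULT_SUFFIXES = {".pdf", ".jpg", ".png"}


def find_scans(root, exclude=(), recursive=False, suffixes=DEFAULT_SUFFIXES):
    """Return the sorted paths under root that look like scans.

    Hypothetical helper: an exclude word is assumed to match when it
    occurs anywhere in the file name (case-sensitive).
    """
    root = Path(root)
    candidates = root.rglob("*") if recursive else root.glob("*")
    selected = []
    for path in candidates:
        if not path.is_file():
            continue  # skip directories
        if path.suffix.lower() not in suffixes:
            continue  # not a PDF or image type we handle
        if any(word in path.name for word in exclude):
            continue  # file name contains an excluded word
        selected.append(path)
    return sorted(selected)
```

With this sketch, find_scans("~/sync/private", exclude=["DSC", "IMAG"], recursive=True) would need the path expanded first (for example with Path.expanduser), since pathlib does not expand ~ on its own.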

You may want to exclude the generated files, which match the pattern *_ocr.txt, from synchronization.
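As a rough illustration of what excluding the generated files means, the sketch below filters a list of paths against the *_ocr.txt pattern. The function name files_to_sync is hypothetical and is not part of digitize.py; only the pattern itself comes from this README.

```python
from fnmatch import fnmatch
from pathlib import PurePath

OCR_PATTERN = "*_ocr.txt"  # pattern of the generated files, as named in this README


def files_to_sync(paths, pattern=OCR_PATTERN):
    """Drop every path whose file name matches the generated-output pattern."""
    return [p for p in paths if not fnmatch(PurePath(p).name, pattern)]
```

Most sync tools accept such a glob directly, for example rsync with --exclude='*_ocr.txt', so a helper like this is only needed when the file list is built in Python.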
