Description
Please provide all mandatory information!
Describe the bug (mandatory)
PyMuPDF is not extracting all of the text in the PDF. Some parts are neither recognized nor extracted. I have checked whether the unrecognized content is an image or something other than text, and concluded that it is text: when I open the PDF in a PDF reader, I can copy the text from that part and paste it elsewhere. The text objects in that part are displayed in Foxit Reader and Acrobat Reader DC but are not recognized by MuPDF.
To Reproduce (mandatory)
I have executed this code:
import fitz
pdf_document = "mypdf.pdf"
doc = fitz.open(pdf_document)
page1 = doc.loadPage(0)
page1text = page1.getText("text", flags=0)
print(page1text)
(If you need the PDF and/or the output, please tell me so I can send it to you via email or DM.)
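To check whether the gap depends on the extraction settings, the call above could be compared against other modes. This is only a minimal diagnostic sketch, not a fix: `compare_extraction_modes` and `main` are hypothetical helper names (not part of PyMuPDF), and the camelCase methods are the ones available in PyMuPDF 1.18.12, the version reported below.

```python
def compare_extraction_modes(page):
    """Return character/word counts for a few extraction settings.

    `page` is assumed to be a PyMuPDF page object (or anything that
    offers the same getText method).
    """
    default_text = page.getText("text")            # library default flags
    no_flags_text = page.getText("text", flags=0)  # same call as in this report
    words = page.getText("words")                  # one tuple per extracted word
    return {
        "chars_default_flags": len(default_text),
        "chars_flags_0": len(no_flags_text),
        "words": len(words),
    }


def main(path="mypdf.pdf"):
    # Imported here so the helper above has no hard dependency on PyMuPDF.
    import fitz

    doc = fitz.open(path)
    page = doc.loadPage(0)
    for name, count in compare_extraction_modes(page).items():
        print(f"{name}: {count}")
```

Calling `main("mypdf.pdf")` prints the three counts. If they are all similarly low, the missing text is probably not related to the `flags` argument.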
Expected behavior (optional)
All the text should have been extracted. However, only part of the PDF is extracted (see screenshot below).
Screenshots (optional)
An example of what is recognized and extracted from the PDF (red box), and the part that is not extracted, recognized, or shown by MuPDF (I cannot publish the PDF):
Your configuration (mandatory)
PyMuPDF 1.18.12: Python bindings for the MuPDF 1.18.0 library.
Version date: 2021-04-10 04:00:00.
Built for Python 3.8 on win32 (64-bit).
Additional context (optional)
I am not sure why this is happening. My guess is that the text is defined in the wrong part of the PDF, so the PDF itself may have some issues.
Thanks in advance
