Incorrectly parsing OCMDs leads to incomplete renderings of the page #1022

@ffc01

Please provide all mandatory information!

Describe the bug (mandatory)

PyMuPDF does not extract all of the text in the PDF. Some parts are neither recognized nor extracted. I checked whether the unrecognized text is actually an image or some other non-text object, but concluded that it is real text: when I open the PDF in a PDF reader, I can select, copy, and paste it. The text objects in that part are displayed by Foxit Reader and Acrobat Reader DC but are not recognized by MuPDF.

To Reproduce (mandatory)

I have executed this code:

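import fitz  # PyMuPDF

pdf_document = "mypdf.pdf"
doc = fitz.open(pdf_document)

# Load the first page (page numbers are 0-based)
page1 = doc.loadPage(0)

# Extract plain text; flags=0 switches off all optional extraction flags
page1text = page1.getText("text", flags=0)

print(page1text)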

(If you need the PDF and/or the output, please tell me so I can send it to you via email or DM.)

Expected behavior (optional)

All the text should have been extracted, but only part of the text in the PDF is (see the screenshot below).
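To make the gap concrete without sharing the PDF, a small check like the following could list which known passages are absent from the extracted text. This is only a sketch: `missing_snippets` is a hypothetical helper, not part of PyMuPDF, and the sample strings are made up; in practice `extracted` would be the `page1text` from the code above and `expected` would be passages copied from a PDF reader.

```python
import re


def missing_snippets(extracted, expected):
    """Return the expected snippets that do not occur in the extracted text.

    Whitespace runs and letter case are normalized first, so differences
    in line breaks or spacing are not reported as missing text.
    """
    def norm(s):
        return re.sub(r"\s+", " ", s).strip().lower()

    haystack = norm(extracted)
    return [s for s in expected if norm(s) not in haystack]


# Made-up stand-ins for the real extraction result and the real page content.
sample_extracted = "Invoice 2021\nTotal:   42 EUR\n"
sample_expected = ["Invoice 2021", "Total: 42 EUR", "Shipping address"]

print(missing_snippets(sample_extracted, sample_expected))
```

The snippets it returns are the ones worth quoting in the report, since they pin down exactly which part of the page is not extracted.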

Screenshots (optional)

An example of what gets recognized/extracted in the PDF (red box) and the part that is not extracted, recognized, or shown by MuPDF (I cannot publish the PDF):

[Screenshot: mupdf]

Your configuration (mandatory)

PyMuPDF 1.18.12: Python bindings for the MuPDF 1.18.0 library.
Version date: 2021-04-10 04:00:00.
Built for Python 3.8 on win32 (64-bit).

Additional context (optional)

I am not sure why this is happening, but my guess is that the text is defined in the wrong place in the PDF, so the PDF itself may have some issues.
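Since the issue title points at OCMDs (optional content membership dictionaries), one rough way to see whether a file uses optional content at all is to scan its raw bytes for the relevant PDF name tokens. The sketch below uses only the standard library; `count_optional_content_markers` is a hypothetical helper, and the sample bytes are made up. It is only a hint: objects stored inside compressed object streams are invisible to a byte scan, so a count of zero does not prove that optional content is absent.

```python
import re

# PDF name tokens related to optional content (layers).
MARKERS = (b"/OCProperties", b"/OCG", b"/OCMD")


def count_optional_content_markers(data):
    """Count occurrences of optional-content name tokens in raw PDF bytes."""
    counts = {}
    for marker in MARKERS:
        # Reject matches followed by another letter or digit, so that
        # "/OCG" is not counted inside the longer key "/OCGs".
        pattern = re.escape(marker) + rb"(?![A-Za-z0-9])"
        counts[marker.decode("ascii")] = len(re.findall(pattern, data))
    return counts


# Made-up fragment standing in for the bytes of a real file.
sample = (
    b"<< /Type /OCMD /OCGs [5 0 R] /P /AnyOn >>\n"
    b"<< /Type /OCG /Name (Layer 1) >>\n"
    b"<< /OCProperties << /OCGs [5 0 R] >> >>\n"
)

print(count_optional_content_markers(sample))
```

For a real file, the bytes would come from something like `open("mypdf.pdf", "rb").read()`. Non-zero counts for `/OCMD` would at least be consistent with the missing text sitting in conditionally visible content.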

Thanks in advance

Labels

postpone (postpone to a future version), upstream bug (bug outside this package)
