Filtering Word-Generated XObject Artifacts in DOCX PDFs #4837
Unanswered
MSY99
asked this question in
Looking for help
Replies: 1 comment
Sorry, I'm afraid we are no wiser than you here. If you have access to the original Office files, try to stick with one piece of software for the import before you invest too much effort here.
❗ Filtering Word-Generated XObject Artifacts in DOCX PDFs
Hi,
When extracting images from PDFs using PyMuPDF, I am seeing a large number of unwanted XObjects in PDFs generated from DOCX files—especially when the document contains shapes, charts, or WordArt.
These DOCX PDFs often include:
has-mask = true)
Even if the page visually contains only one actual image, page.get_images(full=True) may return many XObjects.
📌 What I’m trying to understand
1. Which XObject properties are most reliable for identifying Word-generated artifacts?
Examples I am examining:
has-mask = true)
Is there a commonly recommended approach in PyMuPDF to distinguish real content images from layout/background fallback images generated by Word?
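To compare candidate properties side by side, it may help to dump a few entries of each XObject's PDF dictionary. The sketch below assumes a PyMuPDF version that provides Document.xref_get_key(); the helper name xobject_summary and the choice of keys are illustrative assumptions, and none of these keys is known to identify Word artifacts reliably.

```python
# Dictionary keys worth comparing across XObjects (an illustrative selection).
KEYS = (
    "Subtype", "Width", "Height", "SMask", "Mask",
    "ImageMask", "Filter", "ColorSpace", "BitsPerComponent",
)


def xobject_summary(doc, xref):
    """Collect selected dictionary entries of one XObject.

    `doc` is expected to be an open PyMuPDF Document. Keys that are
    absent are reported by xref_get_key() with type "null" and are skipped.
    """
    summary = {}
    for key in KEYS:
        kind, value = doc.xref_get_key(xref, key)
        if kind != "null":
            summary[key] = value
    return summary
```

Calling xobject_summary(doc, item[0]) for each item returned by page.get_images(full=True) gives one small dict per XObject, which can then be grouped or diffed by hand.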
📌 Minimal test code
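A minimal sketch of this kind of listing, assuming a recent PyMuPDF install. The helper names describe_image and list_images are hypothetical, and treating a non-zero smask entry as "has-mask = true" is an assumption about what that flag refers to.

```python
# Field order of the tuples returned by page.get_images(full=True).
FIELDS = (
    "xref", "smask", "width", "height", "bpc",
    "colorspace", "alt_colorspace", "name", "filter", "referencer",
)


def describe_image(item):
    """Turn one get_images() tuple into a dict with named fields."""
    info = dict(zip(FIELDS, item))
    # Assumption: a non-zero smask xref is what "has-mask = true" refers to.
    info["has_mask"] = info["smask"] > 0
    return info


def list_images(path):
    """Return one dict per image reference on every page of the PDF at `path`."""
    try:
        import pymupdf  # module name in newer releases
    except ImportError:
        import fitz as pymupdf  # legacy module name
    rows = []
    with pymupdf.open(path) as doc:
        for page in doc:
            for item in page.get_images(full=True):
                info = describe_image(item)
                info["page"] = page.number
                # Rectangles where this image is actually shown on the page.
                info["rects"] = [tuple(r) for r in page.get_image_rects(info["xref"])]
                rows.append(info)
    return rows
```

Printing the rows for a DOCX-derived PDF next to a PDF with ordinary embedded photos is one way to see which fields differ in practice.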
📌 Question for the PyMuPDF team
Are there specific XObject dictionary keys or structural PDF properties that reliably distinguish Word-generated artifacts from meaningful images?
Or alternatively:
Is operator-level inspection (Do, BI, ID) the recommended strategy for this case?
Any guidance would be greatly appreciated.
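For reference, a rough sketch of what operator-level inspection could look like, not a recommended or official method. It assumes the page's content stream is read with page.read_contents() and scanned with regular expressions; the function names are hypothetical. A regex scan is only an approximation: string operands or inline image data can produce false matches, and Do calls nested inside Form XObjects are not visible at page level.

```python
import re

# "/Name Do" invokes an XObject (an image or a form) by its resource name.
DO_PATTERN = re.compile(rb"/([^\s/\[\]()<>{}%]+)\s+Do\b")
# "BI" opens an inline image, followed later by "ID <data> EI".
INLINE_PATTERN = re.compile(rb"(?:^|\s)BI\s")


def scan_content_stream(stream):
    """Return (names invoked via Do, number of BI tokens) for raw stream bytes."""
    names = [m.group(1).decode("latin-1") for m in DO_PATTERN.finditer(stream)]
    inline_count = len(INLINE_PATTERN.findall(stream))
    return names, inline_count


def scan_page(page):
    """Scan the concatenated content streams of a PyMuPDF page."""
    return scan_content_stream(page.read_contents())
```

Names that appear in page.get_images(full=True) but are never invoked by a page-level Do are candidates for closer inspection, since they may be drawn indirectly through a Form XObject or not drawn at all.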