Filtering Word-Generated XObject Artifacts in DOCX PDFs #4837

MSY99 · 2025-12-11T09:21:25Z

❗ Filtering Word-Generated XObject Artifacts in DOCX PDFs

Hi,
When extracting images from PDFs using PyMuPDF, I am seeing a large number of unwanted XObjects in PDFs generated from DOCX files—especially when the document contains shapes, charts, or WordArt.

These DOCX PDFs often include:

monochrome or page-sized background bitmaps
repeated mask objects (has-mask = true)
fallback rasterizations of vector elements

Even if the page visually contains only one actual image, page.get_images(full=True) may return many XObjects.
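One quick check for the mismatch described above is to compare the xrefs that `page.get_images(full=True)` lists with the xrefs that `page.get_image_info(xrefs=True)` reports as actually placed on the page. This is only a sketch: the function name `split_listed_vs_shown` is made up for illustration, and it assumes inline images (reported with xref 0) can be ignored.

```python
def split_listed_vs_shown(page):
    """Separate image xrefs that are placed on the page from those only listed.

    `page` is expected to be a PyMuPDF Page. The helper name is hypothetical.
    """
    # Every image XObject the page's resources refer to.
    listed = {img[0] for img in page.get_images(full=True)}

    # Images that are actually drawn; xref 0 means an inline image, skipped here.
    shown = {
        info["xref"]
        for info in page.get_image_info(xrefs=True)
        if info.get("xref")
    }

    return {
        "shown": sorted(listed & shown),
        "listed_only": sorted(listed - shown),
    }
```

Entries under `listed_only` are referenced by the page but never drawn directly by it, so they are reasonable first candidates to skip.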

📌 What I’m trying to understand

1. Which XObject properties are most reliable for identifying Word-generated artifacts?

Examples I am examining:

repeated digest
very large or very small bounding boxes
color space (DeviceGray vs. DeviceRGB)
Filters (DCTDecode / Flate / JPX)
mask usage (has-mask = true)
transform matrix patterns
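Several of the properties listed above can be read in one pass from `page.get_image_info(hashes=True, xrefs=True)`. The sketch below only attaches reasons to each placement; it does not decide anything. The function name, the thresholds (`page_cover`, `min_side`) and the reason labels are assumptions chosen for illustration, not values recommended by PyMuPDF, and keys are read with `.get()` in case a given version does not report them.

```python
from collections import Counter


def flag_artifact_candidates(page, page_cover=0.9, min_side=8):
    """Return (info, reasons) pairs for every image placement on `page`.

    Thresholds are illustrative guesses: `page_cover` is the fraction of the
    page area a bbox must cover to count as page-sized, `min_side` is the
    smallest bbox side (in points) not considered tiny.
    """
    page_area = abs(page.rect.width * page.rect.height)
    infos = page.get_image_info(hashes=True, xrefs=True)

    # Count how often the same image content (MD5 digest) is placed.
    digest_counts = Counter(i.get("digest") for i in infos if i.get("digest"))

    results = []
    for info in infos:
        x0, y0, x1, y1 = info["bbox"]
        width, height = abs(x1 - x0), abs(y1 - y0)
        reasons = []
        if page_area and (width * height) / page_area >= page_cover:
            reasons.append("page-sized")
        if min(width, height) < min_side:
            reasons.append("tiny")
        if digest_counts.get(info.get("digest"), 0) > 1:
            reasons.append("repeated-digest")
        if info.get("has-mask"):
            reasons.append("has-mask")
        if info.get("colorspace") == 1:  # one color component, i.e. gray
            reasons.append("grayscale")
        results.append((info, reasons))
    return results
```

Keeping the reasons separate from the keep/drop decision makes it easier to see which signals actually correlate with artifacts in a given set of files before hard-coding a rule.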

Is there a commonly recommended approach in PyMuPDF to distinguish real content images from layout/background fallback images generated by Word?

📌 Minimal test code

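from io import BytesIO

from PIL import Image


def is_grayscale_image(pil):
    # Helper not shown in the original post; assumed here to be a simple mode check.
    return pil.mode in ("1", "L", "LA")


def filter_images_with_debug(page, doc, min_width=30, min_height=30, exclude_grayscale=True):
    images = page.get_images(full=True)
    filtered = []

    # get_image_info() takes no xref argument; request xrefs once and match them below.
    try:
        infos = page.get_image_info(xrefs=True)
    except Exception as e:
        print("image_info error:", e)
        infos = []

    for img in images:
        xref = img[0]

        # Show the bbox of every placement of this image on the page
        for meta in infos:
            if meta.get("xref") == xref:
                print("bbox:", meta.get("bbox"))

        # Extract raster
        try:
            base = doc.extract_image(xref)
            if not base or "image" not in base:
                continue

            pil = Image.open(BytesIO(base["image"]))

            # Drop images below the minimum pixel size
            if pil.width < min_width or pil.height < min_height:
                continue
            if exclude_grayscale and is_grayscale_image(pil):
                continue

            filtered.append({
                "xref": xref,
                "width": pil.width,
                "height": pil.height,
                "format": base.get("ext", "png")
            })

        except Exception as e:
            print("extract error:", e)

    return filtered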

📌 Question for the PyMuPDF team

Are there specific XObject dictionary keys or structural PDF properties that reliably distinguish Word-generated artifacts from meaningful images?

Or alternatively:

Is operator-level inspection (`Do`, `BI`, `ID`) the recommended strategy for this case?
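For context, this is roughly what a simple operator-level check could look like: count how often each XObject name is invoked with `Do` in the page's content stream (via `page.read_contents()`) and map the names back to the xrefs from `page.get_images(full=True)`. It is a rough sketch with known gaps: the regex does not parse PDF syntax properly (a matching sequence inside a string or inline image data would be miscounted), `Do` also invokes Form XObjects, and images drawn inside a form do not appear in the page-level stream at all. The names `DO_PATTERN`, `count_do_invocations` and `image_invocations` are made up for illustration.

```python
import re
from collections import Counter

# "/Name Do" with the name captured; a heuristic, not a real content-stream parser.
DO_PATTERN = re.compile(rb"/([^\s/\[\]<>(){}%]+)\s+Do\b")


def count_do_invocations(content: bytes) -> Counter:
    """Count `Do` invocations per XObject name in raw content-stream bytes."""
    return Counter(
        match.group(1).decode("latin-1") for match in DO_PATTERN.finditer(content)
    )


def image_invocations(page):
    """Map each image xref on `page` to the number of direct `Do` calls."""
    counts = count_do_invocations(page.read_contents())
    # Item 7 of a get_images(full=True) entry is the name used in the stream.
    return {img[0]: counts.get(img[7], 0) for img in page.get_images(full=True)}
```

An image with zero direct invocations is either unused or only drawn from inside a Form XObject, which this sketch cannot tell apart.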

Any guidance would be greatly appreciated.

JorjMcKie (Maintainer) · 2025-12-12T08:18:56Z

Sorry, I'm afraid we are no wiser than you here.
What I've observed, though, is that the internal PDF structure differs significantly depending on the Word -> PDF export method: LibreOffice creates a very different internal structure from Word itself, which in turn is very different from the output of any Print-To-PDF software.

If you have access to the original Office files, try to stick with one piece of software for the conversion before you invest too much effort here.
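When the original Office files are not available, the `creator` and `producer` entries of `doc.metadata` can at least hint at which export path produced a file, so that heuristics are tuned per producer. The sketch below is an assumption-heavy illustration: the function name and the marker strings it looks for are guesses about what these tools typically write, not documented values, and either field may be empty or misleading.

```python
def guess_export_path(metadata):
    """Guess the export path from a PyMuPDF `doc.metadata` dictionary.

    The marker strings are assumptions and should be checked against real files.
    """
    metadata = metadata or {}
    text = " ".join(
        str(metadata.get(key) or "") for key in ("creator", "producer")
    ).lower()

    if "libreoffice" in text:
        return "libreoffice"
    if "print to pdf" in text:
        return "print-to-pdf"
    if "microsoft" in text and "word" in text:
        return "word"
    return "unknown"
```

A typical use would be `guess_export_path(doc.metadata)` right after opening the file, with a separate set of filter thresholds per returned label.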
