Add ignore_decode_errors option to Image feature for robust decoding #7612 #7638

Open: ArjunJagdale wants to merge 3 commits into main

ArjunJagdale
Contributor

This PR implements support for robust image decoding in the Image feature, as discussed in issue #7612.

🔧 What was added

  • A new boolean field: ignore_decode_errors (default: False)
  • If set to True, any exceptions during decoding will be caught, and None will be returned instead of raising an error
from datasets import Features, Image

features = Features({
    # ignore_decode_errors is the option proposed in this PR
    "image": Image(decode=True, ignore_decode_errors=True),
})

This enables robust iteration over potentially corrupted datasets — especially useful when streaming datasets like WebDataset or image-heavy public sets where sample corruption is common.

🧪 Behavior

  • If ignore_decode_errors=False (default), decoding behaves exactly as before

  • If True, decoding errors are caught, and a warning is emitted:

    [Image.decode_example] Skipped corrupted image: ...
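
The behavior described above can be sketched in isolation. This is a minimal, library-independent illustration and not the PR's actual diff: `decode_example` is written here as a hypothetical free function, and `fake_decoder` is an assumed stand-in for the real image decoder.

```python
import warnings
from typing import Any, Callable, Optional


def decode_example(
    payload: bytes,
    decoder: Callable[[bytes], Any],
    ignore_decode_errors: bool = False,
) -> Optional[Any]:
    """Decode `payload`, optionally turning decoding failures into None.

    Hypothetical helper: `decoder` stands in for the real image decoder.
    """
    try:
        return decoder(payload)
    except Exception as err:
        if not ignore_decode_errors:
            # Default path: behave exactly as before and propagate the error.
            raise
        # Opt-in path: warn and signal the bad sample with None.
        warnings.warn(f"[Image.decode_example] Skipped corrupted image: {err}")
        return None


def fake_decoder(payload: bytes) -> str:
    """Stand-in decoder (assumption): accepts only payloads with a PNG signature."""
    if not payload.startswith(b"\x89PNG"):
        raise ValueError("cannot identify image file")
    return "decoded-image"
```

With the flag off, `decode_example(b"garbage", fake_decoder)` raises `ValueError` as usual; with `ignore_decode_errors=True` it emits the warning and returns `None`.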
    

🧵 Linked issue

Closes #7612

Let me know if you'd like a follow-up test PR. Happy to write one!

@ArjunJagdale
Contributor Author

cc @lhoestq

@Seas0

Seas0 commented Jul 3, 2025

I think splitting the error handling for the main image decoding process and the metadata decoding process is possibly a bit nicer, as some images do render correctly, but their metadata might be invalid and cause the pipeline to fail, which I've encountered recently as in #7668.

The decode_image function in torchvision handles similar cases by using the apply_exif_orientation flag to turn off the exif metadata processing entirely.

@ArjunJagdale
Contributor Author

ArjunJagdale commented Jul 3, 2025

> I think splitting the error handling for the main image decoding process and the metadata decoding process is possibly a bit nicer, as some images do render correctly, but their metadata might be invalid and cause the pipeline to fail, which I've encountered recently as in #7668.
>
> The decode_image function in torchvision handles similar cases by using the apply_exif_orientation flag to turn off the exif metadata processing entirely.

@lhoestq & @Seas0 — that makes total sense.

Currently, if reading EXIF metadata (e.g. via .getexif()) fails because of malformed tags, the whole image gets dropped even if it renders correctly, which is not ideal.

To address this, I'm planning to split the EXIF handling into a separate try/except block, like:

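try:
    # Reading EXIF can fail on malformed tags even when the pixels decode fine.
    exif = image.getexif()
    if exif.get(PIL.Image.ExifTags.Base.Orientation) is not None:
        image = PIL.ImageOps.exif_transpose(image)
except Exception as exif_err:
    if self.ignore_decode_errors:
        # Keep the already-decoded image; only the orientation fix is skipped.
        warnings.warn(f"[Image.decode_example] Skipped EXIF metadata: {exif_err}")
    else:
        raise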

This way, images that are valid but have broken EXIF data will still be returned, and EXIF failures will be skipped only if ignore_decode_errors=True.

Sound good?
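
To make the two-stage idea concrete, here is a rough, self-contained sketch of the control flow. It is an illustration under assumptions, not the code in this PR: `decode_with_separate_exif`, `load_pixels`, and `apply_orientation` are hypothetical names, with the two callables standing in for pixel decoding and EXIF orientation handling.

```python
import warnings
from typing import Any, Callable, Optional


def decode_with_separate_exif(
    payload: bytes,
    load_pixels: Callable[[bytes], Any],
    apply_orientation: Callable[[Any], Any],
    ignore_decode_errors: bool = False,
) -> Optional[Any]:
    """Decode pixels and metadata in two independently guarded stages.

    Hypothetical helper: both callables are stand-ins for the real decoder.
    """
    # Stage 1: pixel decoding. A failure here means there is no usable image.
    try:
        image = load_pixels(payload)
    except Exception as err:
        if not ignore_decode_errors:
            raise
        warnings.warn(f"[Image.decode_example] Skipped corrupted image: {err}")
        return None

    # Stage 2: EXIF handling. A failure here keeps the already-decoded image.
    try:
        image = apply_orientation(image)
    except Exception as exif_err:
        if not ignore_decode_errors:
            raise
        warnings.warn(f"[Image.decode_example] Skipped EXIF metadata: {exif_err}")
    return image
```

Keeping the two `try` blocks separate is what lets a metadata failure fall through to `return image` instead of discarding the sample.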

…l image decoding

This commit extends the `ignore_decode_errors=True` behavior in the `Image` feature to separately handle failures in EXIF metadata decoding (e.g., `.getexif()` errors).

What was added:
- `image.getexif()` and EXIF orientation correction (`ImageOps.exif_transpose`) are now wrapped in a separate try/except block.
- If EXIF metadata is malformed (e.g., invalid UTF-8), it will be skipped gracefully *only if* `ignore_decode_errors=True`.
- A warning is logged: `[Image.decode_example] Skipped EXIF metadata: ...`
- The image will still be returned and used if valid.

This change ensures that otherwise-decodable images are not discarded solely due to corrupt metadata.

Issues addressed:
- Closes huggingface#7612 — Enables robust streaming over corrupted image samples
- Fully satisfies huggingface#7632 — Allows casting image columns without halting on invalid data
- Resolves huggingface#7668 — Avoids crash on malformed EXIF while retaining the image

Backward compatibility:
- Existing behavior remains unchanged when `ignore_decode_errors=False` (default)
- Only opt-in users will see this behavior
@ArjunJagdale
Contributor Author

With the recent EXIF decoding isolation logic added, this PR now fully addresses:

All decoding errors (including .getexif() and image file loading) are now skipped with a warning when ignore_decode_errors=True. This enables safe, scalable image preprocessing pipelines.
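
Since failed samples come back as None rather than being removed, downstream code still needs to drop them. A small sketch of that step, assuming examples are plain dicts and the column is named "image" (both are assumptions for illustration, and `iter_valid_examples` is a hypothetical helper, not part of the library):

```python
from typing import Any, Dict, Iterable, Iterator


def iter_valid_examples(
    examples: Iterable[Dict[str, Any]],
    image_column: str = "image",
) -> Iterator[Dict[str, Any]]:
    """Yield only the examples whose image was decoded successfully."""
    for example in examples:
        # A None image marks a sample that was skipped during decoding.
        if example.get(image_column) is None:
            continue
        yield example
```

Because this is a generator, it works the same way over a list or over a streamed iterable without materializing the whole dataset.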

Development

Successfully merging this pull request may close these issues.

Provide an option of robust dataset iterator with error handling
2 participants