Add ignore_decode_errors option to Image feature for robust decoding #7612 #7638
base: main
…uggingface#7612)

This PR implements support for robust image decoding in the `Image` feature, as discussed in issue huggingface#7612.

## 🔧 What was added

- A new boolean field: `ignore_decode_errors` (default: `False`)
- If set to `True`, any exceptions during decoding will be caught, and `None` will be returned instead of raising an error

```python
features = Features({
    "image": Image(decode=True, ignore_decode_errors=True),
})
```

This enables robust iteration over potentially corrupted datasets, which is especially useful when streaming datasets like WebDataset or image-heavy public sets where sample corruption is common.

## 🧪 Behavior

* If `ignore_decode_errors=False` (default), decoding behaves exactly as before
* If `True`, decoding errors are caught, and a warning is emitted:

```
[Image.decode_example] Skipped corrupted image: ...
```

## 🧵 Linked issue

Closes huggingface#7612

Let me know if you'd like a follow-up test PR. Happy to write one!
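For readers skimming the thread, the flag semantics described above can be sketched without any library dependency. This is only an illustration under stated assumptions, not the code in this PR: `decode_or_none`, `iter_valid`, and `fake_decode` are hypothetical names, and the stand-in decoder merely checks for a PNG signature instead of calling Pillow.

```python
import warnings
from typing import Any, Callable, Iterable, Iterator, Optional


def decode_or_none(
    decode_fn: Callable[[bytes], Any],
    payload: bytes,
    ignore_decode_errors: bool = False,
) -> Optional[Any]:
    """Hypothetical helper mirroring the described flag semantics."""
    try:
        return decode_fn(payload)
    except Exception as err:
        if not ignore_decode_errors:
            raise  # default: the error propagates exactly as before
        warnings.warn(f"[Image.decode_example] Skipped corrupted image: {err}")
        return None


def iter_valid(samples: Iterable[dict], key: str = "image") -> Iterator[dict]:
    """Yield only the samples whose decoded value is not None."""
    for sample in samples:
        if sample[key] is not None:
            yield sample


def fake_decode(payload: bytes) -> str:
    # Stand-in decoder (assumption): rejects anything without a PNG signature.
    if not payload.startswith(b"\x89PNG"):
        raise ValueError("not a PNG")
    return f"<image {len(payload)} bytes>"


raw = [b"\x89PNG....", b"garbage", b"\x89PNG.."]
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # keep the demo output quiet
    decoded = [
        {"image": decode_or_none(fake_decode, r, ignore_decode_errors=True)}
        for r in raw
    ]
kept = list(iter_valid(decoded))
```

Because a failed decode yields `None` rather than dropping the row, downstream code still has to filter or otherwise handle `None` values, as `iter_valid` does here.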
cc @lhoestq
I think splitting the error handling for the main image decoding process and the metadata decoding process is possibly a bit nicer, as some images do render correctly, but their metadata might be invalid and cause the pipeline to fail, which I've encountered recently as in #7668. The |
@lhoestq & @Seas0, that makes total sense. Currently, if EXIF metadata like […]

To address this, I'm planning to split the EXIF handling into a separate `try`/`except` block:

```python
try:
    exif = image.getexif()
    if exif.get(PIL.Image.ExifTags.Base.Orientation) is not None:
        image = PIL.ImageOps.exif_transpose(image)
except Exception as exif_err:
    if self.ignore_decode_errors:
        warnings.warn(f"[Image.decode_example] Skipped EXIF metadata: {exif_err}")
    else:
        raise
```

That way, valid images with broken EXIF data will still be returned, and EXIF failures will be skipped only if `ignore_decode_errors=True`. Sounds good?
…l image decoding

This commit extends the `ignore_decode_errors=True` behavior in the `Image` feature to separately handle failures in EXIF metadata decoding (e.g., `.getexif()` errors).

What was added:

- `image.getexif()` and EXIF orientation correction (`ImageOps.exif_transpose`) are now wrapped in a separate try/except block.
- If EXIF metadata is malformed (e.g., invalid UTF-8), it will be skipped gracefully *only if* `ignore_decode_errors=True`.
- A warning is logged: `[Image.decode_example] Skipped EXIF metadata: ...`
- The image will still be returned and used if valid.

This change ensures that otherwise-decodable images are not discarded solely due to corrupt metadata.

Issues addressed:

- Closes huggingface#7612: enables robust streaming over corrupted image samples
- Fully satisfies huggingface#7632: allows casting image columns without halting on invalid data
- Resolves huggingface#7668: avoids crash on malformed EXIF while retaining the image

Backward compatibility:

- Existing behavior remains unchanged when `ignore_decode_errors=False` (default)
- Only opt-in users will see this behavior
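The two-stage behavior this commit describes can be sketched roughly as follows. This is a dependency-free illustration under stated assumptions rather than the actual diff: `decode_two_stage` and the stand-in callables `decode_pixels`, `good_metadata`, and `bad_metadata` are hypothetical, and in the real feature the two stages would correspond to opening the image with Pillow and applying the EXIF orientation.

```python
import warnings
from typing import Any, Callable, Optional


def decode_two_stage(
    payload: bytes,
    decode_pixels: Callable[[bytes], Any],
    apply_metadata: Callable[[Any], Any],
    ignore_decode_errors: bool = False,
) -> Optional[Any]:
    """Hypothetical sketch: pixel and metadata failures are handled separately."""
    # Stage 1: the image itself. A failure here means there is nothing to return.
    try:
        image = decode_pixels(payload)
    except Exception as err:
        if not ignore_decode_errors:
            raise
        warnings.warn(f"[Image.decode_example] Skipped corrupted image: {err}")
        return None

    # Stage 2: metadata. A failure here keeps the already decoded image.
    try:
        image = apply_metadata(image)
    except Exception as exif_err:
        if not ignore_decode_errors:
            raise
        warnings.warn(f"[Image.decode_example] Skipped EXIF metadata: {exif_err}")
    return image


def decode_pixels(payload: bytes) -> str:
    # Stand-in for opening the image (assumption, not a Pillow call).
    if not payload:
        raise ValueError("empty payload")
    return "pixels"


def good_metadata(image: str) -> str:
    # Stand-in for a successful orientation fix.
    return image + "+transposed"


def bad_metadata(image: str) -> str:
    # Stand-in for malformed EXIF data.
    raise UnicodeDecodeError("utf-8", b"\xff", 0, 1, "invalid start byte")


with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # keep the demo output quiet
    ok = decode_two_stage(b"x", decode_pixels, good_metadata, ignore_decode_errors=True)
    exif_broken = decode_two_stage(b"x", decode_pixels, bad_metadata, ignore_decode_errors=True)
    corrupted = decode_two_stage(b"", decode_pixels, good_metadata, ignore_decode_errors=True)
```

With the flag enabled, only `corrupted` ends up as `None`; `exif_broken` keeps its pixels and simply skips the orientation step. With the flag left at its default, both failure cases raise.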
With the recent EXIF decoding isolation logic added, this PR now fully addresses:
All decoding errors (including |