Open
Description
Describe the bug
When decoding this image from the ImageNet1K dataset, datasets
crashes the whole training process because it cannot parse an invalid EXIF tag.
Steps to reproduce the bug
Decoding the aforementioned image with the datasets.Image.decode_example
method reproduces the bug.
The decoding function raises an unhandled exception at the image.getexif()
call because the image metadata contains an invalid UTF-8 byte sequence (per the traceback below, the failure occurs while decoding the XMP data).
File lib/python3.12/site-packages/datasets/features/image.py:188, in Image.decode_example(self, value, token_per_repo_id)
186 image = PIL.Image.open(BytesIO(bytes_))
187 image.load() # to avoid "Too many open files" errors
--> 188 if image.getexif().get(PIL.Image.ExifTags.Base.Orientation) is not None:
189 image = PIL.ImageOps.exif_transpose(image)
190 if self.mode and self.mode != image.mode:
File lib/python3.12/site-packages/PIL/Image.py:1542, in Image.getexif(self)
1540 xmp_tags = self.info.get("XML:com.adobe.xmp")
1541 if not xmp_tags and (xmp_tags := self.info.get("xmp")):
-> 1542 xmp_tags = xmp_tags.decode("utf-8")
1543 if xmp_tags:
1544 match = re.search(r'tiff:Orientation(="|>)([0-9])', xmp_tags)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa8 in position 4312: invalid start byte
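The failure mode in the traceback can be illustrated without the original image. The following is a minimal sketch using only the standard library; the `xmp_bytes` payload is a made-up stand-in (not the actual ImageNet file contents), and the regular expression is the one shown in the PIL frame above. It shows that a strict UTF-8 decode fails on a stray `0xa8` byte, while a lenient decode still lets the orientation be read.

```python
import re

# Hypothetical XMP payload (an assumption for illustration): valid markup
# followed by a single stray byte, 0xa8, that is not valid UTF-8.
xmp_bytes = b'<rdf:Description tiff:Orientation="6"/>\xa8'

# Strict decoding, as in the traceback, raises UnicodeDecodeError.
try:
    xmp_bytes.decode("utf-8")
    strict_ok = True
    strict_error = ""
except UnicodeDecodeError as err:
    strict_ok = False
    strict_error = str(err)

# Lenient decoding substitutes the bad byte instead of raising.
lenient = xmp_bytes.decode("utf-8", errors="replace")

# Same pattern as the PIL frame in the traceback above.
match = re.search(r'tiff:Orientation(="|>)([0-9])', lenient)
orientation = int(match[2]) if match else None
```

Whether a lenient decode is the right fix upstream is a separate question; this only demonstrates that the orientation value itself can remain readable when unrelated bytes in the metadata are corrupt.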
Expected behavior
The invalid EXIF tag should simply be ignored, or a warning should be issued, instead of crashing the whole program.
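One possible shape for this behavior is sketched below. The helper name `read_orientation_or_none` is hypothetical and is not part of datasets or Pillow; it assumes only that the image object exposes a `getexif()` method returning a mapping, as in the traceback. The tag ID 0x0112 is the standard EXIF Orientation tag.

```python
import warnings

ORIENTATION_TAG = 0x0112  # standard EXIF Orientation tag ID (274)


def read_orientation_or_none(image):
    """Return the EXIF orientation of `image`, or None if metadata is unreadable.

    Hypothetical helper: catches the UnicodeDecodeError seen in the traceback
    and downgrades it to a warning instead of letting it propagate.
    """
    try:
        return image.getexif().get(ORIENTATION_TAG)
    except UnicodeDecodeError as err:
        warnings.warn(f"Ignoring undecodable image metadata: {err}")
        return None
```

With a guard like this, a caller would skip the transpose step when the function returns None, so a single corrupt file would produce a warning rather than stop a training run.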
Environment info
- datasets version: 3.6.0
- Platform: Linux-6.5.0-18-generic-x86_64-with-glibc2.35
- Python version: 3.12.11
- huggingface_hub version: 0.33.0
- PyArrow version: 20.0.0
- Pandas version: 2.3.0
- fsspec version: 2025.3.0