Broken EXIF crashes the whole program #7668

@Seas0

Describe the bug

When parsing this image from the ImageNet1K dataset, datasets crashes the whole training process just because it is unable to parse an invalid EXIF tag.
Image

Steps to reproduce the bug

Decoding the aforementioned image with the datasets.Image.decode_example method reproduces the bug.
The decoding function throws an unhandled exception at the image.getexif() call because of an invalid UTF-8 byte sequence in the image metadata. According to the traceback, the failing step is the UTF-8 decoding of the XMP data inside getexif().

File lib/python3.12/site-packages/datasets/features/image.py:188, in Image.decode_example(self, value, token_per_repo_id)
    186     image = PIL.Image.open(BytesIO(bytes_))
    187 image.load()  # to avoid "Too many open files" errors
--> 188 if image.getexif().get(PIL.Image.ExifTags.Base.Orientation) is not None:
    189     image = PIL.ImageOps.exif_transpose(image)
    190 if self.mode and self.mode != image.mode:

File lib/python3.12/site-packages/PIL/Image.py:1542, in Image.getexif(self)
   1540 xmp_tags = self.info.get("XML:com.adobe.xmp")
   1541 if not xmp_tags and (xmp_tags := self.info.get("xmp")):
-> 1542     xmp_tags = xmp_tags.decode("utf-8")
   1543 if xmp_tags:
   1544     match = re.search(r'tiff:Orientation(="|>)([0-9])', xmp_tags)

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa8 in position 4312: invalid start byte
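
To illustrate the failure mode, here is a minimal sketch using only the standard library. It reuses the regular expression shown in the traceback (PIL/Image.py line 1544) and decodes the XMP bytes with errors="replace" instead of the strict default. The helper name xmp_orientation and the sample payload are hypothetical and exist only for this example; this is not how Pillow or datasets handle the case.

```python
import re

# Same pattern that appears in the traceback (PIL/Image.py line 1544).
ORIENTATION_RE = re.compile(r'tiff:Orientation(="|>)([0-9])')


def xmp_orientation(xmp_bytes):
    """Hypothetical helper: read the orientation digit from raw XMP bytes.

    Invalid UTF-8 bytes are replaced with U+FFFD instead of raising,
    so one bad byte does not abort the whole lookup.
    """
    text = xmp_bytes.decode("utf-8", errors="replace")
    match = ORIENTATION_RE.search(text)
    return int(match.group(2)) if match else None


# Made-up payload containing a stray 0xa8 byte, like the one in the report.
bad_xmp = b'<rdf:Description tiff:Orientation="6">\xa8</rdf:Description>'

# Strict decoding fails the same way as in the traceback.
try:
    bad_xmp.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

With the tolerant decode, the orientation can still be read from the rest of the payload even though the strict decode raises.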

Expected behavior

An invalid EXIF tag should simply be ignored, or should trigger a warning message, instead of crashing the whole program.

Environment info

  • datasets version: 3.6.0
  • Platform: Linux-6.5.0-18-generic-x86_64-with-glibc2.35
  • Python version: 3.12.11
  • huggingface_hub version: 0.33.0
  • PyArrow version: 20.0.0
  • Pandas version: 2.3.0
  • fsspec version: 2025.3.0
