email date format flexibility #4072


Merged
merged 19 commits into from
Aug 13, 2025
Conversation

potter-potter
Contributor

We are seeing some .eml files come through the VLM partitioner, which then downgrades to hi-res, I believe.

For some reason they have a date format that is not the standard email format, but it is still legitimate.

This PR uses a more robust date package (dateutil) to parse the date. The package is already installed.

Contributor

@Copilot Copilot AI left a comment


Pull Request Overview

This PR improves email date parsing flexibility in the email partitioner to handle non-standard date formats that occur in .eml files, preventing downgrades to hi-res partitioning due to parsing failures.

  • Replaces email.utils.parsedate_to_datetime() with dateutil.parser.parse() for more robust date parsing
  • Adds exception handling to gracefully fall back when date parsing fails
  • Includes test coverage for both ISO-8601 and RFC-2822 date formats
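To illustrate the first bullet, here is a minimal sketch, using only the standard library, of how the previous call behaves on the two date formats the tests cover. The helper name `stdlib_parse` is hypothetical and not part of this PR.

```python
from email.utils import parsedate_to_datetime


def stdlib_parse(date_str):
    """Return an ISO string, or None if the stdlib parser rejects the value."""
    try:
        return parsedate_to_datetime(date_str).isoformat()
    except (TypeError, ValueError):
        # Recent Python versions raise ValueError for an unparseable date;
        # older versions raised TypeError instead.
        return None


# RFC-2822 style value: accepted by the stdlib parser.
rfc2822_result = stdlib_parse("Tue, 29 Jul 2025 12:42:06 +0000")

# ISO-8601 style value: rejected by the stdlib parser.
iso8601_result = stdlib_parse("2025-07-29T12:42:06.000Z")

print(rfc2822_result)
print(iso8601_result)
```

The second call is the case this PR targets: the value is a legitimate timestamp, but the RFC-2822-only parser cannot read it.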

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| unstructured/partition/email.py | Updates date parsing logic to use dateutil.parser with fallback handling |
| test_unstructured/partition/test_email.py | Adds parameterized tests for various date formats |
| example-docs/eml/test-iso-8601-date.eml | Test file with ISO-8601 date format |
| example-docs/eml/test-rfc2822-date.eml | Test file with RFC-2822 date format |
| CHANGELOG.md | Documents the enhancement |
| unstructured/version.py | Version bump to 0.18.13 |

         self._attachment = attachment
         self._ctx = ctx

     @classmethod
     def iter_elements(
-        cls, attachment: EmailMessage, ctx: EmailPartitioningContext
+        cls, attachment: MIMEPart, ctx: EmailPartitioningContext

Copilot AI Aug 8, 2025


The type annotation change from EmailMessage to MIMEPart appears unrelated to the date parsing enhancement and could introduce breaking changes. This change should be verified to ensure it doesn't break existing functionality or should be moved to a separate PR.



Copilot AI Aug 8, 2025


This type annotation change from EmailMessage to MIMEPart is inconsistent with the PR's stated purpose of improving date parsing and could introduce breaking changes. The change should be justified or moved to a separate PR.


potter-potter marked this pull request as ready for review August 9, 2025 00:06
@potter-potter
Contributor Author

@claude please review


claude bot commented Aug 9, 2025

Claude finished @potter-potter's task.


PR Review: Email Date Format Flexibility

Todo List:

  • Read and analyze the main implementation in unstructured/partition/email.py
  • Review test changes in test_unstructured/partition/test_email.py
  • Check the example .eml test files
  • Review changelog and version changes
  • Provide comprehensive feedback on code quality, security, and best practices

Summary

This PR successfully addresses the issue of email date parsing failures when processing .eml files with non-standard date formats. The change from email.utils.parsedate_to_datetime() to dateutil.parser.parse() is well-implemented with proper error handling.

Code Review Analysis

Main Implementation Changes (unstructured/partition/email.py:284-289):

try:
    sent_date = parser.parse(date_str)
except (parser.ParserError, TypeError, ValueError):
    return None

Key Benefits:

  • Robust parsing: dateutil.parser handles many more date formats including ISO-8601 with Z suffixes
  • Graceful fallback: Proper exception handling prevents crashes
  • Maintains behavior: Still normalizes to UTC and returns ISO format string
  • Good test coverage: Tests both ISO-8601 and RFC-2822 formats
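A minimal sketch of the behavior these bullets describe, assuming python-dateutil is installed. `normalize_sent_date` is a hypothetical name rather than the function in email.py, and treating a date without timezone information as UTC is an assumption of this sketch, not something the PR states.

```python
from datetime import timezone
from typing import Optional

from dateutil import parser


def normalize_sent_date(date_str: Optional[str]) -> Optional[str]:
    """Parse a Date header value and return it as a UTC ISO-8601 string.

    Returns None when the value cannot be parsed (the fallback route).
    """
    try:
        sent_date = parser.parse(date_str)
    except (parser.ParserError, TypeError, ValueError):
        # Malformed string, or no Date header at all (None).
        return None

    if sent_date.tzinfo is None:
        # Assumption for this sketch: a naive datetime is taken to be UTC.
        sent_date = sent_date.replace(tzinfo=timezone.utc)

    return sent_date.astimezone(timezone.utc).isoformat()


# Both formats covered by the PR's tests normalize to the same string.
iso_result = normalize_sent_date("2025-07-29T12:42:06.000Z")
rfc_result = normalize_sent_date("Tue, 29 Jul 2025 12:42:06 +0000")
print(iso_result, rfc_result)
```

Returning None instead of raising keeps a bad Date header from failing the whole partition; the caller can simply omit the sent date from the metadata.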

Technical Assessment

Security: ✅ No concerns - uses established library safely
Performance: ✅ Minimal impact - dateutil.parser is efficient
Compatibility: ✅ Maintains existing API and return format
Error Handling: ✅ Comprehensive exception catching

Test Quality

The parameterized tests effectively cover:

  • ISO-8601 format: 2025-07-29T12:42:06.000Z
  • RFC-2822 format: Tue, 29 Jul 2025 12:42:06 +0000
  • Both normalize correctly to: 2025-07-29T12:42:06+00:00

Minor Recommendations

  1. Import organization: Consider grouping the dateutil import with other third-party imports
  2. Edge case testing: Could add tests for malformed dates that should fail gracefully

Overall Assessment

✅ APPROVED - This is a solid enhancement that solves a real problem. The implementation is correct, well-tested, and maintains backward compatibility while adding robustness for non-standard date formats.

The approach of using dateutil.parser is the right solution for handling varied email date formats in the wild. The change prevents VLM partitioner downgrades caused by date parsing failures.


try:
    sent_date = parser.parse(date_str)
except (parser.ParserError, TypeError, ValueError):
    return None
Collaborator


maybe add a test for this route?
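One possible shape for such a test, as a sketch only. `parse_or_none` is a hypothetical stand-in that mirrors the try/except above (it is not the function in email.py), and the malformed inputs are illustrative. It assumes a python-dateutil version that exposes `parser.ParserError`.

```python
from dateutil import parser


def parse_or_none(date_str):
    """Hypothetical stand-in mirroring the try/except under review."""
    try:
        return parser.parse(date_str)
    except (parser.ParserError, TypeError, ValueError):
        return None


# Illustrative malformed Date header values; each should take the fallback route.
# None models a message with no Date header, which raises TypeError in dateutil.
MALFORMED_DATES = ["", "not a date", None]


def check_malformed_dates_fall_back():
    for value in MALFORMED_DATES:
        assert parse_or_none(value) is None, repr(value)


check_malformed_dates_fall_back()
```

In the real test suite the same inputs could be fed through the project's existing parameterized test style instead of a plain loop.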

potter-potter added this pull request to the merge queue Aug 13, 2025
Merged via the queue into main with commit 0d20f6a Aug 13, 2025
36 checks passed
potter-potter deleted the potter/email-date-format branch August 13, 2025 19:34