-
Notifications
You must be signed in to change notification settings - Fork 1k
email date format flexibility #4072
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR improves email date parsing flexibility in the email partitioner to handle non-standard date formats that occur in .eml files, preventing downgrades to hi-res partitioning due to parsing failures.
- Replaces
email.utils.parsedate_to_datetime()
withdateutil.parser.parse()
for more robust date parsing - Adds exception handling to gracefully fall back when date parsing fails
- Includes test coverage for both ISO-8601 and RFC-2822 date formats
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
Show a summary per file
File | Description |
---|---|
unstructured/partition/email.py | Updates date parsing logic to use dateutil.parser with fallback handling |
test_unstructured/partition/test_email.py | Adds parameterized tests for various date formats |
example-docs/eml/test-iso-8601-date.eml | Test file with ISO-8601 date format |
example-docs/eml/test-rfc2822-date.eml | Test file with RFC-2822 date format |
CHANGELOG.md | Documents the enhancement |
unstructured/version.py | Version bump to 0.18.13 |
unstructured/partition/email.py
Outdated
self._attachment = attachment | ||
self._ctx = ctx | ||
|
||
@classmethod | ||
def iter_elements( | ||
cls, attachment: EmailMessage, ctx: EmailPartitioningContext | ||
cls, attachment: MIMEPart, ctx: EmailPartitioningContext |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The type annotation change from EmailMessage
to MIMEPart
appears unrelated to the date parsing enhancement and could introduce breaking changes. This change should be verified to ensure it doesn't break existing functionality or should be moved to a separate PR.
Copilot uses AI. Check for mistakes.
unstructured/partition/email.py
Outdated
self._attachment = attachment | ||
self._ctx = ctx | ||
|
||
@classmethod | ||
def iter_elements( | ||
cls, attachment: EmailMessage, ctx: EmailPartitioningContext | ||
cls, attachment: MIMEPart, ctx: EmailPartitioningContext |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This type annotation change from EmailMessage
to MIMEPart
is inconsistent with the PR's stated purpose of improving date parsing and could introduce breaking changes. The change should be justified or moved to a separate PR.
Copilot uses AI. Check for mistakes.
@claude please review |
Claude finished @potter-potter's task —— View job PR Review: Email Date Format FlexibilityTodo List:
SummaryThis PR successfully addresses the issue of email date parsing failures when processing .eml files with non-standard date formats. The change from Code Review AnalysisMain Implementation Changes (unstructured/partition/email.py:284-289): try:
sent_date = parser.parse(date_str)
except (parser.ParserError, TypeError, ValueError):
return None Key Benefits:
Technical AssessmentSecurity: ✅ No concerns - uses established library safely Test QualityThe parameterized tests effectively cover:
Minor Recommendations
Overall Assessment✅ APPROVED - This is a solid enhancement that solves a real problem. The implementation is correct, well-tested, and maintains backward compatibility while adding robustness for non-standard date formats. The approach of using |
try: | ||
sent_date = parser.parse(date_str) | ||
except (parser.ParserError, TypeError, ValueError): | ||
return None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe add a test for this route?
This pull request includes updated ingest test fixtures. Please review and merge if appropriate. Co-authored-by: potter-potter <[email protected]>
This pull request includes updated ingest test fixtures. Please review and merge if appropriate. Co-authored-by: potter-potter <[email protected]>
we are seeing some .eml files come through the VLM partitioner. Which then downgrades to hi-res i believe.
For some reason they have a date format that is not standard email format. But it is still legitimate.
This uses a more robust date package to parse the date. This package is already installed.