Commit 0d20f6a

email date format flexibility (#4072)
We are seeing some .eml files come through the VLM partitioner, which then downgrades to hi-res, I believe. For some reason they have a date format that is not the standard email format, but it is still legitimate. This change uses a more robust date package to parse the date. This package is already installed.

---------

Co-authored-by: ryannikolaidis <[email protected]>
Co-authored-by: potter-potter <[email protected]>
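
The commit message does not name the date package it switches to, so the following is only a minimal sketch of the general idea, not the code from this commit. It assumes a hypothetical helper `parse_email_date` and uses only the Python standard library: try the standard RFC 2822 parser first, fall back to ISO-8601 (including a "Z" suffix), and return `None` when neither works. Treating a date without a timezone as UTC is also an assumption of this sketch.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_email_date(value: str) -> Optional[str]:
    """Hypothetical helper: return an ISO-8601 string for a Date header, or None."""
    text = value.strip()

    # Standard email format first, e.g. "Tue, 29 Jul 2025 12:42:06 +0000".
    # Depending on the Python version, a parse failure raises TypeError or ValueError.
    try:
        dt = parsedate_to_datetime(text)
    except (TypeError, ValueError):
        dt = None

    if dt is None:
        # Fall back to ISO-8601. Older Python versions do not accept a trailing "Z"
        # in datetime.fromisoformat, so rewrite it as an explicit UTC offset.
        if text.endswith("Z"):
            text = text[:-1] + "+00:00"
        try:
            dt = datetime.fromisoformat(text)
        except ValueError:
            return None  # neither format matched

    # Assumption of this sketch: a date without timezone information is UTC.
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.isoformat()


if __name__ == "__main__":
    for header in (
        "Tue, 29 Jul 2025 12:42:06 +0000",
        "2025-07-29T12:42:06.000Z",
        "INVALID-DATE-FORMAT",
    ):
        print(repr(header), "->", parse_email_date(header))
```

With the three `Date` values used by the test fixtures in this commit, the first two both normalize to `2025-07-29T12:42:06+00:00` and the invalid one yields `None`, which mirrors what the new tests assert.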
1 parent b8c14a7 commit 0d20f6a

20 files changed: +99 -5 lines changed

CHANGELOG.md

Lines changed: 10 additions & 0 deletions
@@ -1,3 +1,13 @@
+## 0.18.13
+
+### Enhancements
+
+### Features
+
+### Fixes
+
+- **Parse a wider variety of date formats in email headers** The `partition_email` function is now more robust to non-standard date formats, including ISO-8601 dates with "Z" suffixes. This prevents `ValueError` exceptions when partitioning emails with these date formats.
+
 ## 0.18.12

 ### Enhancements

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
+Date: INVALID-DATE-FORMAT
+
+
+Subject: Test invalid date format
+
+This is a test-email with an invalid date format.

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
+Date: 2025-07-29T12:42:06.000Z
+
+
+Subject: Test a Z-suffix date
+
+This is a test-email.

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
+Date: Tue, 29 Jul 2025 12:42:06 +0000
+
+
+Subject: Test a standard RFC-2822 date
+
+This is a test-email.

scripts/docker-smoke-test.sh

Lines changed: 1 addition & 0 deletions
@@ -3,6 +3,7 @@
 # Start the containerized repository and run ingest tests

 # shellcheck disable=SC2317 # Shellcheck complains that trap functions are unreachable...
+# shellcheck disable=SC2329 # Functions are invoked indirectly

 set -eux -o pipefail

test_unstructured/partition/test_email.py

Lines changed: 20 additions & 1 deletion
@@ -521,10 +521,29 @@ def it_uses_the_metadata_last_modified_arg_value_when_one_was_provided(self):
         ctx = EmailPartitioningContext(metadata_last_modified=metadata_last_modified)
         assert ctx.metadata_last_modified == metadata_last_modified

-    def and_it_uses_the_msg_Date_header_date_when_metadata_last_modified_was_not_provided(self):
+    def and_it_uses_the_msg_Date_header_date_when_metadata_last_modified_was_not_provided(
+        self,
+    ):
         ctx = EmailPartitioningContext(example_doc_path("eml/simple-rfc-822.eml"))
         assert ctx.metadata_last_modified == "2024-10-01T17:34:56+00:00"

+    @pytest.mark.parametrize(
+        ("date_format", "expected_date"),
+        [
+            ("test-iso-8601-date.eml", "2025-07-29T12:42:06+00:00"),
+            ("test-rfc2822-date.eml", "2025-07-29T12:42:06+00:00"),
+        ],
+    )
+    def and_it_correctly_parses_various_date_formats_like_the_ones_that_occur_in_the_wild(
+        self, date_format: str, expected_date: str
+    ):
+        ctx = EmailPartitioningContext(example_doc_path(f"eml/{date_format}"))
+        assert ctx.metadata_last_modified == expected_date
+
+    def and_it_returns_none_when_date_header_is_invalid(self):
+        ctx = EmailPartitioningContext(example_doc_path("eml/test-invalid-date.eml"))
+        assert ctx._sent_date is None
+
     def and_it_falls_back_to_filesystem_last_modified_when_no_Date_header_is_present(
         self, get_last_modified_date_: Mock
     ):

test_unstructured_ingest/src/azure.sh

Lines changed: 2 additions & 0 deletions
@@ -1,5 +1,7 @@
 #!/usr/bin/env bash

+# shellcheck disable=SC2329 # Functions are invoked indirectly
+
 set -e

 SRC_PATH=$(dirname "$(realpath "$0")")

test_unstructured_ingest/src/google-drive.sh

Lines changed: 2 additions & 0 deletions
@@ -1,5 +1,7 @@
 #!/usr/bin/env bash

+# shellcheck disable=SC2329 # Functions are invoked indirectly
+
 set -e

 SRC_PATH=$(dirname "$(realpath "$0")")

test_unstructured_ingest/src/kafka-local.sh

Lines changed: 2 additions & 0 deletions
@@ -1,5 +1,7 @@
 #!/usr/bin/env bash

+# shellcheck disable=SC2329 # Functions are invoked indirectly
+
 set -e

 SRC_PATH=$(dirname "$(realpath "$0")")

test_unstructured_ingest/src/local-single-file-basic-chunking.sh

Lines changed: 2 additions & 0 deletions
@@ -1,5 +1,7 @@
 #!/usr/bin/env bash

+# shellcheck disable=SC2329 # Functions are invoked indirectly
+
 set -e

 SRC_PATH=$(dirname "$(realpath "$0")")
