email date format flexibility #4072

Merged
merged 19 commits into from
Aug 13, 2025
10 changes: 10 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,13 @@
## 0.18.13

### Enhancements

### Features

### Fixes

- **Parse a wider variety of date formats in email headers** The `partition_email` function is now more robust to non-standard date formats, including ISO-8601 dates with "Z" suffixes. This prevents `ValueError` exceptions when partitioning emails with these date formats.

## 0.18.12

### Enhancements
6 changes: 6 additions & 0 deletions example-docs/eml/test-invalid-date.eml
@@ -0,0 +1,6 @@
Date: INVALID-DATE-FORMAT
From: [email protected]
To: [email protected]
Subject: Test invalid date format

This is a test-email with an invalid date format.
6 changes: 6 additions & 0 deletions example-docs/eml/test-iso-8601-date.eml
@@ -0,0 +1,6 @@
Date: 2025-07-29T12:42:06.000Z
From: [email protected]
To: [email protected]
Subject: Test a Z-suffix date

This is a test-email.
6 changes: 6 additions & 0 deletions example-docs/eml/test-rfc2822-date.eml
@@ -0,0 +1,6 @@
Date: Tue, 29 Jul 2025 12:42:06 +0000
From: [email protected]
To: [email protected]
Subject: Test a standard RFC-2822 date

This is a test-email.
1 change: 1 addition & 0 deletions scripts/docker-smoke-test.sh
@@ -3,6 +3,7 @@
# Start the containerized repository and run ingest tests

# shellcheck disable=SC2317 # Shellcheck complains that trap functions are unreachable...
# shellcheck disable=SC2329 # Functions are invoked indirectly

set -eux -o pipefail

21 changes: 20 additions & 1 deletion test_unstructured/partition/test_email.py
@@ -521,10 +521,29 @@ def it_uses_the_metadata_last_modified_arg_value_when_one_was_provided(self):
ctx = EmailPartitioningContext(metadata_last_modified=metadata_last_modified)
assert ctx.metadata_last_modified == metadata_last_modified

-def and_it_uses_the_msg_Date_header_date_when_metadata_last_modified_was_not_provided(self):
+def and_it_uses_the_msg_Date_header_date_when_metadata_last_modified_was_not_provided(
+self,
+):
ctx = EmailPartitioningContext(example_doc_path("eml/simple-rfc-822.eml"))
assert ctx.metadata_last_modified == "2024-10-01T17:34:56+00:00"

@pytest.mark.parametrize(
("date_format", "expected_date"),
[
("test-iso-8601-date.eml", "2025-07-29T12:42:06+00:00"),
("test-rfc2822-date.eml", "2025-07-29T12:42:06+00:00"),
],
)
def and_it_correctly_parses_various_date_formats_like_the_ones_that_occur_in_the_wild(
self, date_format: str, expected_date: str
):
ctx = EmailPartitioningContext(example_doc_path(f"eml/{date_format}"))
assert ctx.metadata_last_modified == expected_date

def and_it_returns_none_when_date_header_is_invalid(self):
ctx = EmailPartitioningContext(example_doc_path("eml/test-invalid-date.eml"))
assert ctx._sent_date is None

def and_it_falls_back_to_filesystem_last_modified_when_no_Date_header_is_present(
self, get_last_modified_date_: Mock
):
2 changes: 2 additions & 0 deletions test_unstructured_ingest/src/azure.sh
@@ -1,5 +1,7 @@
#!/usr/bin/env bash

# shellcheck disable=SC2329 # Functions are invoked indirectly

set -e

SRC_PATH=$(dirname "$(realpath "$0")")
2 changes: 2 additions & 0 deletions test_unstructured_ingest/src/google-drive.sh
@@ -1,5 +1,7 @@
#!/usr/bin/env bash

# shellcheck disable=SC2329 # Functions are invoked indirectly

set -e

SRC_PATH=$(dirname "$(realpath "$0")")
2 changes: 2 additions & 0 deletions test_unstructured_ingest/src/kafka-local.sh
@@ -1,5 +1,7 @@
#!/usr/bin/env bash

# shellcheck disable=SC2329 # Functions are invoked indirectly

set -e

SRC_PATH=$(dirname "$(realpath "$0")")
@@ -1,5 +1,7 @@
#!/usr/bin/env bash

# shellcheck disable=SC2329 # Functions are invoked indirectly

set -e

SRC_PATH=$(dirname "$(realpath "$0")")
@@ -6,6 +6,8 @@
# option which otherwise has no other coverage.
# ------------------------------------------------------------------------------------------------

# shellcheck disable=SC2329 # Functions are invoked indirectly

set -e

# -- Test Parameters: These vary by test file, others are common computed values --
@@ -1,5 +1,7 @@
#!/usr/bin/env bash

# shellcheck disable=SC2329 # Functions are invoked indirectly

set -e

SRC_PATH=$(dirname "$(realpath "$0")")
@@ -1,5 +1,7 @@
#!/usr/bin/env bash

# shellcheck disable=SC2329 # Functions are invoked indirectly

set -e

SRC_PATH=$(dirname "$(realpath "$0")")
2 changes: 2 additions & 0 deletions test_unstructured_ingest/src/local-single-file.sh
@@ -1,5 +1,7 @@
#!/usr/bin/env bash

# shellcheck disable=SC2329 # Functions are invoked indirectly

set -e

SRC_PATH=$(dirname "$(realpath "$0")")
17 changes: 17 additions & 0 deletions test_unstructured_ingest/src/pdf-fast-reprocess.sh
@@ -43,4 +43,21 @@ PYTHONPATH=${PYTHONPATH:-.} "$RUN_SCRIPT" \
local \
--output-dir "$OUTPUT_DIR"

# Flatten outputs so paths match fixtures. New behavior for downloads in unstructured-ingest is to create a nested directory structure.
mkdir -p "$OUTPUT_DIR/azure"
find "$OUTPUT_DIR/azure" -type f -name '*.json' -path '*/unstructured_*/*' -print0 | while IFS= read -r -d '' f; do
mv "$f" "$OUTPUT_DIR/azure/$(basename "$f")"
done
find "$OUTPUT_DIR/azure" -type d -name 'unstructured_*' -exec rm -rf {} +

# Normalize record_locator.path to drop unstructured_* in the download path
python3 - "$OUTPUT_DIR/azure" <<'PY'
import re, sys, pathlib
root = pathlib.Path(sys.argv[1])
for p in root.rglob('*.json'):
s = p.read_text()
s2 = re.sub(r'(/download/azure)/unstructured_[^/]+/', r'\1/', s)
if s2 != s:
p.write_text(s2)
PY
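The effect of the `re.sub` normalization in the heredoc above can be checked in isolation. This is only a small sketch: the pattern is copied from the script, but the sample JSON string and the `unstructured_abc123` directory name are hypothetical values invented for illustration.

```python
import re

# Pattern copied from the script: it drops the generated `unstructured_*`
# segment that follows `/download/azure` in a path.
PATTERN = r'(/download/azure)/unstructured_[^/]+/'

# Hypothetical record_locator path, invented for this example.
sample = '{"record_locator": {"path": "/tmp/download/azure/unstructured_abc123/docs/a.pdf"}}'

# The first capture group is kept, so only the nested directory disappears.
normalized = re.sub(PATTERN, r'\1/', sample)
print(normalized)
```

Running the substitution a second time leaves the string unchanged, which is consistent with the script rewriting a file only when `s2 != s`.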
"$SCRIPT_DIR"/check-diff-expected-output.sh $OUTPUT_FOLDER_NAME
7 changes: 5 additions & 2 deletions test_unstructured_ingest/src/s3-minio.sh
@@ -1,5 +1,7 @@
#!/usr/bin/env bash

# shellcheck disable=SC2329 # Functions are invoked indirectly

set -e

SRC_PATH=$(dirname "$(realpath "$0")")
@@ -33,8 +35,7 @@ scripts/minio-test-helpers/create-and-check-minio.sh
wait

RUN_SCRIPT=${RUN_SCRIPT:-unstructured-ingest}
-AWS_SECRET_ACCESS_KEY=$secret_key AWS_ACCESS_KEY_ID=$access_key \
-PYTHONPATH=${PYTHONPATH:-.} "$RUN_SCRIPT" \
+PYTHONPATH=${PYTHONPATH:-.} "$RUN_SCRIPT" \
s3 \
--num-processes "$max_processes" \
--download-dir "$DOWNLOAD_DIR" \
@@ -45,6 +46,8 @@ AWS_SECRET_ACCESS_KEY=$secret_key AWS_ACCESS_KEY_ID=$access_key \
--verbose \
--remote-url s3://utic-dev-tech-fixtures/ \
--endpoint-url http://localhost:9000 \
--key "$access_key" \
--secret "$secret_key" \
--work-dir "$WORK_DIR" \
local \
--output-dir "$OUTPUT_DIR"
2 changes: 2 additions & 0 deletions test_unstructured_ingest/src/s3.sh
@@ -1,5 +1,7 @@
#!/usr/bin/env bash

# shellcheck disable=SC2329 # Functions are invoked indirectly

set -e

SRC_PATH=$(dirname "$(realpath "$0")")
2 changes: 2 additions & 0 deletions test_unstructured_ingest/src/sharepoint.sh
@@ -1,5 +1,7 @@
#!/usr/bin/env bash

# shellcheck disable=SC2329 # Functions are invoked indirectly

set -e

SRC_PATH=$(dirname "$(realpath "$0")")
2 changes: 1 addition & 1 deletion unstructured/__version__.py
@@ -1 +1 @@
-__version__ = "0.18.12" # pragma: no cover
+__version__ = "0.18.13" # pragma: no cover
8 changes: 7 additions & 1 deletion unstructured/partition/email.py
@@ -14,6 +14,8 @@
from email.message import EmailMessage, MIMEPart
from typing import IO, Any, Final, Iterator, cast

from dateutil import parser

from unstructured.documents.elements import Element, ElementMetadata
from unstructured.file_utils.model import FileType
from unstructured.partition.common import UnsupportedFileFormatError
@@ -279,7 +281,11 @@ def _sent_date(self) -> str | None:
date_str = self.msg.get("Date")
if not date_str:
return None
-sent_date = email.utils.parsedate_to_datetime(date_str)
+try:
+sent_date = parser.parse(date_str)
+except (parser.ParserError, TypeError, ValueError):
+return None

Collaborator review comment: maybe add a test for this route?

return sent_date.astimezone(dt.timezone.utc).isoformat(timespec="seconds")

def _validate(self) -> EmailPartitioningContext:
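
The fallback behavior of `_sent_date` can be illustrated without third-party dependencies. The sketch below is not the PR's implementation, which calls `dateutil.parser.parse`; it is a standard-library approximation. The function name `sent_date_iso` and the choice to treat timezone-naive values as UTC are assumptions made for illustration only.

```python
import datetime as dt
import email.utils


def sent_date_iso(date_str):
    """Return the Date header as a UTC ISO-8601 string, or None if unparseable.

    Hypothetical helper: a stdlib-only stand-in for the dateutil-based code.
    """
    if not date_str:
        return None
    try:
        # Handles standard RFC 2822 dates such as "Tue, 29 Jul 2025 12:42:06 +0000".
        parsed = email.utils.parsedate_to_datetime(date_str)
    except (TypeError, ValueError):
        # Fall back to ISO-8601. A trailing "Z" is rewritten as an explicit
        # offset because older `fromisoformat` versions reject the suffix.
        candidate = date_str.strip()
        if candidate.endswith("Z"):
            candidate = candidate[:-1] + "+00:00"
        try:
            parsed = dt.datetime.fromisoformat(candidate)
        except ValueError:
            return None
    if parsed.tzinfo is None:
        # Assumption for this sketch: treat naive timestamps as UTC.
        parsed = parsed.replace(tzinfo=dt.timezone.utc)
    return parsed.astimezone(dt.timezone.utc).isoformat(timespec="seconds")


print(sent_date_iso("2025-07-29T12:42:06.000Z"))
print(sent_date_iso("Tue, 29 Jul 2025 12:42:06 +0000"))
print(sent_date_iso("INVALID-DATE-FORMAT"))
```

The three calls use the `Date` values from the fixture files added in this PR: the first two normalize to the same UTC timestamp, and the invalid value yields `None` instead of raising.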