Parser rfc 2822 memory optimization #204
base: master
Conversation
Phase 1 of a memory usage optimization:
- Process content incrementally without materializing full arrays. The new `stream_extract_headers/1` function extracts headers line-by-line using `Stream.unfold` with `:binary.split`, accumulating only header lines.
- Track byte offsets to extract only the body portion when needed.

Phase 2 of a memory usage optimization:
- The `extract_parts_ranges/2` function identifies part boundaries first as lightweight `{offset, size}` tuples before extracting any content.
- Only small parts (configurable threshold, default 10 MB) are extracted and parsed; large parts are replaced with placeholder messages.
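The two phases above can be sketched as follows. This is a minimal illustration using `Stream.unfold/2` and `:binary.split/2` only; the module name, return shape, and offset bookkeeping here are simplified assumptions, not the PR's actual code:

```elixir
defmodule HeaderSketch do
  # Peel off one CRLF-terminated line per unfold step, so only header
  # lines are ever accumulated (Phase 1). Returns {header_lines,
  # body_offset}, where body_offset is the byte position right after
  # the blank separator line (Phase 2).
  def extract_headers(content) do
    Stream.unfold({content, 0}, fn
      nil ->
        nil

      {rest, offset} ->
        case :binary.split(rest, "\r\n") do
          [line, tail] ->
            next = offset + byte_size(line) + 2
            {{line, next}, {tail, next}}

          [line] ->
            {{line, offset + byte_size(line)}, nil}
        end
    end)
    |> Enum.reduce_while({[], 0}, fn {line, next_offset}, {headers, _} ->
      if line == "" do
        # Blank line reached: headers are done, body starts at next_offset
        {:halt, {Enum.reverse(headers), next_offset}}
      else
        {:cont, {[line | headers], next_offset}}
      end
    end)
  end
end

content = "Subject: Hi\r\nTo: a@b.c\r\n\r\nHello world"
{headers, body_offset} = HeaderSketch.extract_headers(content)
# headers == ["Subject: Hi", "To: a@b.c"]

# Only the body is ever materialized, via binary_part/3
body = binary_part(content, body_offset, byte_size(content) - body_offset)
# body == "Hello world"
```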
Just curious, what is the context for an email that large? @andrewtimberlake I'm 👍 on this PR. For the next step I think a formal parser is the real solution here. The best solution would be in C, to avoid all of the memory copies from pass-by-value function calls in Elixir. But if kept in Elixir, writing a straight-up single-pass parser is doable too, and would be far better than the original implementation I wrote.
Thanks for the quick comment!
Apparently, it’s common business practice to send emails with 40 MB+ PDF attachments 🤷 FYI: besides providing a more flexible way to handle large parts, I also plan on adding a public …
Thanks for the work on this.
lib/mail/parsers/rfc_2822.ex
Outdated
* `:charset_handler` - A function that takes a charset and binary and returns a binary. Defaults to returning the string as-is.
* `:header_only` - Whether to parse only the headers. Defaults to `false`.
* `:max_part_size` - The maximum size of a part in bytes. Defaults to 10 MB.
* `:skip_large_parts?` - Whether to skip parts larger than `max_part_size`. Defaults to `false`.
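As a usage illustration of the options above: a hedged sketch only, since these options come from an outdated revision of the PR and were later replaced; `parse/2` accepting them, and `raw_email`, are assumptions here.

```elixir
# Hypothetical call: cap parsed part size at 5 MB and skip anything larger
message =
  Mail.Parsers.RFC2822.parse(raw_email,
    max_part_size: 5 * 1024 * 1024,
    skip_large_parts?: true
  )
```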
We don’t need `:skip_large_parts?`, as we can just use `:max_part_size` (if not `nil`, skip).
Right, following your comments I went ahead and made more progress on my more flexible approach.
We now have a `parts_handler_fn` that allows for more deciding factors (size-based, content-type-based, etc.).
See the doc:
https://github.com/romsahel/elixir-mail/blob/9acedefdac663ffa2d72f57138ad669a8806e790/lib/mail/parsers/rfc_2822.ex#L48-L68
@doc """
* `:parts_handler_fn` - A function invoked for each multipart message part. Receives `part_info`, `message` (with parsed headers), and `opts`. Defaults to `nil` (normal parsing).

### Handler Return Values

* `:parse` - Continues with normal parsing of the part's body
* `%Mail.Message{}` - Returns a custom message structure (headers are already parsed, you provide the body)
lib/mail/parsers/rfc_2822.ex
Outdated
    |> Enum.map(fn {start, size} ->
      if skip_large_parts? and size > size_threshold do
        # Don't extract or parse large parts: return placeholder
        %Mail.Message{body: "[Part skipped: #{size} bytes - too large to parse]"}
I don’t love this. I had a look at your draft, and I think a way to handle these parts is important.
    assert message.body == nil
    [text_part, html_part, headers_only_part] = message.parts
We should be explicit in testing what a skipped part returns.
Replaces `max_part_size` and `skip_large_parts?` with a more flexible callback-based approach. The `parts_handler_fn` receives part metadata and parsed headers, allowing conditional parsing based on size, content-type, or any other criteria.
Problem Statement
We were experiencing heap exhaustion in the RFC2822 parser when processing large emails with attachments, ultimately triggering "maximum heap size reached" errors on our per-process memory-constrained system. Memory profiling showed heap size jumping to 500+ MB when processing a 46 MB eml.
Root Causes
1. Header Parsing Memory Accumulation
The original implementation used `String.split(content, "\r\n")` to parse the headers line by line. For large emails, this created massive arrays consuming significant heap space. Even after header extraction, the body lines remained in memory.
2. Body Parsing Memory Accumulation
After header extraction, the body was split into lines again. For multipart emails, each part was extracted as a full binary immediately, causing large attachments to be loaded entirely into memory for parsing. Recursive parsing of parts caused cascading memory growth.
Solution
1. Lazy Streaming

Process content incrementally without materializing full arrays. The new `stream_extract_headers/1` function extracts headers line-by-line using `Stream.unfold` with `:binary.split`, accumulating only header lines.

2. Byte Offset Tracking

Instead of extracting content, we track where content is located:
- `stream_extract_headers/1` returns `{headers, body_offset, has_separator}`, where `body_offset` is simply an integer indicating the byte position where the body starts
- `binary_part(content, body_offset, size)` is used to extract only the body portion when needed
- `extract_parts_ranges/2` returns `[{offset1, size1}, {offset2, size2}, ...]` instead of `[part1_binary, part2_binary, ...]`

3. Conditional Extraction
The `extract_parts_ranges/2` function identifies part boundaries first as lightweight `{offset, size}` tuples before extracting any content. The `parse` function takes a `:parts_handler_fn` argument that allows configuring how subsequent parts are handled. An example would be to parse only the headers of parts whose size is greater than a given threshold:
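A sketch of such a threshold-based handler, per the handler contract documented earlier (return `:parse` to parse normally, or a message struct to supply the body yourself). The `%{size: ...}` shape of `part_info` is an assumption for illustration:

```elixir
size_threshold = 10 * 1024 * 1024

# Hypothetical handler: for parts over the threshold, keep the
# already-parsed headers and skip body parsing; otherwise parse normally.
parts_handler = fn part_info, message, _opts ->
  if part_info.size > size_threshold do
    # Map-update syntax also works on the %Mail.Message{} struct,
    # keeping headers intact while leaving the body unparsed
    %{message | body: nil}
  else
    :parse
  end
end

# Mail.Parsers.RFC2822.parse(raw_email, parts_handler_fn: parts_handler)
```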
Notes