Conversation

@romsahel
Contributor

@romsahel romsahel commented Oct 7, 2025

Problem Statement

We were experiencing heap exhaustion in the RFC2822 parser when processing large emails with attachments, ultimately triggering "maximum heap size reached" errors on our per-process memory-constrained system.

Memory profiling showed heap size jumping to 500+ MB when processing a 46 MB EML file.

Root Causes

1. Header Parsing Memory Accumulation

The original implementation used String.split(content, "\r\n") to parse the headers line by line. For large emails, this created massive arrays consuming significant heap space. Even after header extraction, the body lines remained in memory.

2. Body Parsing Memory Accumulation

After header extraction, the body was split into lines again. For multipart emails, each part was extracted as a full binary immediately, causing large attachments to be loaded entirely into memory for parsing. Recursive parsing of parts caused cascading memory growth.

Solution

1. Lazy Streaming

Process content incrementally without materializing full arrays. The new stream_extract_headers/1 function extracts headers line-by-line using Stream.unfold with :binary.split, accumulating only header lines.
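The streaming idea can be sketched with stdlib calls only. Module and function names below are hypothetical; the real stream_extract_headers/1 also tracks the byte offset and whether the header/body separator was found.

```elixir
# Sketch: lazily walk the raw message with Stream.unfold/2, splitting off
# one CRLF-terminated line at a time via :binary.split/2. Only header
# lines are accumulated; the body is never split into lines.
defmodule HeaderSketch do
  def extract_headers(content) do
    content
    |> Stream.unfold(fn
      "" ->
        nil

      rest ->
        case :binary.split(rest, "\r\n") do
          [line, tail] -> {line, tail}
          [line] -> {line, ""}
        end
    end)
    # An empty line separates headers from body (RFC 2822)
    |> Enum.take_while(&(&1 != ""))
  end
end
```

Because :binary.split returns sub-binaries that reference the original binary, no line array for the body is ever materialized.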

2. Byte Offset Tracking

Instead of extracting content, we track where content is located:

  • stream_extract_headers/1 returns {headers, body_offset, has_separator} where body_offset is simply an integer indicating the byte position where the body starts
  • binary_part(content, body_offset, size) extracts only the body portion when needed
  • Similarly, extract_parts_ranges/2 returns [{offset1, size1}, {offset2, size2}, ...] instead of [part1_binary, part2_binary, ...]
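A minimal sketch of the offset idea, using only stdlib calls. This is illustrative: here body_offset is found by locating the first blank line, whereas the real parser computes it while streaming the headers.

```elixir
# Sketch: record the byte offset where the body starts instead of copying
# the body, then slice it out with binary_part/3 only when needed.
defmodule OffsetSketch do
  def body_range(content) do
    case :binary.match(content, "\r\n\r\n") do
      {pos, len} ->
        offset = pos + len
        {offset, byte_size(content) - offset}

      :nomatch ->
        {byte_size(content), 0}
    end
  end

  # Extract the body lazily from the stored {offset, size} range
  def body(content, {offset, size}), do: binary_part(content, offset, size)
end
```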

3. Conditional Extraction

The extract_parts_ranges/2 function identifies part boundaries first as lightweight {offset, size} tuples before extracting any content.
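A simplified sketch of the boundary scan (hypothetical module name; it assumes every delimiter, including the closing one, is preceded and followed by CRLF, and it skips the part-header handling the real function does):

```elixir
# Sketch: find every "\r\n--boundary" delimiter with :binary.matches/2 and
# describe each part as a lightweight {offset, size} tuple. No part content
# is copied until a range is actually selected for parsing.
defmodule RangeSketch do
  def part_ranges(content, boundary) do
    content
    |> :binary.matches("\r\n--" <> boundary)
    |> Enum.chunk_every(2, 1, :discard)
    |> Enum.map(fn [{start, len}, {next, _}] ->
      # skip the delimiter and its trailing CRLF
      offset = start + len + 2
      {offset, next - offset}
    end)
  end
end
```

Each tuple can later be turned into a binary with binary_part/3, so only the parts that are actually parsed ever get extracted.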

The parse function takes a :parts_handler_fn option that lets callers configure how individual parts are handled.
For example, to parse only the headers of parts whose size exceeds a given threshold:

  parts_handler_fn: fn part_info, message, _opts ->
    if part_info.size > 25 do
      Map.put(message, :body, "[Headers only - body skipped: #{part_info.size} bytes]")
    else
      :parse
    end
  end

Notes

  • The public API has not changed apart from the newly available options; previous behavior is preserved.

Phase 1 of a memory usage optimization:
- Process content incrementally without materializing full arrays.
  The new stream_extract_headers/1 function extracts headers line by line
  using Stream.unfold with :binary.split, accumulating only header lines.
- Byte offset tracking to extract only the body portion when needed.

Phase 2 of a memory usage optimization:
The extract_parts_ranges/2 function identifies part boundaries
first as lightweight {offset, size} tuples before extracting any content.

Only small parts (configurable threshold, default 10 MB) are extracted and parsed;
large parts are replaced with placeholder messages.
@romsahel romsahel marked this pull request as ready for review October 7, 2025 07:46
@bcardarella
Member

Just curious what the context is for an email that large?

@andrewtimberlake I'm 👍 on this PR

For the next step I think a formal parser is the real solution here. The best option would be in C, to avoid all of the memory copying from pass-by-value function calls in Elixir. But if kept in Elixir, then writing a straight-up single-pass parser is doable too, and going to be far better than the original implementation I wrote.

@romsahel
Contributor Author

romsahel commented Oct 9, 2025

Thanks for the quick comment!

Just curious what the context is for an email that large?

Apparently, it’s common business practice to send emails with 40 MB+ PDF attachments 🤷

FYI: besides providing a more flexible way to handle large parts, I also plan to add a public parse function that can take a Stream as input. On our end, that would let us stream the EML directly into the parser, so, I think, we'd never have to load the full EML into memory.

Collaborator

@andrewtimberlake andrewtimberlake left a comment


Thanks for the work on this.

* `:charset_handler` - A function that takes a charset and binary and returns a binary. Defaults to returning the string as-is.
* `:header_only` - Whether to parse only the headers. Defaults to false.
* `:max_part_size` - The maximum size of a part in bytes. Defaults to 10MB.
* `:skip_large_parts?` - Whether to skip parts larger than `max_part_size`. Defaults to false.
Collaborator


We don’t need :skip_large_parts? as we can just use :max_part_size (if not nil, skip)

Contributor Author

@romsahel romsahel Oct 12, 2025


Right, following your comments I went ahead and made more progress on my more flexible approach.
We now have a parts_handler_fn that supports more deciding factors (size-based, content-type-based, etc.).
See the doc:
https://github.com/romsahel/elixir-mail/blob/9acedefdac663ffa2d72f57138ad669a8806e790/lib/mail/parsers/rfc_2822.ex#L48-L68

  @doc """
    * `:parts_handler_fn` - A function invoked for each multipart message part. Receives `part_info`, `message` (with parsed headers), and `opts`. Defaults to nil (normal parsing).

  ### Handler Return Values

    * `:parse` - Continues with normal parsing of the part's body
    * `%Mail.Message{}` - Returns a custom message structure (headers are already parsed, you provide the body)
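For instance, a content-type-based handler could be sketched as below. Plain maps stand in for the parsed %Mail.Message{} here, and the exact header-map shape is an assumption, not the library's guaranteed structure:

```elixir
# Hypothetical handler: skip the bodies of application/pdf parts but
# parse everything else. `part_info.size` and the parsed headers in
# `message` are both available for the decision.
skip_pdfs = fn _part_info, message, _opts ->
  case message.headers["content-type"] do
    ["application/pdf" | _] -> Map.put(message, :body, "[PDF body skipped]")
    _ -> :parse
  end
end
```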

|> Enum.map(fn {start, size} ->
  if skip_large_parts? and size > size_threshold do
    # Don't extract or parse large parts: return placeholder
    %Mail.Message{body: "[Part skipped: #{size} bytes - too large to parse]"}
Collaborator


I don’t love this. I had a look at your draft, and I think a way to handle these parts is important.


assert message.body == nil

[text_part, html_part, headers_only_part] = message.parts
Collaborator


We should be explicit in testing what a skipped part returns.

Replaces max_part_size and skip_large_parts? with a more flexible callback-based approach.
The parts_handler_fn receives part metadata and parsed headers,
allowing conditional parsing based on size, content-type, or any other criteria.
@romsahel romsahel force-pushed the parser-rfc-2822-memory-optimization branch 2 times, most recently from 704ac07 to 9acedef Compare October 12, 2025 21:35
