Parser rfc 2822 memory optimization #204
base: master
Conversation
Phase 1 of a memory usage optimization:
- Process content incrementally without materializing full arrays. The new `stream_extract_headers/1` function extracts headers line-by-line using `Stream.unfold` with `:binary.split`, accumulating only header lines.
- Track byte offsets to extract only the body portion when needed.

Phase 2 of a memory usage optimization:
- The `extract_parts_ranges/2` function identifies part boundaries first as lightweight `{offset, size}` tuples before extracting any content.
- Only small parts (configurable threshold, default 10 MB) are extracted and parsed; large parts are replaced with placeholder messages.
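The two phases above can be sketched as follows. This is a minimal illustration using `Stream.unfold/2` and `:binary.split/2` only; the module name, return shape, and offset bookkeeping here are simplified assumptions, not the PR's actual code:

```elixir
defmodule HeaderSketch do
  # Peel off one CRLF-terminated line per unfold step, so only header
  # lines are ever accumulated (Phase 1). Returns {header_lines,
  # body_offset}, where body_offset is the byte position right after
  # the blank separator line (Phase 2).
  def extract_headers(content) do
    Stream.unfold({content, 0}, fn
      nil ->
        nil

      {rest, offset} ->
        case :binary.split(rest, "\r\n") do
          [line, tail] ->
            next = offset + byte_size(line) + 2
            {{line, next}, {tail, next}}

          [line] ->
            {{line, offset + byte_size(line)}, nil}
        end
    end)
    |> Enum.reduce_while({[], 0}, fn {line, next_offset}, {headers, _} ->
      if line == "" do
        # Blank line reached: headers are done, body starts at next_offset
        {:halt, {Enum.reverse(headers), next_offset}}
      else
        {:cont, {[line | headers], next_offset}}
      end
    end)
  end
end

content = "Subject: Hi\r\nTo: a@b.c\r\n\r\nHello world"
{headers, body_offset} = HeaderSketch.extract_headers(content)
# headers == ["Subject: Hi", "To: a@b.c"]

# Only the body is ever materialized, via binary_part/3
body = binary_part(content, body_offset, byte_size(content) - body_offset)
# body == "Hello world"
```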
Just curious, what is the context for an email that large? @andrewtimberlake I'm 👍 on this PR. For the next step I think a formal parser is the real solution here. The best solution would be in C, to avoid all of the memory copies from pass-by-value function calls in Elixir. But if kept in Elixir, writing a straight-up single-pass parser is doable too, and would be far better than the original implementation I wrote.
Thanks for the quick comment!
Apparently, it’s common business practice to send emails with 40 MB+ PDF attachments 🤷 FYI: besides providing a more flexible way to handle large parts, I also plan on adding a public …
Thanks for the work on this.
lib/mail/parsers/rfc_2822.ex
Outdated
* `:charset_handler` - A function that takes a charset and binary and returns a binary. Defaults to returning the string as-is.
* `:header_only` - Whether to parse only the headers. Defaults to `false`.
* `:max_part_size` - The maximum size of a part in bytes. Defaults to 10 MB.
* `:skip_large_parts?` - Whether to skip parts larger than `max_part_size`. Defaults to `false`.
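As a usage illustration of the options above: a hedged sketch only, since these options come from an outdated revision of the PR and were later replaced; `parse/2` accepting them, and `raw_email`, are assumptions here.

```elixir
# Hypothetical call: cap parsed part size at 5 MB and skip anything larger
message =
  Mail.Parsers.RFC2822.parse(raw_email,
    max_part_size: 5 * 1024 * 1024,
    skip_large_parts?: true
  )
```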
We don’t need `:skip_large_parts?`, as we can just use `:max_part_size` (if not `nil`, skip).
Right, following your comments I went ahead and made more progress on my more flexible approach.
We now have a `parts_handler_fn` that allows for more deciding factors (size-based, content-type-based, etc.).
See the doc:
https://github.com/romsahel/elixir-mail/blob/9acedefdac663ffa2d72f57138ad669a8806e790/lib/mail/parsers/rfc_2822.ex#L48-L68
@doc """
* `:parts_handler_fn` - A function invoked for each multipart message part. Receives `part_info`, `message` (with parsed headers), and `opts`. Defaults to `nil` (normal parsing).

### Handler Return Values

* `:parse` - Continues with normal parsing of the part's body
* `%Mail.Message{}` - Returns a custom message structure (headers are already parsed, you provide the body)
lib/mail/parsers/rfc_2822.ex
Outdated
    |> Enum.map(fn {start, size} ->
      if skip_large_parts? and size > size_threshold do
        # Don't extract or parse large parts: return placeholder
        %Mail.Message{body: "[Part skipped: #{size} bytes - too large to parse]"}
I don’t love this. I had a look at your draft, and I think a way to handle these parts is important.
    assert message.body == nil
    [text_part, html_part, headers_only_part] = message.parts
We should be explicit in testing what a skipped part returns.
Replaces `max_part_size` and `skip_large_parts?` with a more flexible callback-based approach. The `parts_handler_fn` receives part metadata and parsed headers, allowing conditional parsing based on size, content-type, or any other criteria.
Problem Statement
We were experiencing heap exhaustion in the RFC2822 parser when processing large emails with attachments, ultimately triggering "maximum heap size reached" errors on our per-process memory-constrained system. Memory profiling showed heap size jumping to 500+ MB when processing a 46 MB eml.
Root Causes
1. Header Parsing Memory Accumulation
The original implementation used `String.split(content, "\r\n")` to parse the headers line by line. For large emails, this created massive arrays consuming significant heap space. Even after header extraction, the body lines remained in memory.
2. Body Parsing Memory Accumulation
After header extraction, the body was split into lines again. For multipart emails, each part was extracted as a full binary immediately, causing large attachments to be loaded entirely into memory for parsing. Recursive parsing of parts caused cascading memory growth.
Solution
1. Lazy Streaming

Process content incrementally without materializing full arrays. The new `stream_extract_headers/1` function extracts headers line-by-line using `Stream.unfold` with `:binary.split`, accumulating only header lines.

2. Byte Offset Tracking

Instead of extracting content, we track where content is located:
- `stream_extract_headers/1` returns `{headers, body_offset, has_separator}`, where `body_offset` is simply an integer indicating the byte position where the body starts
- `binary_part(content, body_offset, size)` is used to extract only the body portion when needed
- `extract_parts_ranges/2` returns `[{offset1, size1}, {offset2, size2}, ...]` instead of `[part1_binary, part2_binary, ...]`

3. Conditional Extraction
The `extract_parts_ranges/2` function identifies part boundaries first as lightweight `{offset, size}` tuples before extracting any content. The `parse` function takes a `:parts_handler_fn` argument that allows configuring how subsequent parts are handled. An example would be to parse only the headers of parts whose size is greater than a given threshold:
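A sketch of such a threshold-based handler, per the handler contract documented earlier (return `:parse` to parse normally, or a message struct to supply the body yourself). The `%{size: ...}` shape of `part_info` is an assumption for illustration:

```elixir
size_threshold = 10 * 1024 * 1024

# Hypothetical handler: for parts over the threshold, keep the
# already-parsed headers and skip body parsing; otherwise parse normally.
parts_handler = fn part_info, message, _opts ->
  if part_info.size > size_threshold do
    # Map-update syntax also works on the %Mail.Message{} struct,
    # keeping headers intact while leaving the body unparsed
    %{message | body: nil}
  else
    :parse
  end
end

# Mail.Parsers.RFC2822.parse(raw_email, parts_handler_fn: parts_handler)
```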
Notes