[SPARK-52582][SQL] Improve the memory usage of XML parser
### What changes were proposed in this pull request?
Today, the XML parser is not memory efficient. It loads each XML record into memory before parsing it, which causes OOMs when an input XML record is large.
This PR changes the parser to process each XML record token by token instead of copying the entire record into memory ahead of time. The improved parser uses less memory than the legacy parser when parsing large XML files. However, it enforces stricter validation to ensure the XML is well-formed:
1. The legacy parser doesn't scavenge all valid records deterministically. The improved parser, by contrast, stops processing a file at the point where it detects malformed XML.
2. The legacy parser was able to handle malformed XML files with multiple root tags. The improved parser only reads the records in the first root tag.
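To illustrate the token-by-token idea in general terms, here is a minimal sketch using Python's standard-library `xml.etree.ElementTree.iterparse`. It is not the code in this PR, and the helper name `iter_records` and its `row_tag` parameter are hypothetical. It emits each record as soon as its closing tag is seen and raises a parse error when it reaches malformed input, rather than buffering whole records up front.

```python
import io
import xml.etree.ElementTree as ET


def iter_records(source, row_tag):
    """Yield each <row_tag> element as a dict of child tag -> text.

    Illustrative stand-in only: the parser is fed incrementally and a
    record is emitted when its end tag arrives.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == row_tag:
            yield {child.tag: child.text for child in elem}
            # Drop the children of the record just emitted so they can be freed.
            elem.clear()


# Usage: two well-formed records under a single root tag.
xml_bytes = b"<root><rec><a>1</a></rec><rec><a>2</a></rec></root>"
records = list(iter_records(io.BytesIO(xml_bytes), "rec"))
```

Consuming the generator over input with a mismatched closing tag raises `xml.etree.ElementTree.ParseError` instead of silently skipping the bad record.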
The improved parser is enabled by default, but users can fall back to the legacy parser via the `spark.sql.xml.legacyParser.enabled` SQL conf.
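As a sketch of how the conf might be toggled from PySpark, assuming an existing `SparkSession` and the conf name given above: the helper `set_legacy_xml_parser` is a hypothetical name introduced here for illustration, and it relies only on the session's `conf.set` method.

```python
# Conf name as stated in this PR description.
LEGACY_XML_PARSER_CONF = "spark.sql.xml.legacyParser.enabled"


def set_legacy_xml_parser(spark, enabled):
    """Turn the legacy XML parser on or off for the given session.

    `spark` is assumed to be an active SparkSession (anything exposing
    `conf.set(key, value)` works for this sketch).
    """
    spark.conf.set(LEGACY_XML_PARSER_CONF, str(enabled).lower())
```

With an active session, `set_legacy_xml_parser(spark, True)` would be called before reading the XML source.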
### Why are the changes needed?
Solve the OOM issue in XML ingestion.
### Does this PR introduce _any_ user-facing change?
No. The new behavior is disabled by default for now.
### How was this patch tested?
New UTs.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes #51287 from xiaonanyang-db/SPARK-52582.
Authored-by: Xiaonan Yang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>