[SPARK-53349][SQL] Optimized XML parser can't handle corrupted files correctly #52093
override def hasNext: Boolean = reader.hasMoreRecord
Instead of catching the same errors in multiple places, can you defer the reader initialization?
override def hasNext: Boolean = {
  if (reader == null) {
    reader = StaxXMLRecordReader(inputStream, options)
  }
  reader.hasMoreRecord
}
We can defer the initialization this way, but the hasNext function is called by the caller before the actual row parsing, so the error would not be caught with this approach either.
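To illustrate the concern in this reply, here is a minimal, self-contained sketch. It is not Spark code: `EagerReader`, `DeferredIterator`, and `consume` are hypothetical stand-ins, and the assumption is that the caller only guards the per-row parsing step. Under that assumption, moving the reader construction into hasNext moves the failure to the hasNext call, which sits outside the guarded step.

```scala
import java.io.IOException

object DeferredInitDemo {
  // Hypothetical stand-in for a reader that reads eagerly when constructed.
  final class EagerReader(corrupted: Boolean) {
    if (corrupted) throw new IOException("corrupted input")
    private var remaining = 1
    def hasMoreRecord: Boolean = remaining > 0
    def nextRecord(): String = { remaining -= 1; "<row/>" }
  }

  // Iterator that defers construction of the reader to the first hasNext call.
  final class DeferredIterator(corrupted: Boolean) extends Iterator[String] {
    private var reader: EagerReader = null
    override def hasNext: Boolean = {
      if (reader == null) reader = new EagerReader(corrupted)
      reader.hasMoreRecord
    }
    override def next(): String = {
      if (!hasNext) throw new NoSuchElementException("no more records")
      reader.nextRecord()
    }
  }

  // Hypothetical caller that only protects the row parsing step.
  // The outer try exists solely to report where the failure escaped.
  def consume(it: Iterator[String]): Either[String, List[String]] = {
    val out = List.newBuilder[String]
    try {
      while (it.hasNext) { // the IOException surfaces here, outside the inner guard
        try {
          out += it.next()
        } catch {
          case _: IOException => () // only row parsing is guarded
        }
      }
      Right(out.result())
    } catch {
      case e: IOException => Left(e.getMessage)
    }
  }
}
```

With a corrupted input, `consume` returns a `Left`, because the exception escapes from hasNext rather than from the guarded `next()` call.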
I have updated the code to consolidate the error handling of corrupt/missing files for parser initialization here
Actually, I found another way to defer the initialization. Updated the PR with the changes, PTAL
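The updated change itself is not shown in this conversation. Purely as an illustration of the general idea, one way to defer initialization while keeping the error handling in a single place is to create the source lazily and route failures through one handler. Everything below is hypothetical: `guarded`, `skipCorrupt`, and `open` are names invented for this sketch, not Spark APIs.

```scala
import java.io.IOException

object GuardedInitDemo {
  // Wraps a lazily opened source. A failure while opening it, or while asking
  // for more records, goes through the single `handle` method.
  // `skipCorrupt` is a hypothetical flag standing in for a "skip corrupt files" setting.
  def guarded[T](skipCorrupt: Boolean)(open: () => Iterator[T]): Iterator[T] =
    new Iterator[T] {
      private var failed = false
      // Not evaluated until the first hasNext call.
      private lazy val underlying: Iterator[T] = open()

      private def handle[A](default: A)(body: => A): A =
        try body
        catch {
          case _: IOException if skipCorrupt =>
            failed = true // treat the source as exhausted from now on
            default
        }

      override def hasNext: Boolean =
        !failed && handle(false)(underlying.hasNext)

      // Errors thrown by underlying.next() itself are left to the caller here.
      override def next(): T =
        if (hasNext) underlying.next()
        else throw new NoSuchElementException("no more records")
    }
}
```

In this sketch, a source that fails to open yields an empty iterator when `skipCorrupt` is true, and the exception propagates from the first hasNext call when it is false.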
d5665fb to 88ff2ab
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/xml/XmlSuite.scala
Let's make the CI result green
Merged to master.
What changes were proposed in this pull request?
In #51287, we introduced an optimized XML parser, which is more memory-efficient. However, the new parser reads the input stream eagerly on initialization. If the file is corrupted, the resulting error is not caught and handled according to the ignoreCorruptFiles option. This PR addresses the issue.
Why are the changes needed?
Bug fix
Does this PR introduce any user-facing change?
No
How was this patch tested?
New tests.
Was this patch authored or co-authored using generative AI tooling?