[SPARK-53349][SQL] Optimized XML parser can't handle corrupted files correctly #52093

xiaonanyang-db · 2025-08-22T01:50:36Z

What changes were proposed in this pull request?

In #51287, we introduced an optimized XML parser that is more memory-efficient. However, the new parser reads the input stream eagerly on initialization. If the file is corrupted, the resulting error is thrown before the existing error handling runs, so it is not handled according to the ignoreCorruptFiles option. This PR addresses the issue.
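The failure mode described above can be illustrated with a minimal, self-contained sketch. Nothing here is Spark code: `CorruptStream`, `EagerReader`, and `readGuardingIterationOnly` are hypothetical names, and the `ignoreCorruptFiles` parameter only imitates the option's intent. The sketch assumes the guard wraps record iteration but not reader construction, which is enough to let an initialization error escape.

```scala
import java.io.{IOException, InputStream}

object EagerInitDemo {
  // Hypothetical stand-in for a corrupted file: every read fails.
  class CorruptStream extends InputStream {
    override def read(): Int = throw new IOException("corrupted input")
  }

  // Hypothetical reader that touches the stream in its constructor.
  class EagerReader(in: InputStream) {
    private var nextByte: Int = in.read() // eager read on initialization
    def hasMoreRecord: Boolean = nextByte != -1
    def nextRecord(): Int = {
      val current = nextByte
      nextByte = in.read()
      current
    }
  }

  // Only the iteration is guarded, so a constructor failure bypasses the guard.
  def readGuardingIterationOnly(in: InputStream, ignoreCorruptFiles: Boolean): Seq[Int] = {
    val reader = new EagerReader(in) // throws here for a corrupted stream
    try {
      val records = scala.collection.mutable.ArrayBuffer.empty[Int]
      while (reader.hasMoreRecord) records += reader.nextRecord()
      records.toList
    } catch {
      case _: IOException if ignoreCorruptFiles => Seq.empty
    }
  }
}
```

In this sketch, passing a `CorruptStream` with `ignoreCorruptFiles = true` still raises an `IOException`, because the exception is thrown before the `try` block is entered.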

Why are the changes needed?

Bug fix

Does this PR introduce any user-facing change?

No

How was this patch tested?

New tests.

Was this patch authored or co-authored using generative AI tooling?

sandip-db · 2025-08-22T02:05:28Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala


-    override def hasNext: Boolean = reader.hasMoreRecord


Instead of catching the same errors in multiple places, can you defer the reader initialization?

    override def hasNext: Boolean = {
      if (reader == null) {
        reader = StaxXMLRecordReader(inputStream, options)
      }
      reader.hasMoreRecord
    }

We can defer the initialization this way. However, hasNext is called from outside, before the actual row parsing begins, so the error would not be caught with this approach either.
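The trade-off discussed in this thread can be sketched without Spark. The names below (`LazyRecordIterator`, `CorruptStream`, `readAll`) are hypothetical and this is not the change that was merged. It only illustrates that deferring initialization moves the failure from construction to the first `hasNext` call, so whatever guards against corrupt files has to cover that call too.

```scala
import java.io.{IOException, InputStream}

object LazyInitDemo {
  // Hypothetical stand-in for a corrupted file: every read fails.
  class CorruptStream extends InputStream {
    override def read(): Int = throw new IOException("corrupted input")
  }

  // Hypothetical iterator that performs its first read on first use,
  // not in the constructor.
  class LazyRecordIterator(in: InputStream) extends Iterator[Int] {
    private var started = false
    private var nextByte = -1

    private def ensureStarted(): Unit =
      if (!started) {
        nextByte = in.read() // first I/O happens here
        started = true
      }

    override def hasNext: Boolean = {
      ensureStarted()
      nextByte != -1
    }

    override def next(): Int = {
      if (!hasNext) throw new NoSuchElementException("no more records")
      val current = nextByte
      nextByte = in.read()
      current
    }
  }

  // The guard wraps the whole traversal, including the first hasNext call.
  def readAll(in: InputStream, ignoreCorruptFiles: Boolean): Seq[Int] = {
    val records = new LazyRecordIterator(in) // no I/O yet, cannot fail here
    try records.toList
    catch { case _: IOException if ignoreCorruptFiles => Seq.empty }
  }
}
```

If a caller invokes `hasNext` on such an iterator outside any guard, the `IOException` still escapes, which is the concern raised in the reply above.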

I have updated the code to consolidate the error handling for corrupt or missing files during parser initialization here.

Actually, I found another way to defer the initialization. I have updated the PR with the changes, PTAL.

sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/xml/XmlSuite.scala

HyukjinKwon

Let's make the CI result green

HyukjinKwon · 2025-08-26T06:07:47Z

Merged to master.

github-actions bot added the SQL label Aug 22, 2025

sandip-db reviewed Aug 22, 2025

xiaonanyang-db requested a review from sandip-db August 22, 2025 18:16

draft

88ff2ab

xiaonanyang-db force-pushed the SPARK-53349 branch from d5665fb to 88ff2ab Compare August 22, 2025 18:28

u

662cc76

xiaonanyang-db commented Aug 22, 2025

sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/xml/XmlSuite.scala

u

d10a865

sandip-db approved these changes Aug 22, 2025

u

63cbb77

HyukjinKwon approved these changes Aug 24, 2025

xiaonanyang-db added 2 commits August 24, 2025 21:31

u

87cc41e

u

eb61cb4

HyukjinKwon closed this in f0a3a2e Aug 26, 2025
