xiaonanyang-db
Contributor

What changes were proposed in this pull request?

In #51287, we introduced an optimized XML parser that is more memory-efficient. However, the new parser reads the input stream eagerly on initialization. If the file is corrupted, the resulting error is not caught and handled according to the ignoreCorruptFiles option. This PR addresses the issue.

Why are the changes needed?

Bug fix

Does this PR introduce any user-facing change?

No

How was this patch tested?

New tests.

Was this patch authored or co-authored using generative AI tooling?

@github-actions github-actions bot added the SQL label Aug 22, 2025

override def hasNext: Boolean = reader.hasMoreRecord
Contributor

Instead of catching the same errors in multiple places, can you defer the reader initialization?

override def hasNext: Boolean = {
  if (reader == null) {
    reader = StaxXMLRecordReader(inputStream, options)
  }
  reader.hasMoreRecord
}
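
For illustration only, the same deferral can be sketched with a `lazy val` instead of a null check. This is a minimal, self-contained sketch and not the code in this PR: `EagerReader`, `DeferredIterator`, and the `onInit` callback are hypothetical stand-ins, with `EagerReader` playing the role of a reader that does its work in the constructor.

```scala
object DeferredReaderSketch {
  // Hypothetical stand-in for a reader that does its work eagerly on construction.
  final class EagerReader(records: Seq[String], onInit: () => Unit) {
    onInit() // simulates the eager read performed when the reader is created
    private val it = records.iterator
    def hasMoreRecord: Boolean = it.hasNext
    def nextRecord(): String = it.next()
  }

  // Iterator that postpones constructing the reader until it is first needed.
  final class DeferredIterator(records: Seq[String], onInit: () => Unit)
      extends Iterator[String] {
    // `lazy val` has the same effect as the null check, without a mutable field.
    private lazy val reader = new EagerReader(records, onInit)
    override def hasNext: Boolean = reader.hasMoreRecord
    override def next(): String = reader.nextRecord()
  }

  def main(args: Array[String]): Unit = {
    var initCount = 0
    val rows = new DeferredIterator(Seq("<a/>", "<b/>"), () => initCount += 1)
    println(s"readers constructed before hasNext: $initCount")
    println(s"rows: ${rows.toList}")
    println(s"readers constructed after iteration: $initCount")
  }
}
```

With this shape, creating the iterator touches nothing; the first `hasNext` or `next()` call is what triggers construction, so any failure during initialization surfaces at that call site.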

Contributor Author

@xiaonanyang-db xiaonanyang-db Aug 22, 2025

We can defer the initialization this way. However, hasNext is called from outside, before the actual row parsing, so the error won't be caught with this approach either.
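
A minimal sketch of the concern, under stated assumptions: `FailingOnInit`, `drainGuardingNextOnly`, and `drainGuardingBoth` are hypothetical names, and the simulated `IOException` stands in for whatever a corrupt file would raise. It only shows that when initialization is deferred into `hasNext`, a handler wrapped around `next()` alone never sees the failure.

```scala
import java.io.IOException

object HasNextFailureSketch {
  // Hypothetical iterator whose deferred initialization fails, as on a corrupt file.
  final class FailingOnInit extends Iterator[String] {
    private lazy val records: Iterator[String] =
      throw new IOException("simulated corrupt file")
    override def hasNext: Boolean = records.hasNext
    override def next(): String = records.next()
  }

  // Guards only next(): hasNext runs outside the try, so the failure escapes.
  def drainGuardingNextOnly(it: Iterator[String]): Seq[String] = {
    val out = Seq.newBuilder[String]
    while (it.hasNext) {
      try { out += it.next() }
      catch { case _: IOException => () }
    }
    out.result()
  }

  // Guards hasNext as well and treats the failure as an empty (skipped) file.
  def drainGuardingBoth(it: Iterator[String]): Seq[String] = {
    val out = Seq.newBuilder[String]
    try {
      while (it.hasNext) out += it.next()
    } catch { case _: IOException => () }
    out.result()
  }
}
```

Calling `drainGuardingNextOnly(new FailingOnInit)` propagates the exception to the caller, while `drainGuardingBoth(new FailingOnInit)` returns an empty sequence. Wherever the initialization ends up, it has to run inside the code path that applies the corrupt-file handling.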

Contributor Author

@xiaonanyang-db xiaonanyang-db Aug 22, 2025

I have updated the code to consolidate the error handling of corrupt/missing files for parser initialization.

Contributor Author

Actually, I found another way to defer the initialization. I have updated the PR with the changes, PTAL.

Member

@HyukjinKwon HyukjinKwon left a comment

Let's make the CI result green

@HyukjinKwon
Member

Merged to master.
