[Parquet] Minor: Update comments in page decompressor #8764
// decompressed size of zero corresponds to a page with only null values
// see https://github.com/apache/parquet-format/blob/master/README.md#data-pages
Generally this looks OK to me, but I don't know whether some pages could simply be empty (with num_rows == 0). I know that most writers currently would not write such a page.
When I did some code coverage testing, I found it was rare but not impossible (it happens 3 times in the parquet-rs suite: #8756 (comment))
Would "page without values" or some similar wording be better here?
> Would "page without values" or some similar wording be better here?
I think the current wording is correct. In this case num_values is non-zero, so there are values, but they all just happen to be null (and thus are not encoded in the data).
I'd think the ways decompressed_size can be zero are 1) a v2 page with all null values, or 2) a v1 page with no nesting and all nulls at the top level (so D is always 0). The second case would have definition level data.
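To illustrate the first case, here is a minimal sketch of the size arithmetic for a v2 data page. It assumes that the levels of a v2 page are stored uncompressed ahead of the values and that `uncompressed_page_size` covers both levels and values. The struct `DataPageV2Sizes` and both functions are hypothetical names made up for this example; they are not the parquet crate's types or API, only a simplified view of the header fields involved.

```rust
/// Hypothetical, simplified view of the size-related fields of a v2 data
/// page header. This is NOT the parquet crate's type; it only mirrors the
/// fields discussed above.
#[derive(Debug, Clone, Copy)]
struct DataPageV2Sizes {
    /// Total uncompressed size of the page body (levels + values).
    uncompressed_page_size: usize,
    /// Bytes of definition levels, stored uncompressed in a v2 page.
    definition_levels_byte_length: usize,
    /// Bytes of repetition levels, stored uncompressed in a v2 page.
    repetition_levels_byte_length: usize,
    /// Number of values in the page, nulls included.
    num_values: usize,
    /// Number of null values in the page.
    num_nulls: usize,
}

/// Size of the values section after decompression, or `None` if the header
/// is inconsistent (levels longer than the whole page, or overflow).
fn decompressed_values_size(h: &DataPageV2Sizes) -> Option<usize> {
    let levels = h
        .definition_levels_byte_length
        .checked_add(h.repetition_levels_byte_length)?;
    h.uncompressed_page_size.checked_sub(levels)
}

/// True when the page has values but every one of them is null, so nothing
/// is encoded in the values section.
fn is_all_null(h: &DataPageV2Sizes) -> bool {
    h.num_values > 0 && h.num_nulls == h.num_values
}
```

Under these assumptions, a page with `num_values = 100`, `num_nulls = 100`, and an `uncompressed_page_size` equal to the combined level lengths yields `Some(0)`, which is the situation where a reader could skip calling the decompressor.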
> I think the current wording is correct. In this case num_values is non-zero, so there are values, but they all just happen to be null (and thus are not encoded in the data).
I mean, I don't know whether a case where just num_rows (not num_values) == 0 could exist.
Which issue does this PR close?
Rationale for this change
@etseidl comments: #8756 (comment)
While I was in here, I also wanted to capture the learning based on @mapleFU's comment #8756 (comment)
What changes are included in this PR?
Include some comments
Are these changes tested?
No (there are no code changes)
Are there any user-facing changes?
No, these are internal comments only. No code or behavior changes.