Move ParquetMetadata decoder state machine into ParquetMetadataPushDecoder #8340
I took a quick peek and I think it won't be too hard to merge this into what I'm doing. It will likely help if I can merge this without any other changes from main to contend with. We can coordinate that pas de deux when the time comes 😅
Cool -- I likewise don't think it will conflict too badly logically, though I think it may generate lots of textual diffs that we'll have to be careful with.
Agreed. Looking forward to this one. I'm hoping for a much more flexible metadata parsing regime after the dust settles.
I just did a test merge of this branch with the head of my remodel branch and it went pretty smoothly. The few conflicts were easily resolved. 🚀
}
}

pub(crate) fn parse_column_index(
One option to consider is to move the column and offset index handling to file/metadata/page_index/index_reader.rs. Or I can do that later as part of the thrift remodel. That would keep all the page index parsing in one place.
That would be great -- I don't know why it is here, but I am quite happy to put it elsewhere if that makes sense
Hmm, there are details (the parse_xxx_index methods need access to private fields in the metadata). I guess leave it here for now.
Which issue does this PR close?
Rationale for this change
The current ParquetMetadataDecoder intermixes three things:
This makes it almost impossible to add features like "only decode a subset of the columns in the ColumnIndex" and other potentially advanced use cases.
Now that we have a "push" style API for metadata decoding that avoids IO, the next step is to move the actual decoding work into that API, so that the existing ParquetMetadataDecoder just calls into the PushDecoder.
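To illustrate the shape of this split, here is a minimal sketch of a push-style decoder and a thin driver that performs the "IO" on its behalf. It is not the code in this PR: `ToyPushDecoder`, `DecodeResult`, `State`, `push_range`, `try_decode` and `decode_from_slice` are hypothetical names invented for the example, and the "metadata" is treated as opaque bytes rather than Thrift. The only detail taken from the Parquet format is the 8-byte footer (a 4-byte little-endian metadata length followed by the magic bytes PAR1).

```rust
use std::ops::Range;

// Parquet footer: 4-byte little-endian metadata length, then b"PAR1".
const FOOTER_LEN: u64 = 8;

/// Result of one decode attempt (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
enum DecodeResult {
    /// The decoder cannot continue until the caller supplies this byte range.
    NeedsData(Range<u64>),
    /// Decoding finished; the payload is the raw metadata bytes.
    Finished(Vec<u8>),
}

/// The state machine lives entirely inside the decoder.
#[derive(Debug, Clone)]
enum State {
    ReadingFooter,
    ReadingMetadata(Range<u64>),
}

/// Hypothetical push decoder: it never performs IO itself.
struct ToyPushDecoder {
    file_len: u64,
    state: State,
    buffers: Vec<(u64, Vec<u8>)>, // (file offset of first byte, bytes)
}

impl ToyPushDecoder {
    fn new(file_len: u64) -> Self {
        Self { file_len, state: State::ReadingFooter, buffers: Vec::new() }
    }

    /// Hand the decoder bytes that the caller fetched, starting at `start`.
    fn push_range(&mut self, start: u64, bytes: Vec<u8>) {
        self.buffers.push((start, bytes));
    }

    /// Return the bytes for `range` if a single pushed buffer covers it.
    fn get(&self, range: &Range<u64>) -> Option<&[u8]> {
        self.buffers.iter().find_map(|(start, bytes)| {
            let end = start + bytes.len() as u64;
            if *start <= range.start && range.end <= end {
                let lo = (range.start - start) as usize;
                let hi = (range.end - start) as usize;
                Some(&bytes[lo..hi])
            } else {
                None
            }
        })
    }

    /// Advance as far as the buffered data allows, or report what is missing.
    fn try_decode(&mut self) -> Result<DecodeResult, String> {
        loop {
            match self.state.clone() {
                State::ReadingFooter => {
                    if self.file_len < FOOTER_LEN {
                        return Err("file too small for a footer".to_string());
                    }
                    let footer = (self.file_len - FOOTER_LEN)..self.file_len;
                    let Some(buf) = self.get(&footer) else {
                        return Ok(DecodeResult::NeedsData(footer));
                    };
                    if &buf[4..8] != b"PAR1" {
                        return Err("bad magic bytes".to_string());
                    }
                    let len = u32::from_le_bytes([buf[0], buf[1], buf[2], buf[3]]) as u64;
                    if len > footer.start {
                        return Err("metadata length exceeds file size".to_string());
                    }
                    self.state = State::ReadingMetadata((footer.start - len)..footer.start);
                }
                State::ReadingMetadata(range) => {
                    return Ok(match self.get(&range) {
                        Some(buf) => DecodeResult::Finished(buf.to_vec()),
                        None => DecodeResult::NeedsData(range),
                    });
                }
            }
        }
    }
}

/// Thin driver: all "IO" (here, slicing an in-memory file) happens outside
/// the decoder, which is the role the existing reader would play.
fn decode_from_slice(file: &[u8]) -> Result<Vec<u8>, String> {
    let mut decoder = ToyPushDecoder::new(file.len() as u64);
    loop {
        match decoder.try_decode()? {
            DecodeResult::Finished(metadata) => return Ok(metadata),
            DecodeResult::NeedsData(range) => {
                let bytes = file[range.start as usize..range.end as usize].to_vec();
                decoder.push_range(range.start, bytes);
            }
        }
    }
}
```

Because the driver only answers "needs data" requests, a sync reader, an async reader, or a cache-backed reader could in principle share the same decoder and differ only in how they fetch the requested ranges.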
What changes are included in this PR?
This will almost certainly conflict with @etseidl's plans in thrift-remodel.
Are these changes tested?
By existing tests.
Are there any user-facing changes?
Not really -- this is an internal change that will make it easier to add features like "only decode a subset of the columns in the ColumnIndex", for example.