alamb
Contributor

@alamb alamb commented Sep 12, 2025

Which issue does this PR close?

Rationale for this change

The current ParquetMetadataDecoder intermixes three things:

  1. The state machine for decoding Parquet metadata (footer, then metadata, then optional indexes)
  2. Orchestrating IO (i.e. calling read, etc.)
  3. Decoding Thrift-encoded bytes into objects

This makes it almost impossible to add features like "only decode a subset of the columns in the ColumnIndex" and other potentially advanced use cases.

Now that we have a "push" style API for metadata decoding that avoids IO, the next step is to extract out the actual work into this API so that the existing ParquetMetadataDecoder just calls into the PushDecoder

What changes are included in this PR?

  1. Extract decoding state machine into PushMetadataDecoder
  2. Update ParquetMetadataDecoder to use the PushMetadataDecoder
  3. Extract the bytes --> object code into its own module
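
To make the three-way split concrete, here is a minimal sketch, not the code in this PR. `PushDecoder`, `Metadata`, `parse_metadata`, and `read_metadata` are hypothetical names chosen for illustration, and the Thrift decoding step is replaced by a plain byte copy. The only Parquet detail it relies on is the 8-byte footer: a 4-byte little-endian metadata length followed by the `PAR1` magic.

```rust
use std::io::{Read, Seek, SeekFrom};
use std::ops::Range;

/// Hypothetical stand-in for the decoded metadata object.
#[derive(Debug, PartialEq)]
pub struct Metadata {
    pub raw: Vec<u8>,
}

/// Bytes -> object. Real code would decode Thrift; this sketch only copies.
fn parse_metadata(bytes: &[u8]) -> Metadata {
    Metadata { raw: bytes.to_vec() }
}

/// Decoder states: footer first, then the metadata body.
enum State {
    Footer,
    Body(Range<u64>),
    Finished,
}

/// State machine only: it never performs IO itself.
pub struct PushDecoder {
    file_len: u64,
    state: State,
}

impl PushDecoder {
    pub fn new(file_len: u64) -> Result<Self, String> {
        if file_len < 8 {
            return Err("file too small for a footer".to_string());
        }
        Ok(Self { file_len, state: State::Footer })
    }

    /// Byte range the caller should fetch next, or `None` once finished.
    pub fn needed(&self) -> Option<Range<u64>> {
        match &self.state {
            State::Footer => Some(self.file_len - 8..self.file_len),
            State::Body(range) => Some(range.clone()),
            State::Finished => None,
        }
    }

    /// Accept the bytes for `needed()`; yields the metadata once complete.
    pub fn push(&mut self, bytes: &[u8]) -> Result<Option<Metadata>, String> {
        let Some(range) = self.needed() else {
            return Err("decoder already finished".to_string());
        };
        if bytes.len() as u64 != range.end - range.start {
            return Err("wrong number of bytes pushed".to_string());
        }
        match self.state {
            State::Footer => {
                if &bytes[4..8] != b"PAR1" {
                    return Err("bad magic".to_string());
                }
                let len = u32::from_le_bytes([bytes[0], bytes[1], bytes[2], bytes[3]]) as u64;
                let end = self.file_len - 8;
                if len > end {
                    return Err("metadata length exceeds file size".to_string());
                }
                self.state = State::Body(end - len..end);
                Ok(None)
            }
            State::Body(_) => {
                self.state = State::Finished;
                Ok(Some(parse_metadata(bytes)))
            }
            State::Finished => Err("decoder already finished".to_string()),
        }
    }
}

/// IO orchestration: a thin loop that feeds the decoder what it asks for.
pub fn read_metadata<R: Read + Seek>(mut reader: R, file_len: u64) -> Result<Metadata, String> {
    let mut decoder = PushDecoder::new(file_len)?;
    while let Some(range) = decoder.needed() {
        let mut buf = vec![0u8; (range.end - range.start) as usize];
        reader.seek(SeekFrom::Start(range.start)).map_err(|e| e.to_string())?;
        reader.read_exact(&mut buf).map_err(|e| e.to_string())?;
        if let Some(metadata) = decoder.push(&buf)? {
            return Ok(metadata);
        }
    }
    Err("decoder finished without producing metadata".to_string())
}
```

With this shape, the synchronous wrapper is only the loop in `read_metadata`. An async reader, or a caller that already holds the bytes in memory, could drive the same `PushDecoder` without touching the state machine.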

This almost certainly will conflict with @etseidl 's plans in thrift-remodel.

Are these changes tested?

Yes, by existing tests.

Are there any user-facing changes?

Not really -- this is an internal change that will make it easier to add features such as "only decode a subset of the columns in the ColumnIndex".

@github-actions github-actions bot added the parquet Changes to the parquet crate label Sep 12, 2025
@alamb alamb changed the title from "Alamb/refactor push decoder" to "Move ParquetMetadata decoder state machine into ParquetMetadataPushDecoder" Sep 12, 2025
@etseidl
Contributor

etseidl commented Sep 12, 2025

> This almost certainly will conflict with @etseidl 's plans in thrift-remodel.

I took a quick peek and I think it won't be too hard to merge this into what I'm doing. It will likely help if I can merge this without any other changes from main to contend with. We can coordinate that pas de deux when the time comes 😅

@alamb
Contributor Author

alamb commented Sep 12, 2025

> > This almost certainly will conflict with @etseidl 's plans in thrift-remodel.
>
> I took a quick peek and I think it won't be too hard to merge this into what I'm doing. It will likely help if I can merge this without any other changes from main to contend with. We can coordinate that pas de deux when the time comes 😅

Cool -- I likewise don't think it will conflict too badly logically, though I think it may generate lots of textual diffs that we'll have to be careful with.

@etseidl
Contributor

etseidl commented Sep 12, 2025

> Cool -- I likewise don't think it will conflict too badly logically, though I think it may generate lots of textual diffs that we'll have to be careful with.

Agreed. Looking forward to this one. I'm hoping for a much more flexible metadata parsing regime after the dust settles.

@etseidl
Contributor

etseidl commented Sep 15, 2025

I just did a test merge of this branch with the head of my remodel branch and it went pretty smoothly. The few conflicts were easily resolved. 🚀

}
}

pub(crate) fn parse_column_index(
Contributor

One option to consider is to move the column and offset index handling to file/metadata/page_index/index_reader.rs. Or I can do that later as part of the thrift remodel. That would keep all the page index parsing in one place.

Contributor Author

That would be great -- I don't know why it is here, but I am quite happy to put it elsewhere if that makes sense

Contributor

Hmm, there are details (the parse_xxx_index methods need access to private fields in the metadata). I guess leave it here for now.
