Move ParquetMetadata decoder state machine into ParquetMetadataPushDecoder #8340

alamb · 2025-09-12T19:46:51Z

Which issue does this PR close?

part of [Epic] Parquet Reader Improvement Plan / Proposal - July 2025 #8000
Follow on to [Parquet] Add ParquetMetadataPushDecoder #8080

Rationale for this change

The current ParquetMetadataDecoder intermixes three things:

The state machine for decoding Parquet metadata (footer, then metadata, then (optional) indexes)
Orchestrating IO (i.e. calling read, etc.)
Decoding Thrift-encoded bytes into objects

This makes it almost impossible to add features like "only decode a subset of the columns in the ColumnIndex" and other potentially advanced use cases.

Now that we have a "push" style API for metadata decoding that avoids IO, the next step is to extract the actual work into this API so that the existing ParquetMetadataDecoder just calls into the PushDecoder.
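
For readers unfamiliar with the "push" style, here is a minimal sketch of an IO-free decoder state machine. It is not the code in this PR: `ToyPushDecoder`, `DecodeResult`, `State`, and `decode_from_slice` are hypothetical names, the "metadata" is returned as raw bytes instead of being Thrift-decoded, and the optional index step is left out. The only Parquet fact it relies on is the unencrypted footer layout (a 4-byte little-endian metadata length followed by the `PAR1` magic).

```rust
use std::ops::Range;

/// Parquet footer: 4-byte little-endian metadata length + 4-byte magic.
const FOOTER_LEN: u64 = 8;

/// What the decoder reports back (hypothetical type, not the arrow-rs one).
#[derive(Debug, PartialEq)]
pub enum DecodeResult {
    /// The caller must fetch this byte range and push it.
    NeedsData(Range<u64>),
    /// Decoding finished; the toy "metadata" is just the raw bytes.
    Finished(Vec<u8>),
}

/// Decoding phases: footer first, then the metadata it points at.
#[derive(Clone)]
enum State {
    ReadingFooter,
    ReadingMetadata(Range<u64>),
    Done,
}

pub struct ToyPushDecoder {
    file_len: u64,
    state: State,
    /// Byte ranges the caller has pushed so far.
    buffers: Vec<(Range<u64>, Vec<u8>)>,
}

impl ToyPushDecoder {
    pub fn new(file_len: u64) -> Self {
        Self { file_len, state: State::ReadingFooter, buffers: Vec::new() }
    }

    /// Hand bytes to the decoder; the decoder itself never performs IO.
    pub fn push_range(&mut self, range: Range<u64>, data: Vec<u8>) {
        self.buffers.push((range, data));
    }

    /// Return the requested bytes if a single pushed buffer covers them.
    fn get(&self, range: &Range<u64>) -> Option<&[u8]> {
        self.buffers.iter().find_map(|(have, data)| {
            if have.start <= range.start && range.end <= have.end {
                let offset = (range.start - have.start) as usize;
                let len = (range.end - range.start) as usize;
                data.get(offset..offset + len)
            } else {
                None
            }
        })
    }

    /// Advance as far as the buffered bytes allow.
    pub fn try_decode(&mut self) -> Result<DecodeResult, String> {
        loop {
            match self.state.clone() {
                State::ReadingFooter => {
                    if self.file_len < FOOTER_LEN {
                        return Err("file too short for a footer".to_string());
                    }
                    let range = self.file_len - FOOTER_LEN..self.file_len;
                    let Some(footer) = self.get(&range) else {
                        return Ok(DecodeResult::NeedsData(range));
                    };
                    if !footer.ends_with(b"PAR1") {
                        return Err("bad magic in footer".to_string());
                    }
                    let len =
                        u32::from_le_bytes([footer[0], footer[1], footer[2], footer[3]]) as u64;
                    if len > range.start {
                        return Err("metadata length exceeds file size".to_string());
                    }
                    // The metadata sits immediately before the footer.
                    self.state = State::ReadingMetadata(range.start - len..range.start);
                }
                State::ReadingMetadata(range) => {
                    let Some(bytes) = self.get(&range) else {
                        return Ok(DecodeResult::NeedsData(range));
                    };
                    let metadata = bytes.to_vec();
                    self.state = State::Done;
                    return Ok(DecodeResult::Finished(metadata));
                }
                State::Done => return Err("decoder already finished".to_string()),
            }
        }
    }
}

/// Drive the decoder from an in-memory "file"; slicing stands in for real IO.
pub fn decode_from_slice(file: &[u8]) -> Result<Vec<u8>, String> {
    let mut decoder = ToyPushDecoder::new(file.len() as u64);
    loop {
        match decoder.try_decode()? {
            DecodeResult::NeedsData(range) => {
                let bytes = file[range.start as usize..range.end as usize].to_vec();
                decoder.push_range(range, bytes);
            }
            DecodeResult::Finished(metadata) => return Ok(metadata),
        }
    }
}
```

Because the state machine only ever asks for byte ranges, the same decoder can sit behind a synchronous reader, an async reader, or an object store client, and a feature such as partial index decoding would only need to touch the decoder.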

What changes are included in this PR?

Extract decoding state machine into PushMetadataDecoder
Update ParquetMetadataDecoder to use the PushMetadataDecoder
Extract the bytes --> object code into its own module
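
As a rough illustration of the second item, the IO-performing side can shrink to a loop that fetches whatever range the push decoder asks for. This is a sketch under assumptions, not the code in this PR: the `PushDecode` trait, the `Step` enum, `decode_with_reader`, and the stand-in `TailDecoder` are all hypothetical names invented for the example.

```rust
use std::io::{Read, Seek, SeekFrom};
use std::ops::Range;

/// Outcome of one decoding attempt (hypothetical type).
pub enum Step<T> {
    /// More bytes are needed from this range of the file.
    NeedsData(Range<u64>),
    /// Decoding is complete.
    Finished(T),
}

/// Hypothetical minimal interface of an IO-free push decoder.
pub trait PushDecode {
    type Output;
    fn push_range(&mut self, range: Range<u64>, data: Vec<u8>);
    fn try_decode(&mut self) -> Result<Step<Self::Output>, String>;
}

/// The only place that performs IO: read what the decoder requests, then retry.
pub fn decode_with_reader<R: Read + Seek, D: PushDecode>(
    reader: &mut R,
    decoder: &mut D,
) -> Result<D::Output, String> {
    loop {
        match decoder.try_decode()? {
            Step::Finished(output) => return Ok(output),
            Step::NeedsData(range) => {
                let mut buf = vec![0u8; (range.end - range.start) as usize];
                reader
                    .seek(SeekFrom::Start(range.start))
                    .map_err(|e| e.to_string())?;
                reader.read_exact(&mut buf).map_err(|e| e.to_string())?;
                decoder.push_range(range, buf);
            }
        }
    }
}

/// Stand-in decoder for the example: it simply wants the last `tail_len` bytes.
pub struct TailDecoder {
    pub file_len: u64,
    pub tail_len: u64,
    pub tail: Option<Vec<u8>>,
}

impl PushDecode for TailDecoder {
    type Output = Vec<u8>;

    fn push_range(&mut self, _range: Range<u64>, data: Vec<u8>) {
        self.tail = Some(data);
    }

    fn try_decode(&mut self) -> Result<Step<Vec<u8>>, String> {
        if self.tail_len > self.file_len {
            return Err("file shorter than requested tail".to_string());
        }
        match self.tail.take() {
            Some(bytes) => Ok(Step::Finished(bytes)),
            None => Ok(Step::NeedsData(self.file_len - self.tail_len..self.file_len)),
        }
    }
}
```

With this split, an async variant would replace only `decode_with_reader`; the decoder implementation stays untouched.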

This almost certainly will conflict with @etseidl 's plans in thrift-remodel.

Are these changes tested?

Yes, by existing tests.

Are there any user-facing changes?

Not really -- this is an internal change that will make it easier to add features like "only decode a subset of the columns in the ColumnIndex", for example.

etseidl · 2025-09-12T20:11:21Z

This almost certainly will conflict with @etseidl 's plans in thrift-remodel.

I took a quick peek and I think it won't be too hard to merge this into what I'm doing. It will likely help if I can merge this without any other changes from main to contend with. We can coordinate that pas de deux when the time comes 😅

alamb · 2025-09-12T20:14:10Z

This almost certainly will conflict with @etseidl 's plans in thrift-remodel.

I took a quick peek and I think it won't be too hard to merge this into what I'm doing. It will likely help if I can merge this without any other changes from main to contend with. We can coordinate that pas de deux when the time comes 😅

Cool -- I likewise don't think it will conflict too badly logically, though I think it may generate lots of textual diffs that we'll have to be careful with.

etseidl · 2025-09-12T20:17:56Z

Cool -- I likewise don't think it will conflict too badly logically, though I think it may generate lots of textual diffs that we'll have to be careful with.

Agreed. Looking forward to this one. I'm hoping for a much more flexible metadata parsing regime after the dust settles.

etseidl · 2025-09-15T16:04:55Z

I just did a test merge of this branch with the head of my remodel branch and it went pretty smoothly. The few conflicts were easily resolved. 🚀

etseidl · 2025-09-15T16:30:06Z

parquet/src/file/metadata/parser.rs

+    }
+}
+
+pub(crate) fn parse_column_index(


One option to consider is to move the column and offset index handling to file/metadata/page_index/index_reader.rs. Or I can do that later as part of the thrift remodel. That would keep all the page index parsing in one place.

That would be great -- I don't know why it is here, but I am quite happy to put it elsewhere if that makes sense

Hmm, there are details (the parse_xxx_index methods need access to private fields in the metadata). I guess leave it here for now.

alamb added 4 commits September 12, 2025 13:50

Move state machine into ParquetMetadataDecoder

b28ac8c

checkpoint

8a7a993

Move code around

719bcb4

checkpoint

fbb879a

github-actions bot added the parquet Changes to the parquet crate label Sep 12, 2025

alamb changed the title ~~Alamb/refactor push decoder~~ Move ParquetMetadata decoder state machine into ParquetMetadataPushDecoder Sep 12, 2025

This was referenced Sep 12, 2025

[thrift-remodel] Begin replacing file metadata reader and convert footer decryption code #8313

Merged

[Parquet] Add ParquetMetadataPushDecoder #8080

Merged

alamb added 7 commits September 12, 2025 16:28

remove dead code

01871ee

Remove more redundancy

b77c6f5

move more code

93c08d3

fixups

db9ecc4

Add metadata parsing

c411e3e

tests passing

a768eaf

clippy

e0de537

etseidl reviewed Sep 15, 2025
