
Conversation

@etseidl (Contributor) commented Nov 6, 2025

Which issue does this PR close?

Rationale for this change

This builds on #8763 to add options to either skip decoding the page encoding statistics array or transform it into a bitmask. This removes the last of the heavy allocations in the metadata decoder.

What changes are included in this PR?

Adds new options to ParquetMetaDataOptions. Also adds more metadata benchmarks.
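A rough usage sketch of the new options (the builder-style setter name and the module path here are assumptions for illustration, not necessarily the exact API added in this PR):

use parquet::file::metadata::ParquetMetaDataOptions;

fn main() {
    // Illustrative only: ask the metadata decoder to collapse page encoding
    // stats into a bitmask instead of decoding a Vec<PageEncodingStats>
    // per column chunk. The setter name is hypothetical.
    let _options = ParquetMetaDataOptions::default()
        .with_encoding_stats_as_mask(true);
}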

Are these changes tested?

Yes

Are there any user-facing changes?

No, this just adds a new field to ColumnChunkMetaData.

@github-actions bot added the parquet (Changes to the parquet crate) label Nov 6, 2025
@etseidl (Contributor, Author) commented Nov 6, 2025

excerpts from new benchmarks

decode parquet metadata time:   [15.050 µs 15.105 µs 15.164 µs]
decode metadata with schema
                        time:   [7.7038 µs 7.7286 µs 7.7561 µs]
decode metadata with stats mask
                        time:   [14.035 µs 14.100 µs 14.182 µs]
decode metadata with skip PES
                        time:   [13.976 µs 14.016 µs 14.060 µs]
decode parquet metadata (wide)
                        time:   [54.013 ms 54.236 ms 54.468 ms]
decode metadata (wide) with schema
                        time:   [48.399 ms 48.562 ms 48.738 ms]
decode metadata (wide) with stats mask
                        time:   [44.912 ms 45.077 ms 45.253 ms]
decode metadata (wide) with skip PES
                        time:   [44.500 ms 44.616 ms 44.739 ms]

Skipping the stats is not any faster than turning them into a mask 😮.

@etseidl force-pushed the page_enc_stats branch 2 times, most recently from f069128 to f004a1f on November 7, 2025 18:30
Comment on lines 37 to 40
// The outer option acts as a global boolean, so if `skip_encoding_stats.is_some()`
// is `true` then we're at least skipping some stats. The inner `Option` is a keep
// list of column indices to decode.
skip_encoding_stats: Option<Option<Arc<HashSet<usize>>>>,
@etseidl (Contributor, Author) commented:

This is my solution to per-column options. For huge schemas I didn't want a Vec<bool> that's mostly filled with false. Using an Arc so cloning should be cheap.

skip_encoding_stats    behavior
None                   decode all
Some(None)             decode none
Some(Some(set))        decode only columns in the set
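A minimal sketch of how a decoder could interpret this nested option (the helper function is illustrative, not part of the PR):

use std::collections::HashSet;
use std::sync::Arc;

// Illustrative helper: decide whether the page encoding stats for column
// `col_idx` should be decoded under the nested-option scheme above.
fn should_decode_stats(
    skip_encoding_stats: &Option<Option<Arc<HashSet<usize>>>>,
    col_idx: usize,
) -> bool {
    match skip_encoding_stats {
        // No skipping requested: decode everything.
        None => true,
        // Skipping requested with no keep list: decode nothing.
        Some(None) => false,
        // Keep list present: decode only columns in the set.
        Some(Some(keep)) => keep.contains(&col_idx),
    }
}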

@etseidl (Contributor, Author) commented Nov 13, 2025:

Querying this causes a noticeable regression even when the outer option is None. I'm going to investigate other options here. Marking as draft until I can get this worked out.

Edit: I spent half a day running this down, and in the end found that merging main into this branch made most of the timing differences go away.

Contributor commented:

Maybe a bitmask would be better (but that could be a subsequent PR, perhaps).

#[derive(Default, Debug, Clone)]
pub struct ParquetMetaDataOptions {
schema_descr: Option<SchemaDescPtr>,
encoding_stats_as_mask: bool,
@etseidl (Contributor, Author) commented:

This defaults to false, so there is no behavior change. If we want to enable the mask by default, we can either implement Default by hand or rename this to encoding_stats_as_vec (or similar) to allow opting back into the old behavior.
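For illustration, flipping the default later would only require a hand-written Default in place of the derived one. A sketch with an abbreviated field set (the real struct carries more options):

use parquet::schema::types::SchemaDescPtr;

// Abbreviated field set, for illustration only.
pub struct ParquetMetaDataOptions {
    schema_descr: Option<SchemaDescPtr>,
    encoding_stats_as_mask: bool,
}

// Hand-written Default that turns the mask form on by default
// (the derived Default would leave `encoding_stats_as_mask` as `false`).
impl Default for ParquetMetaDataOptions {
    fn default() -> Self {
        Self {
            schema_descr: None,
            encoding_stats_as_mask: true,
        }
    }
}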

Contributor commented:

I personally suggest we file a ticket / PR to change the default in the next major release.

@etseidl (Contributor, Author) commented:

Filed #8859

@etseidl marked this pull request as ready for review November 11, 2025 18:30
@etseidl changed the title from "[WIP] Add ability to skip or transform page encoding statistics in Parquet metadata" to "Add ability to skip or transform page encoding statistics in Parquet metadata" Nov 11, 2025
/// })
/// }
/// ```
pub fn page_encoding_stats_mask(&self) -> Option<&EncodingMask> {
@etseidl (Contributor, Author) commented:

I wonder if this should be data_page_encoding_stats_mask (or just data_page_encoding_stats) to make it clear it only has the stats for data pages.
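For context on what the mask holds: it collapses the per-page stats into one bit per encoding seen among the chunk's data pages. An illustrative stand-in (the real EncodingMask in this PR may differ in shape and API):

use parquet::basic::Encoding;

// Illustrative only: one bit per Encoding variant, set if any data page
// in the column chunk used that encoding.
#[derive(Default, Clone, Copy)]
struct IllustrativeEncodingMask(u32);

impl IllustrativeEncodingMask {
    // Record that an encoding was seen in some data page.
    fn insert(&mut self, enc: Encoding) {
        self.0 |= 1u32 << (enc as u32);
    }
    // Check whether any data page used the given encoding.
    fn contains(&self, enc: Encoding) -> bool {
        self.0 & (1u32 << (enc as u32)) != 0
    }
}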

@etseidl marked this pull request as draft November 13, 2025 18:06
@etseidl marked this pull request as ready for review November 13, 2025 23:57
@alamb (Contributor) left a comment:

Thanks @etseidl -- I think this is quite neat

Skipping the stats is not any faster than turning them into a mask 😮.

I ran the benchmarks on my machine too and I found the same thing (details below)

Given this observation, what do you think about removing the skip_encoding_stats option? If it makes the API more complicated, and doesn't make decoding faster, why are we adding it?

     Running benches/metadata.rs (target/release/deps/metadata-4a5fc91819c7a7e9)
open(default)           time:   [9.7919 µs 9.7997 µs 9.8074 µs]
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

open(page index)        time:   [164.08 µs 164.17 µs 164.27 µs]
Found 11 outliers among 100 measurements (11.00%)
  8 (8.00%) high mild
  3 (3.00%) high severe

decode parquet metadata time:   [9.2401 µs 9.2482 µs 9.2583 µs]
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

decode metadata with schema
                        time:   [5.4648 µs 5.4676 µs 5.4708 µs]
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

decode metadata with stats mask
                        time:   [8.6880 µs 8.6920 µs 8.6964 µs]
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

decode metadata with skip PES
                        time:   [8.6114 µs 8.6178 µs 8.6249 µs]
Found 10 outliers among 100 measurements (10.00%)
  5 (5.00%) high mild
  5 (5.00%) high severe

decode parquet metadata (wide)
                        time:   [41.261 ms 41.310 ms 41.360 ms]

decode metadata (wide) with schema
                        time:   [38.399 ms 38.446 ms 38.496 ms]
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

decode metadata (wide) with stats mask
                        time:   [37.276 ms 37.335 ms 37.395 ms]
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

decode metadata (wide) with skip PES
                        time:   [37.593 ms 37.639 ms 37.686 ms]

/// Sets page encoding stats for this column chunk.
pub fn set_page_encoding_stats(mut self, value: Vec<PageEncodingStats>) -> Self {
-        self.0.encoding_stats = Some(value);
+        self.0.encoding_stats = Some(ParquetPageEncodingStats::Full(value));
Contributor commented:

It might be nice here in the comments to call out that setting the stats will override any mask that was set, and vice versa.
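One possible wording (a doc-comment sketch only; the companion mask setter name `set_page_encoding_stats_mask` is an assumption):

/// Sets full page encoding stats for this column chunk.
///
/// Note: the full stats and the mask form share one underlying field, so
/// calling this replaces any mask previously set via
/// `set_page_encoding_stats_mask`, and vice versa.
pub fn set_page_encoding_stats(mut self, value: Vec<PageEncodingStats>) -> Self {
    self.0.encoding_stats = Some(ParquetPageEncodingStats::Full(value));
    self
}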

#[derive(Default, Debug, Clone)]
pub struct ParquetMetaDataOptions {
schema_descr: Option<SchemaDescPtr>,
encoding_stats_as_mask: bool,
Contributor commented:

It took me a little while to grok how encoding_stats_as_mask and skip_encoding_stats are related, but now I see they are mutually exclusive.

I think creating an enum rather than nested Options would be clearer to me, but since it is an implementation detail this is also fine.

@etseidl (Contributor, Author) commented:

They're not quite mutually exclusive... if you enable stats for only a few columns, those stats can be in the mask form rather than the vec. I'll make the docs a little clearer.

@etseidl (Contributor, Author) commented Nov 17, 2025:

Given this observation, what do you think about removing the skip_encoding_stats option? If it makes the API more complicated, and doesn't make decoding faster, why are we adding it?

Well, there are a couple of things coming up. I'm working on speeding up the skipping code, and when we have the skip index this will be even faster. There are also other stats to skip in there (chunk Statistics, size stats, geo stats, bloom filter pointers). We could have a single option to skip all of them, but I can see wanting to enable some and not others depending on the use case. Filtering on a sorted column would want the chunk stats, while filtering on an unsorted column might want the bloom filter but not the other stats. If I want a size estimate for planning purposes but don't plan on filtering, I would want size stats and nothing else. I think we should support all of these. I can also see wanting different options for different columns in a single query.

But I can also see the argument that if the stats enable page pruning, the cost savings from the pruning should outweigh the cost of decoding the stats that enable it, so don't worry so much and keep this simple for now. We can make it more fine-grained later if need be.

@etseidl (Contributor, Author) commented Nov 17, 2025:

Thanks for the review @alamb. I've now refactored a bit and introduced a policy enum for the encoding stats. I like this much more than the earlier API and feel it addresses my concerns about bloat. Please check it out when you have a moment. 🙏
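For readers following along, a policy enum along these lines would cover the three behaviors from the earlier table (naming and shape are illustrative; the enum actually introduced may differ):

// Illustrative sketch of a per-file policy for decoding page encoding stats;
// not necessarily the exact enum added in the PR.
#[derive(Debug, Clone, Copy, Default, PartialEq, Eq)]
pub enum PageEncodingStatsPolicy {
    /// Decode the full Vec<PageEncodingStats> for every column (current default).
    #[default]
    Full,
    /// Decode the stats, but collapse them into a per-chunk encoding bitmask.
    Mask,
    /// Skip decoding page encoding stats entirely.
    Skip,
}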


Labels

parquet Changes to the parquet crate


Development

Successfully merging this pull request may close these issues.

Reduce allocations in ParquetMetaData for improved performance

2 participants