-
Notifications
You must be signed in to change notification settings - Fork 1k
Add ability to skip or transform page encoding statistics in Parquet metadata #8797
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
excerpts from new benchmarks Skipping the stats is not any faster than turning them into a mask 😮. |
f069128 to
f004a1f
Compare
| // The outer option acts as a global boolean, so if `skip_encoding_stats.is_some()` | ||
| // is `true` then we're at least skipping some stats. The inner `Option` is a keep | ||
| // list of column indicies to decode. | ||
| skip_encoding_stats: Option<Option<Arc<HashSet<usize>>>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is my solution to per-column options. For huge schemas I didn't want a Vec<bool> that's mostly filled with false. Using Arc so cloning should be faster.
| skip_encoding_stats | behavior |
|---|---|
None |
decode all |
Some<None> |
decode none |
Some<Some<Set>> |
decode if in set |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Querying this causes a noticeable regression even when the outer option is None. I'm going to investigate other options here. Marking as draft until I can get this worked out.
Edit: I spent half a day running this down, and in the end found that merging main into this branch made most of the timing differences go away.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe a bitmask would be the better (but that can be a subsequent PR perhaps)
| #[derive(Default, Debug, Clone)] | ||
| pub struct ParquetMetaDataOptions { | ||
| schema_descr: Option<SchemaDescPtr>, | ||
| encoding_stats_as_mask: bool, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This defaults to false so no behavior change. If we want to enable using the mask by default, we can either implement Default, or change this to encoding_stats_as_vec or something to allow enabling the old behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I personally suggest we: file a ticket / PR to change the default in the next major release
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Filed #8859
6827338 to
9d3350a
Compare
9d3350a to
95a77b4
Compare
| /// }) | ||
| /// } | ||
| /// ``` | ||
| pub fn page_encoding_stats_mask(&self) -> Option<&EncodingMask> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if this should be data_page_encoding_stats_mask (or just data_page_encoding_stats) to make it clear it only has the stats for data pages.
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @etseidl -- I think this is quite neat
Skipping the stats is not any faster than turning them into a mask 😮.
I ran the benchmarks on my machine too and I found the same thing (details below)
Given this observation, what do you think about removing the skip_encoding_stats option? If it makes the API more complicated, and doesn't make decoding faster, why are we adding it?
Running benches/metadata.rs (target/release/deps/metadata-4a5fc91819c7a7e9)
open(default) time: [9.7919 µs 9.7997 µs 9.8074 µs]
Found 5 outliers among 100 measurements (5.00%)
4 (4.00%) high mild
1 (1.00%) high severe
open(page index) time: [164.08 µs 164.17 µs 164.27 µs]
Found 11 outliers among 100 measurements (11.00%)
8 (8.00%) high mild
3 (3.00%) high severe
decode parquet metadata time: [9.2401 µs 9.2482 µs 9.2583 µs]
Found 5 outliers among 100 measurements (5.00%)
3 (3.00%) high mild
2 (2.00%) high severe
decode metadata with schema
time: [5.4648 µs 5.4676 µs 5.4708 µs]
Found 4 outliers among 100 measurements (4.00%)
3 (3.00%) high mild
1 (1.00%) high severe
decode metadata with stats mask
time: [8.6880 µs 8.6920 µs 8.6964 µs]
Found 2 outliers among 100 measurements (2.00%)
2 (2.00%) high mild
decode metadata with skip PES
time: [8.6114 µs 8.6178 µs 8.6249 µs]
Found 10 outliers among 100 measurements (10.00%)
5 (5.00%) high mild
5 (5.00%) high severe
decode parquet metadata (wide)
time: [41.261 ms 41.310 ms 41.360 ms]
decode metadata (wide) with schema
time: [38.399 ms 38.446 ms 38.496 ms]
Found 3 outliers among 100 measurements (3.00%)
2 (2.00%) high mild
1 (1.00%) high severe
decode metadata (wide) with stats mask
time: [37.276 ms 37.335 ms 37.395 ms]
Found 2 outliers among 100 measurements (2.00%)
2 (2.00%) high mild
decode metadata (wide) with skip PES
time: [37.593 ms 37.639 ms 37.686 ms]
| /// Sets page encoding stats for this column chunk. | ||
| pub fn set_page_encoding_stats(mut self, value: Vec<PageEncodingStats>) -> Self { | ||
| self.0.encoding_stats = Some(value); | ||
| self.0.encoding_stats = Some(ParquetPageEncodingStats::Full(value)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be nice here in the comments to call out that setting the stats will override and mask that was set, and vice versa.
| // The outer option acts as a global boolean, so if `skip_encoding_stats.is_some()` | ||
| // is `true` then we're at least skipping some stats. The inner `Option` is a keep | ||
| // list of column indicies to decode. | ||
| skip_encoding_stats: Option<Option<Arc<HashSet<usize>>>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe a bitmask would be the better (but that can be a subsequent PR perhaps)
| #[derive(Default, Debug, Clone)] | ||
| pub struct ParquetMetaDataOptions { | ||
| schema_descr: Option<SchemaDescPtr>, | ||
| encoding_stats_as_mask: bool, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I personally suggest we: file a ticket / PR to change the default in the next major release
| #[derive(Default, Debug, Clone)] | ||
| pub struct ParquetMetaDataOptions { | ||
| schema_descr: Option<SchemaDescPtr>, | ||
| encoding_stats_as_mask: bool, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It took me a little while to grok how encoding_stats_as_mask and skip_encoding_stats were related. But now I see they are mutually exclusive
I think maybe creating a enum rather than Option / Option /etc would be clearer to me but since it is an implementation detail this is also find
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They're not quite mutually exclusive...if you enable stats for a few columns, those stats can be the mask form rather than the vec. I'll make the docs a little clearer.
Well, there's a couple things coming up. I'm working on speeding up the skipping code, and when we have the skip index this will be even faster. There are also other stats to skip in there (chunk But I can also see saying if the stats are enabling page pruning, the cost savings from the pruning should outweigh the cost of decoding the stats that enable that pruning, so don't worry so much and keep this simple for now. We can make it more fine grained later if need be. |
|
Thanks for the review @alamb. I've now refactored a bit and introduced a policy enum for the encoding stats. I like this much more than the earlier API and feel it addresses my concerns about bloat. Please check it out when you have a moment. 🙏 |
Which issue does this PR close?
Rationale for this change
This builds on #8763 to add options to either skip decoding of the page encoding statistics array, or transform it to a bitmask. This gets rid of the last of the heavy allocations in the metadata decoder.
What changes are included in this PR?
Adds new options to
ParquetMetaDataOptions. Also adds more metadata benchmarks.Are these changes tested?
Yes
Are there any user-facing changes?
No, just adds new field to
ColumnChunkMetaData.