-
Notifications
You must be signed in to change notification settings - Fork 1k
[Variant] Support read-only metadata builders #8208
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
The docs error will self-resolve once #8206 merges:
|
🤖 |
🤖: Benchmark completed Details
|
🤖 |
🤖: Benchmark completed Details
|
🤖 |
🤖: Benchmark completed Details
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @scovich -- this makes sense to me. I had a suggestion regarding naming, but overall it looks really nice 👌
cc @codephage2020 and @klion26, perhaps you would like to review this too
parquet-variant/src/builder.rs
Outdated
/// Builder for constructing metadata for [`Variant`] values. | ||
/// | ||
/// This is used internally by the [`VariantBuilder`] to construct the metadata | ||
/// | ||
/// You can use an existing `Vec<u8>` as the metadata buffer by using the `from` impl. | ||
#[derive(Default, Debug)] | ||
struct MetadataBuilder { | ||
struct BasicMetadataBuilder { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking that Basic is a bit generic and it is somewhat unclear what the difference between a "ReadOnly" builder and a "Basic" builder are.
Here are some other ideas for names:
WriteableMetadataBuilder
-- can update the metadata (this is my preference)OwnedMetadataBuilder
--- the builder owns the underlying structures (can thus can make changes)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Went with WritableMetadataBuilder
.
@@ -589,14 +675,14 @@ enum ParentState<'a> { | |||
Variant { | |||
value_builder: &'a mut ValueBuilder, | |||
saved_value_builder_offset: usize, | |||
metadata_builder: &'a mut MetadataBuilder, | |||
metadata_builder: &'a mut dyn MetadataBuilder, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
using &dyn
here I think means all accesses now need a level of indirection (and I think we can see this slowdown of a few percent in the benchmarks)
I wonder if (maybe as a follow on PR) we can consider making parent state generic and squeezing that performance back. It may also be premature optimization for now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I was trying to avoid the "pollution" that generics would cause, which is clearly visible in my first attempt:
It's not just ParentState
, but also ListBuilder
, ObjectBuilder
and even the VariantBuilderExt
trait. Plus it requires a bunch of methods that are only defined for specific combinations of types.
But we may need to go generic now, because VariantArrayVariantBuilder::finish
needs access to BasicMetadataBuilder::finish
in order to work correctly, and that's unavailable because the parent state has upcast it to &mut dyn MetadataBuilder
.
I initially reached for std::any::AsAny
, but it turns out that types with non-static lifetimes cannot implement std::any::Any
because the lifetime information would be lost.
In order to get the latest merge with upstream to compile, I had to define a new MetadataBuilder::finish
trait method to expose the underlying BasicMetadataBuilder::finish
, but that also forced a no-op ReadOnlyMetadataBuilder::finish
.
Honestly, the generic mess is messy enough that I lean strongly toward keeping the dyn
indirection unless we're really certain we need generics for performance or functionality reasons.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A related issue we'll hit shortly: VariantArrayBuilder
is currently hard-wired to use a BasicMetadataBuilder
that produces a new metadata column.
Once we need a variant array builder that reuses an existing metadata column (e.g. for unshredding), we'll be forced to decide whether we want to make VariantArrayBuilder
generic over M: MetadataBuilder
or just define a second builder class.
I suspect that choice will ultimately decide the generic-vs-not question.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose we could also avoid generics altogether by defining our own versions of Any
and AsAny
that preserve lifetime info... but they would become part of the public arrow-rs API which seems awkward.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Honestly, the generic mess is messy enough that I lean strongly toward keeping the dyn indirection unless we're really certain we need generics for performance or functionality reasons.
I think this is a wise strategy. Let's wait for some more "end to end" type benchmarks (like reading/writing JSON to arrays, or shredding variants) and see if we need to try and squeeze a few more percent out
(this PR also needs an update to resolve some merge conflcits) |
Thanks again @scovich |
Sorry for the late reply, I just came back from out yesterday and just saw the notifications. There is no more comment from me. |
Which issue does this PR close?
Rationale for this change
When manipulating existing variant values (unshredding, removing fields, etc), the metadata column is already defined and already contains all necessary field ids. In fact, defining new/different field ids would require rewriting the bytes of those already-encoded variant values. We need a way to build variant values that rely on an existing metadata dictionary.
What changes are included in this PR?
MetadataBuilder
is now a trait, and most methods that work with metadata builders now take&mut dyn MetadataBuilder
instead of&mut MetadataBuilder
.MetadataBuilder
struct is nowBasicMetadataBuilder
that implementsMetadataBuilder
ReadOnlyMetadataBuilder
that wraps aVariantMetadata
and which also implementsMetadataBuilder
try_binary_search_range_by
helper method to be more general, so we can define an efficientVariantMetadata::get_entry
that returns the field id for a given field name.Are these changes tested?
Existing tests cover the basic metadata builder. New tests added to cover the read-only metadata builder.
Are there any user-facing changes?
The renamed
BasicMetadataBuilder
(breaking), the newMetadataBuilder
trait (breaking), and the newReadOnlyMetadataBuilder
.