-
Notifications
You must be signed in to change notification settings - Fork 1k
[WIP][Variant] Very rough pathfinding for variant get/shredding #8122
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dropping some notes and observations I took away from this exercise.
// Find the value field, if present | ||
let value = inner | ||
.column_by_name("value") | ||
.map(|v| { | ||
v.as_binary_view_opt().ok_or_else(|| { | ||
ArrowError::NotYetImplemented(format!( | ||
"VariantArray 'value' field must be BinaryView, got {}", | ||
v.data_type() | ||
)) | ||
}) | ||
}) | ||
.transpose()?; | ||
|
||
// Find the typed_value field, if present | ||
let typed_value = inner.column_by_name("typed_value"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Factored out into the new VariantArray::from_parts
function below.
inner: StructArray, | ||
|
||
/// The metadata column of this variant | ||
metadata: BinaryViewArray, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this pathfinding, ShreddingState
-- already used for referring to both top-level variant columns (fields: metadata, value, typed_value) -- also proved useful for referring to nested shredded variant columns (fields: value, typed_value).
So I hoisted out the metadata
column to unlock that new use case.
// This would be a lot simpler if ShreddingState were just a pair of Option... we already | ||
// have everything we need. | ||
let inner = builder.build(); | ||
let shredding_state = ShreddingState::try_new(&inner).unwrap(); // valid by construction |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically not quite "valid by construction" today, because the enum can't handle the (None, None) case yet.
But it will, soon enough.
Variant::new(self.metadata.value(index), value.value(index)) | ||
} | ||
ShreddingState::Typed { typed_value, .. } => { | ||
ShreddingState::PerfectlyShredded { typed_value, .. } => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See commentary below for an explanation of these renamed enum variants.
/// One shredded field of a partially or prefectly shredded variant. For example, suppose the | ||
/// shredding schema for variant `v` treats it as an object with a single field `a`, where `a` is | ||
/// itself a struct with the single field `b` of type INT. Then the physical layout of the column | ||
/// is: | ||
/// | ||
/// ```text | ||
/// v: VARIANT { | ||
/// metadata: BINARY, | ||
/// value: BINARY, | ||
/// typed_value: STRUCT { | ||
/// a: SHREDDED_VARIANT_FIELD { | ||
/// value: BINARY, | ||
/// typed_value: STRUCT { | ||
/// a: SHREDDED_VARIANT_FIELD { | ||
/// value: BINARY, | ||
/// typed_value: INT, | ||
/// }, | ||
/// }, | ||
/// }, | ||
/// }, | ||
/// } | ||
/// ``` | ||
/// | ||
/// In the above, each row of `v.value` is either a variant value (shredding failed, `v` was not an | ||
/// object at all) or a variant object (partial shredding, `v` was an object but included unexpected | ||
/// fields other than `a`), or is NULL (perfect shredding, `v` was an object containing only the | ||
/// single expected field `a`). | ||
/// | ||
/// A similar story unfolds for each `v.typed_value.a.value` -- a variant value if shredding failed | ||
/// (`v:a` was not an object at all), or a variant object (`v:a` was an object with unexpected | ||
/// additional fields), or NULL (`v:a` was an object containing only the single expected field `b`). | ||
/// | ||
/// Finally, `v.typed_value.a.typed_value.b.value` is either NULL (`v:a.b` was an integer) or else a | ||
/// variant value. | ||
pub struct ShreddedVariantFieldArray { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a marginally useful new Array that represents a shredded variant field. Unlike the top-level variant, it has only (value?, typed_value?)
fields -- no metadata. The variant shredding spec mandates this, and I hesitate to "just" reuse VariantArray
with its metadata
field. Writing that out to a parquet file would violate the spec, which seems like too big of a footgun and a large departure from typical arrow arrays that physically match the structure of the underlying parquet.
I say "marginally useful" because:
- Without a
metadata
column, one cannot actually define avalue
method for it. - The pathfinding code in this PR only uses it as a thin container -- all the code is built around the shared
ShreddingState
concept. - There is only one use site, in
follow_shredded_path_element
, so we could work directly withStructArray
and just validate it manually.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From @alamb --
If we could somehow figure out to make this be VariantArray rather than ShreddedVariantFieldArray I think that would be the most elegant / understandable
That was my initial reaction as well. But the metadata
field becomes problematic. Do we make it optional? Require it anyway and risk writing out invalid parquet if somebody forgets to strip it out or ignore it? etc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another thought: The variant shredding spec doesn't name the type of the group that represents shredded object fields, but it's definitely NOT VARIANT. It's some other thing that must never have a metadata
column and which must always nest inside a VARIANT type that provides the metadata
column it depends on. And it's required
where VARIANT is optional
. And it allows both value
and typed_value
to be NULL/missing where VARIANT forbids that combination.
In other words the two are really not the same thing -- other than they happen to contain two fields with the same name -- and I worry that if we try to conflate the two we're just going to cause a lot of trouble for ourselves.
Edit: Tabular form might be easier to digest:
VARIANT | Object field | Array element | |
---|---|---|---|
metadata | required | forbidden | forbidden |
group type | optional |
required |
required |
NULL/NULL | forbidden | allowed | forbidden |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NOTE: Array elements have almost the same funky properties as object fields: required
and no metadata
, but they forbid the NULL/NULL combo that object fields allow.
|
||
/// Builder that extracts a struct. Casting failures produce NULL or error according to options. | ||
#[allow(unused)] | ||
struct StructVariantShreddingRowBuilder { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Terrible name, but I couldn't think of a better one. Even worse name below for partially shredded structs...
|
||
/// Used for actual shredding of binary variant values into shredded variant values | ||
#[allow(unused)] | ||
struct ShreddedVariantRowBuilder { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This implementation is very incomplete, but hopefully has enough of a sketch to give an idea of what the implementation should look like? Solving it requires a way of throwing around variant bytes, so e.g. we can iterate over a stream of variant fields and filter out the unwanted ones to build a new variant object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would this also be used to shred unshredded variant values to shredded arrays?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's exactly what it would be used for. If the user passed a shredded variant type to variant_get
and the path ends inside a binary variant, we have no choice but to shred accordingly (**) (***).
If the user just passed a concrete type to variant_get
, and the path ends inside a binary variant, we would instead use e.g. PrimitiveVariantShreddingRowBuilder
or StructVariantShreddingRowBuilder
.
(**) Things get complicated fast, if the user passed a shredded variant type to variant_get
and the path ends inside a shredded variant with a different schema. This pathfinding PR didn't even touch that case, but it would probably look vaguely similar to the struct specialization, where we try to follow the common/compatible paths of the two shredding schemas (in hopes of returning unmodified columns directly) and fall back to actual builders only where the paths diverge.
(***) If the user passed a binary variant schema and the underlying data were shredded, then we have to use a BinaryVariantRowBuilder
to insert the variant bytes produced by e.g. VariantArray::value
.
Option<&'a mut dyn VariantShreddingRowBuilder>, | ||
), | ||
> { | ||
std::iter::empty() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
std::iter::empty() | |
std::iter::empty() // TODO |
struct VariantShreddingStructRowBuilder { | ||
metadata: arrow::array::BinaryViewArray, | ||
nulls: NullBufferBuilder, | ||
value_builder: BinaryVariantRowBuilder, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a placeholder... I'm not quite sure what form the value builder (and its corresponding null masks) should actually take?
} | ||
} | ||
|
||
impl<'a> From<&[VariantPathElement<'a>]> for VariantPath<'a> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IntoIterator
(below) requires values, and iterators to slice references return references.
I added a link to #8083 to the description of this PR |
I plan to review this first thing tomorrow morning |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @scovich
TLDR is that this PR looks like it is on the right track to me. Maybe we can start breaking it down into pieces to make it easier to review and test
Some thoughts of things that might make their own good PRs
- Add tests -- for example, perhaps we can explicitly re-create the data from the spec here https://github.com/apache/parquet-format/blob/master/VariantShredding.md#objects as a test
- Add the
CopyingVariantBuilder
/ a way to copy parts of variants around - Rename the variants of
ShreddingState
- Create set of traits for
VariantAsPrimitive
, etc type traits for use converting to the arrow types
It seems like this PR has both sketches of reading shredded variant values as well as writing shredded variant which is quite cool; I know they are somewhat related, but maybe we can work on them one at a time
I am also happy to schedule a call to discuss next steps face to face -- feel free to drop me an email at [email protected] perhaps. I could also schedule something face to face for any contributor who might like to join us
/// } | ||
/// ``` | ||
/// | ||
/// In the above, each row of `v.value` is either a variant value (shredding failed, `v` was not an |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In going through this explanation, I think it might be helpful to make some specific examples of valid values. Maybe we can add them to the docs too -- something like
Fields could be shredded
{
"a" : {
"b": 42 <-- stored as an integer in typed_value field
}
}
Field could not be shredded
{
"a" : {
"b": "foo" <-- stored as a variant
}
}
Extra non shredded fields
{
"a" : {
"b": 42, <-- stored as an integer in typed_value field
"z": "foo" <-- stored as a variant in sub field?
}
}
But I am not sure about the last example, to be honest
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those examples all look correct, yes.
/// Partially shredded: | ||
/// * value is an object | ||
/// * typed_value is a shredded object. | ||
PerfectlyShredded { typed_value: ArrayRef }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding was that if typed_value
was non null, it contains the value and an implementation doesn't even have to check the value
field
If typed_value
is null, then we have to check the value
field and produce a null in the output when needed
/// | ||
/// TODO: move to arrow crate | ||
#[derive(Debug, Default, Clone)] | ||
pub struct StructArrayBuilder { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, there is already a class called StructBuilder
which helps build struct arrays row by row.
https://docs.rs/arrow/latest/arrow/array/struct.StructBuilder.html
Maybe we could add StructArrayBuilder
to the same location. With sufficient documentation examples, I think it would be straightforward to add
Is this something it would help if I looked into?
|
||
mod output; | ||
|
||
pub(crate) enum ShreddedPathStep<'a> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, this makes lots of sense
/// deeper. For a `VariantPathElement::Field`, the step fails if there is no `typed_value` at this | ||
/// level, or if `typed_value` is not a struct, or if the requested field name does not exist. | ||
/// | ||
/// TODO: Support `VariantPathElement::Index`? It wouldn't be easy, and maybe not even possible. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think we would likely have to copy the relevant bytes to a new arary
|
||
let field = field | ||
.as_any() | ||
.downcast_ref::<ShreddedVariantFieldArray>() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we could somehow figure out to make this be VariantArray
rather than ShreddedVariantFieldArray
I think that would be the most elegant / understandable
/// TODO: How would a caller request a struct or list type where the fields/elements can be any | ||
/// variant? Caller can pass None as the requested type to fetch a specific path, but it would | ||
/// quickly become annoying (and inefficient) to call `variant_get` for each leaf value in a struct or | ||
/// list and then try to assemble the results. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe one way to handle this case would be to extract the fields your code wants to work with as individual arrays, rather than trying to get back a single StructArray with the relevant fields
We don't really have a great story at the moment even for casting one struct array to another when the fields don't match up. See, for example, the discussion in
Therefore I think maybe we can actually punt on casting the result to a proper struct array recursively (we can certainly add a TODO item for a follow on PR)
} | ||
|
||
fn append_value(&mut self, value: &Variant<'_, '_>) -> Result<bool> { | ||
if let Some(v) = value.as_primitive() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is very clever
// We need a way to convert a Variant directly to bytes. In particular, we want to just copy | ||
// across the underlying value byte slice of a `Variant::Object` or `Variant::List`, without | ||
// any interaction with a `VariantMetadata` (because we will just reuse the existing one). | ||
// | ||
// One could _probably_ emulate this with parquet_variant::VariantBuilder, but it would do a | ||
// lot of unnecessary work and would also create a new metadata column we don't need. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I agree that we are missing some way to copy a variant or some subpart of the variant to a new Variant.
I wonder if we could make something that implements VariantBuilderExt
that is designed for this usecase
Something like
let (metadata, value) = get_variant(); // exitsting variant at metadata
let variant = Variant::new(metadata, value);
// Create a new variant builder for copying variants without recreating the metadata
let mut builder = CopyingVariantBuilder::new(existing_metadata, capacity);
// copying this value just copies the bytes, does not re-create the metadata
// errors if the metadata pointer doens't match maybe?
builder.append_value(&variant)?;
🤔
|
||
/// Used for actual shredding of binary variant values into shredded variant values | ||
#[allow(unused)] | ||
struct ShreddedVariantRowBuilder { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would this also be used to shred unshredded variant values to shredded arrays?
IMO they're more than just "somewhat related" -- if you squint, shredding is precisely a |
I see -- that is a fascinating duality and I agree it is the same I still think we will be better off if we can somehow split this up into smaller pieces, so I was trying to find points where we could carve out parts |
Oh, for sure. This pathfinding PR covers way too much territory to actually merge as a single PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alamb -- Thanks for the insightful comments, I replied to several below.
/// a: SHREDDED_VARIANT_FIELD { | ||
/// value: BINARY, | ||
/// typed_value: STRUCT { | ||
/// a: SHREDDED_VARIANT_FIELD { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// a: SHREDDED_VARIANT_FIELD { | |
/// b: SHREDDED_VARIANT_FIELD { |
/// } | ||
/// ``` | ||
/// | ||
/// In the above, each row of `v.value` is either a variant value (shredding failed, `v` was not an |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those examples all look correct, yes.
/// Partially shredded: | ||
/// * value is an object | ||
/// * typed_value is a shredded object. | ||
PerfectlyShredded { typed_value: ArrayRef }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding was that if
typed_value
was non null, it contains the value and an implementation doesn't even have to check thevalue
field
For primitive values and arrays, yes -- (NULL, NULL)
is an invalid case and (Variant::Null, NULL)
translates to Variant::Null
(which casts to NULL
of any other type).
Objects have a multi-step process to access field f
:
- Is the
typed_value
NULL?- Yes => Not even an object => have to use
value
(or just fail, since nothing casts to struct)- It is illegal for
value
to also be NULL, because this is the top-level
- It is illegal for
- No => [partially] shredded object => continue (ignore
value
)
- Yes => Not even an object => have to use
- Does the column
typed_value.f
exist?- No =>
f
was not in the shredding schema => have to checkvalue
which may or may not be a variant object - Yes =>
f
was in the shredding schema => continue
- No =>
- Is
typed_value.f.typed_value
NULL?- Yes => This row does not contain
f
, orf
is the wrong type => fall back totyped_value.f.value
- If that is also NULL then
f
is NULL; otherwise,f
could be any variant object (includingVariant::Null
)
- If that is also NULL then
- No => This row contains
f
and it is shredded (ignoretyped_value.f.value
)
- Yes => This row does not contain
/// | ||
/// TODO: move to arrow crate | ||
#[derive(Debug, Default, Clone)] | ||
pub struct StructArrayBuilder { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it depends on whether we think this is a good builder idiom to introduce?
At least, I haven't seen anything like it elsewhere in arrow-rs?
One nice property of this builder is it eliminates the risk of mismatch between field type and array type. But a public version of this API would still need to have a fallible alternative, because array lengths could still mismatch and cause a panic.
|
||
let field = field | ||
.as_any() | ||
.downcast_ref::<ShreddedVariantFieldArray>() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we continue this discussion in the other comment thread #8122 (comment), where I outlined the pros and cons of this struct?
/// TODO: How would a caller request a struct or list type where the fields/elements can be any | ||
/// variant? Caller can pass None as the requested type to fetch a specific path, but it would | ||
/// quickly become annoying (and inefficient) to call `variant_get` for each leaf value in a struct or | ||
/// list and then try to assemble the results. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe one way to handle this case would be to extract the fields your code wants to work with as individual arrays, rather than trying to get back a single StructArray with the relevant fields
Agreed, but that would be pretty inefficient if each individual leaf column needs a separate variant_get
call that has to repeat the work of pathing through common path prefixes. I suppose the caller could take on the work of memoizing common path prefixes, but then they're creating a lot of intermediate arrays. Not just annoying, but also slow because it forces columnar access for row-wise pathing.
We could potentially define a bulk API like variant_get_many
, but then we would need to somehow identify the common paths, deduplicate them, and then recursively process them... and that internal recursion would still need a way to say "just give me whatever is at the end of each path". If we found an elegant way to communicate that internally, we could just make that solution the public API and be done with it.
/// TODO: How would a caller request a struct or list type where the fields/elements can be any | ||
/// variant? Caller can pass None as the requested type to fetch a specific path, but it would | ||
/// quickly become annoying (and inefficient) to call `variant_get` for each leaf value in a struct or | ||
/// list and then try to assemble the results. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think maybe we can actually punt on casting the result to a proper struct array recursively
Assuming the caller did, in fact, ask for a fully strongly typed struct (no variant leaf fields), I'm pretty sure this PR already covers the solution? Even if they ask for a variant leaf field, they just pay a performance penalty to shred/unshred/reshred each field if the underlying shredding doesn't match the requested shredding.
/// TODO: How would a caller request a struct or list type where the fields/elements can be any | ||
/// variant? Caller can pass None as the requested type to fetch a specific path, but it would | ||
/// quickly become annoying (and inefficient) to call `variant_get` for each leaf value in a struct or | ||
/// list and then try to assemble the results. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, an idea!
What if we introduce a companion utility, variant_get_schema
, which instead of returning an ArrayRef
returns a DataType
?
That way, a caller who wants to extract one or more variant fields as part of a larger struct can fetch the actual variant shredding schema for each such field and splice it into the data type they pass to variant_get
.
For example, suppose I want to work with the following logical schema:
STRUCT {
a: STRUCT {
id: INT,
payload: VARIANT,
},
b: STRUCT {
key: STRING,
payload: VARIANT,
},
}
Then I could make two calls to variant_get_schema
followed by a (now precisely targeted) call to variant_get
:
// hand-waving the path argument for simplicity...
let a_payload_type = variant_get_schema(input, &["a", "payload"])?;
let b_payload_type = variant_get_schema(input, &["b", "payload"])?;
let a_type = DataType::Struct(Fields::from([
Field::new("id", ...),
Field::new("payload", a_payload_type, ...),
]));
let b_type = DataType::Struct(Fields::from([
Field::new("key", ...),
Field::new("payload", b_payload_type, ...),
]));
let data_type = DataType::Struct(Fields::from([
Field::new("a", a_type, ...),
Field::new("b", b_type, ...),
]));
let data = variant_get(input, &[], Some(data_type))?;
If the path ended on a shredded variant, variant_get_schema
would return the corresponding shredding schema, allowing variant_get
to return the corresponding sub-arrays unchanged. Otherwise, it would just return a binary variant schema, since variant_get
would anyway have to extract the bytes row by row.
// We need a way to convert a Variant directly to bytes. In particular, we want to just copy | ||
// across the underlying value byte slice of a `Variant::Object` or `Variant::List`, without | ||
// any interaction with a `VariantMetadata` (because we will just reuse the existing one). | ||
// | ||
// One could _probably_ emulate this with parquet_variant::VariantBuilder, but it would do a | ||
// lot of unnecessary work and would also create a new metadata column we don't need. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting thought! I had forgotten about VariantBuilderExt
. It almost solves the issue with building up (or filtering down) partially shredded objects -- its new_object
method could return a builder used to add the desired fields. Except (a) ObjectBuilder
is a concrete type, not a trait, so we can't customize its behavior; and (b) that still needs to mess with string field names and metadata dictionaries. We'd really want to define a raw VariantObjectField
-- basically a (field_id, bytes) pair -- and have a method for appending that to the builder.
|
||
/// Used for actual shredding of binary variant values into shredded variant values | ||
#[allow(unused)] | ||
struct ShreddedVariantRowBuilder { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's exactly what it would be used for. If the user passed a shredded variant type to variant_get
and the path ends inside a binary variant, we have no choice but to shred accordingly (**) (***).
If the user just passed a concrete type to variant_get
, and the path ends inside a binary variant, we would instead use e.g. PrimitiveVariantShreddingRowBuilder
or StructVariantShreddingRowBuilder
.
(**) Things get complicated fast, if the user passed a shredded variant type to variant_get
and the path ends inside a shredded variant with a different schema. This pathfinding PR didn't even touch that case, but it would probably look vaguely similar to the struct specialization, where we try to follow the common/compatible paths of the two shredding schemas (in hopes of returning unmodified columns directly) and fall back to actual builders only where the paths diverge.
(***) If the user passed a binary variant schema and the underlying data were shredded, then we have to use a BinaryVariantRowBuilder
to insert the variant bytes produced by e.g. VariantArray::value
.
} | ||
} | ||
|
||
/// Builder that extracts a struct. Casting failures produce NULL or error according to options. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DataType::Struct
-> StructArray
/// variant? Caller can pass None as the requested type to fetch a specific path, but it would | ||
/// quickly become annoying (and inefficient) to call `variant_get` for each leaf value in a struct or | ||
/// list and then try to assemble the results. | ||
pub fn variant_get(input: &ArrayRef, options: GetOptions) -> Result<ArrayRef> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From a call, ideas to make progress:
- Support
variant_get
for anyinput
(shredded or otherwise), any depth of object field path steps, and casting to e.g.Some(DataType::Int32)
. This should close the loop and is potentially a good candidate for first PR. - Support
variant_get
for any path and no data type. This would extract whatever variant sub-value is found at the path's end- Needs value byte copy capability
- Support
variant_get
forSome(DataType::Struct)
(nested shredding)- Needs read-only metadata builder capability
- Support
variant_get
forSome(DataType::VARIANT)
- Needs to build out the rest of the
VariantShreddingRowBuilder
implementations
- Needs to build out the rest of the
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you 🙏 -- my plan is to try and turn these into tickets tomorrow morning
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have filed follow on tickets to track this work
- [Variant] Support Shredded Objects in
variant_get
: typed path access (STEP 1) #8150 - [Variant] Allow appending raw object/list bytes to variant builders #8141
- [Variant] Support Shredded Objects in
variant_get
: untyped path access #8151 - [Variant] Support creating Variants with pre-existing Metadata #8152
- [Variant] Support Shredded Objects in
variant_get
: access asSome(DataType::Struct)
(nested shredding) #8153 - [Variant] Support Shredded Objects in
variant_get
: Access asVARIANT
(Unshredding) #8154
# Which issue does this PR close? - Closes #8150 - closes #8166 # Rationale for this change Add support for extracting fields from both shredded and non-shredded variant arrays at any depth (like "x", "a.x", "a.b.x") and casting them to Int32 with proper NULL handling for type mismatches. NOTE: This is a second attempt at * #8166 which suffered strong logical conflicts with * #8179 and which itself grew out of * #8122 See the other two PR for the vast majority of review commentary relating to this change. I started from the original PR commits (first three), performed the merge, and fixed up a bunch of issues. Manually diffing the before ([76b75ee..882aa4d](https://github.com/apache/arrow-rs/compare/76b75eebc..882aa4d69)) and after ([0ba91ae..f6fd915](https://github.com/apache/arrow-rs/compare/0ba91aed9..f6fd91583)) diffs gives the following non-trivial differences vs. the original PR * Ran `cargo fmt` * `typed_value_to_variant` now supports all primitive numeric types (previously only int16) * cast options plumbed through and respected * Fix a null buffer bug in `shredded_get_path` -- the original code was wrongly unioning in the null buffer from `typed_value` column: ```patch // Path exhausted! Create a new `VariantArray` for the location we landed on. - // Also union nulls from the final typed_value field we landed on - if let Some(typed_value) = shredding_state.typed_value_field() { - accumulated_nulls = arrow::buffer::NullBuffer::union( - accumulated_nulls.as_ref(), - typed_value.nulls(), - ); - } let target = make_target_variant( shredding_state.value_field().cloned(), shredding_state.typed_value_field().cloned(), accumulated_nulls, ); ``` * Remove the `get_variant_perfectly_shredded_int32_as_variant` test case, because #8179 introduced a battery of unit tests that cover the same functionality. * Remove now-unnecessary `.unwrap()` calls from object builder `finish` calls in unit tests * Fixed broken test code in `create_depth_1_shredded_test_data_working`, which captured the return value of a nested builder's `finish` (`()`) instead of the return value of the top-level builder. I'm not quite sure what this code was trying to do, but I changed it to just create a nested builder instead of a second top-level builder: ```patch fn create_depth_1_shredded_test_data_working() -> ArrayRef { // Create metadata following the working pattern from shredded_object_with_x_field_variant_array let (metadata, _) = { - let a_variant = { - let mut a_builder = parquet_variant::VariantBuilder::new(); - let mut a_obj = a_builder.new_object(); - a_obj.insert("x", Variant::Int32(55)); // "a.x" field (shredded when possible) - a_obj.finish().unwrap() - }; let mut builder = parquet_variant::VariantBuilder::new(); let mut obj = builder.new_object(); - obj.insert("a", a_variant); + + // Create the nested "a" object + let mut a_obj = obj.new_object("a"); + a_obj.insert("x", Variant::Int32(55)); + a_obj.finish(); + obj.finish().unwrap(); builder.finish() }; ``` * Similar fix (twice, `a_variant` and `b_variant`) for `create_depth_2_shredded_test_data_working` * `make_shredding_row_builder` now supports signed int and float types (unsigned int not supported yet) * A new `get_type_name` helper in row_builder.rs that gives human-readable data type names. I'm not convinced it's necessary (and the code is in the wrong spot, jammed in the middle of `VariantAsPrimitive` code. * `impl VariantAsPrimitive` for all signed int and float types * `PrimitiveVariantShreddingRowBuilder` now has a lifetime param because it takes a reference to cast options (it now respects unsafe vs. safe casting) # What changes are included in this PR? Everything in the original PR, plus merge in the main branch, fix logical conflicts and fix various broken tests. # Are these changes tested? All unit tests now pass. # Are there any user-facing changes? No (variant is not public yet) --------- Co-authored-by: carpecodeum <[email protected]> Co-authored-by: Andrew Lamb <[email protected]>
A bit of pathfinding relating to
variant_get
and its connection to variant shredding.variant_get
#8083Intended as a conversation-starter, not targeting merge (too big, too incomplete, etc).