[Variant] Add `variant_get` and Shredded VariantArray #8021
Conversation
Force-pushed from cd55763 to adf552c
Changed the title: `variant_get` for shredded variants → `variant_get` and Shredded variants
shredding_state: ShreddingState,
}

/// Variant arrays can be shredded in one of three states, encoded here
With this enum it is easier to know what the state of any particular array is
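For orientation, a rough sketch of what such an enum could look like, pieced together from the match arms quoted later in this review (the field types are assumptions, not the PR's exact definition):

use arrow::array::{ArrayRef, BinaryViewArray};

// Rough sketch only: variant names mirror the match arms quoted later in this
// review; the exact field types in the PR may differ.
#[derive(Debug)]
pub enum ShreddingState {
    /// metadata + value only: a classic, unshredded variant column
    Unshredded {
        metadata: BinaryViewArray,
        value: BinaryViewArray,
    },
    /// metadata + typed_value only: every row was shredded into a typed array
    FullyShredded {
        metadata: BinaryViewArray,
        typed_value: ArrayRef,
    },
    /// metadata + value + typed_value: rows that failed to shred fall back to value
    PartiallyShredded {
        metadata: BinaryViewArray,
        value: BinaryViewArray,
        typed_value: ArrayRef,
    },
}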
let metadata = self.metadata_field().as_binary_view().value(index);
let value = self.value_field().as_binary_view().value(index);
Variant::new(metadata, value)
pub fn value(&self, index: usize) -> Option<Variant> {
This is the first part -- accessing a single Variant value from a potentially shredded VariantArray
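As a usage sketch (assuming `VariantArray` implements the `Array` trait and `value` behaves as debated later in this thread; this is illustrative, not the PR's test code):

use arrow::array::Array;
use parquet_variant_compute::VariantArray;

/// Print each row of a possibly shredded VariantArray.
/// Like other Arrow arrays, callers should check nulls before reading a value.
fn print_rows(array: &VariantArray) {
    for i in 0..array.len() {
        if array.is_null(i) {
            println!("row {i}: null");
        } else {
            println!("row {i}: {:?}", array.value(i));
        }
    }
}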
@@ -1,180 +0,0 @@
// Licensed to the Apache Software Foundation (ASF) under one
I actually left most of this file the same but renamed it to variant_get/mod.rs, despite how github is rendering it
)
})?;

// Create the output writer based on the specified output options
The core design for get is here: different potential builders depending on the type of the output array, which gives us a place to put special code for each output array type. This is the OutputBuilder trait -- perhaps someone can come up with a better name.
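A hedged sketch of the OutputBuilder idea -- the method names and signatures below are guesses assembled from the snippets quoted in this review, not the PR's final trait definition:

use arrow::array::{ArrayRef, BinaryViewArray};
use arrow::error::Result;

pub trait OutputBuilder {
    /// input had only metadata + value (unshredded)
    fn unshredded(&self, metadata: &BinaryViewArray, value: &BinaryViewArray) -> Result<ArrayRef>;

    /// input had only metadata + typed_value (perfectly shredded)
    fn perfectly_shredded(&self, metadata: &BinaryViewArray, typed_value: &ArrayRef)
        -> Result<ArrayRef>;

    /// input had metadata + value + typed_value (imperfectly shredded)
    fn partially_shredded(
        &self,
        metadata: &BinaryViewArray,
        value: &BinaryViewArray,
        typed_value: &ArrayRef,
    ) -> Result<ArrayRef>;
}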
);
}

/// Shredding: extract a value as a VariantArray
here are the new shredded tests
/// Trait for Arrow primitive types that can be used in the output builder
///
/// This just exists to add a generic way to convert from Variant to the primitive type
pub(super) trait ArrowPrimitiveVariant: ArrowPrimitiveType {
This one is a bit complicated due to generics, but I think it will work for all primitive types (ints, floats, etc). It handles extracting strongly typed values from variants.
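Only the trait name and the ArrowPrimitiveType supertrait come from the quoted code; the method below is a guess at what "convert from Variant to the primitive type" could look like:

use arrow::datatypes::{ArrowPrimitiveType, Int32Type, Int64Type};
use parquet_variant::Variant;

trait ArrowPrimitiveVariant: ArrowPrimitiveType {
    /// Extract this type's native value from a Variant, if it holds one
    fn from_variant(variant: &Variant) -> Option<Self::Native>;
}

impl ArrowPrimitiveVariant for Int32Type {
    fn from_variant(variant: &Variant) -> Option<i32> {
        match variant {
            Variant::Int32(v) => Some(*v),
            _ => None,
        }
    }
}

impl ArrowPrimitiveVariant for Int64Type {
    fn from_variant(variant: &Variant) -> Option<i64> {
        match variant {
            Variant::Int64(v) => Some(*v),
            _ => None,
        }
    }
}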
_value_field: &BinaryViewArray,
typed_value: &ArrayRef,
) -> arrow::error::Result<ArrayRef> {
    // build up the output array element by element
This is the code that converts a shredded variant into a typed output -- It can probably be made quite a bit faster with some more deliberate vectorization, but I think it is relatively simple and functional
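A minimal sketch of that element-by-element loop for an Int32 output, with assumed argument names (the PR's generic version goes through the primitive trait above):

use std::sync::Arc;
use arrow::array::{Array, ArrayRef, AsArray, BinaryViewArray, Int32Builder};
use arrow::datatypes::Int32Type;
use arrow::error::Result;
use parquet_variant::Variant;

/// Build an Int32 output from an (im)perfectly shredded variant column:
/// prefer the shredded typed_value, otherwise decode the binary variant.
fn shredded_to_int32(
    metadata: &BinaryViewArray,
    value: &BinaryViewArray,
    typed_value: &ArrayRef,
) -> Result<ArrayRef> {
    let typed = typed_value.as_primitive::<Int32Type>();
    let mut builder = Int32Builder::with_capacity(typed.len());
    for i in 0..typed.len() {
        if !typed.is_null(i) {
            // row was shredded: copy the strongly typed value through
            builder.append_value(typed.value(i));
        } else {
            // fall back to the binary variant encoding for this row
            match Variant::new(metadata.value(i), value.value(i)) {
                Variant::Int32(v) => builder.append_value(v),
                // non-Int32 rows: error/null behavior is still being decided in the PR
                _ => builder.append_null(),
            }
        }
    }
    Ok(Arc::new(builder.finish()))
}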
It took a minute to realize that "partially shredded" here does not refer to the variant shredding spec definition:

An object is partially shredded when the `value` is an object and the `typed_value` is a shredded object. Writers must not produce data where both `value` and `typed_value` are non-null, unless the Variant value is an object.

and

The `value` column of a partially shredded object must never contain fields represented by the Parquet columns in `typed_value` (shredded fields). Readers may always assume that data is written correctly and that shredded fields in `typed_value` are not present in `value`.

This function seems to actually refer to "imperfectly shredded" columns, where we need a `value` column as a fallback for input values that fail to shred? As opposed to "perfectly shredded" columns that only need a `typed_value` column (or where the `value` column is all-null).
This is an excellent call -- I have updated the names to match the shredding specification
typed_value: &ArrayRef,
) -> arrow::error::Result<ArrayRef> {
    // if the types match exactly, we can just return the typed_value
    if typed_value.data_type() == self.as_type.data_type() {
here is the fast path if we have a perfectly shredded value
// in this case dispatch on the typed_value and
// TODO macro'ize this using downcast! to handle all other primitive types
// TODO(perf): avoid builders entirely (and write the raw variant directly as we know the metadata is the same)
let mut array_builder = VariantArrayBuilder::new(variant_array.len());
This is the logic to reconstruct `Variant`s from typed values
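Roughly, for an Int32 typed_value column the reconstruction loop looks like this (the builder method names are assumptions based on how VariantArrayBuilder is used elsewhere in this PR):

use arrow::array::{Array, ArrayRef, AsArray};
use arrow::datatypes::Int32Type;
use parquet_variant::Variant;
use parquet_variant_compute::VariantArrayBuilder;

/// Re-encode a shredded Int32 column back into variant values.
fn int32_to_variants(typed_value: &ArrayRef) -> VariantArrayBuilder {
    let ints = typed_value.as_primitive::<Int32Type>();
    let mut builder = VariantArrayBuilder::new(ints.len());
    for i in 0..ints.len() {
        if ints.is_null(i) {
            builder.append_null();
        } else {
            // the metadata for a primitive variant is trivial, so we can
            // simply re-encode the typed value as a Variant
            builder.append_variant(Variant::from(ints.value(i)));
        }
    }
    builder
}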
_metadata: &BinaryViewArray,
_value_field: &BinaryViewArray,
) -> arrow::error::Result<ArrayRef> {
    let mut builder = VariantArrayBuilder::new(variant_array.len());
This is the case that is handled on main
Force-pushed from adf552c to 3ba9fe0
Changed the title: `variant_get` and Shredded variants → `variant_get` and Shredded VariantArray
@scovich @carpecodeum and @Samyak2 I think this PR is ready for review
WIP review
let met = self.metadata_ref.slice(offset, length);
let val = self.value_ref.slice(offset, length);
let inner = self.inner.slice(offset, length);
let shredding_state = self.shredding_state.slice(offset, length);
Arc::new(Self {
can we avoid unnecessary allocations for common cases, like -
let shredding_state = match (&self.shredding_state, offset, length) {
// Fast path: no slice needed for full array
(state, 0, len) if len == self.len() => state.clone(),
// Fast path: uniform shredding doesn't need slicing
(ShreddingState::None, _, _) => ShreddingState::None,
(ShreddingState::All, _, _) => ShreddingState::All,
// Only slice for mixed case
(ShreddingState::Mixed(bitmap), offset, length) => {
ShreddingState::Mixed(bitmap.slice(offset, length))
}
};
We could definitely add a fast path for slicing the whole array.
I am not quite sure what ShreddingState::All and ShreddingState::Mixed are supposed to represent
Out of curiosity, would we actually expect a redundant `slice` call to be notably more expensive than `clone`? Seems like they'd do the same thing?
For example, `BooleanArray::slice` is:
pub fn slice(&self, offset: usize, length: usize) -> Self {
Self {
values: self.values.slice(offset, length),
nulls: self.nulls.as_ref().map(|n| n.slice(offset, length)),
}
}
with `BooleanBuffer::slice`:
pub fn slice(&self, offset: usize, len: usize) -> Self {
assert!(
offset.saturating_add(len) <= self.len,
"the length + offset of the sliced BooleanBuffer cannot exceed the existing length"
);
Self {
buffer: self.buffer.clone(),
offset: self.offset + offset,
len,
}
}
Assuming the compiler inlines things as aggressively as it usually does, it seems like the net difference would be the bounds check and other offset arithmetic.
I agree I don't expect any measurable performance improvement
This is nice!
I only had one concern around paths. Other comments are nitpicks.
/// returns the non-null element at index as a Variant
fn typed_value_to_variant(typed_value: &ArrayRef, index: usize) -> Variant {
    match typed_value.data_type() {
        DataType::Int32 => {
            let typed_value = typed_value.as_primitive::<Int32Type>();
            Variant::from(typed_value.value(index))
        }
        // todo other types here
        _ => {
            todo!(); // Unsupported typed_value type
        }
    }
}
This could be useful as a (public) function on its own. For example, this could be used to cast columns to variant like in some_column::variant
This is a great idea -- I am working to pull it into its own cast_to_variant function
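A hypothetical sketch of what that kernel could look like -- the name cast_to_variant comes from the comment above, but the signature and the builder method names are assumptions, not code from this PR:

use arrow::array::{Array, AsArray};
use arrow::datatypes::{DataType, Int32Type};
use arrow::error::ArrowError;
use parquet_variant::Variant;
use parquet_variant_compute::{VariantArray, VariantArrayBuilder};

/// Convert a typed Arrow array into a VariantArray, row by row.
/// Only Int32 is shown; a real kernel would cover all supported types.
fn cast_to_variant(input: &dyn Array) -> Result<VariantArray, ArrowError> {
    let mut builder = VariantArrayBuilder::new(input.len());
    match input.data_type() {
        DataType::Int32 => {
            let input = input.as_primitive::<Int32Type>();
            for i in 0..input.len() {
                if input.is_null(i) {
                    builder.append_null();
                } else {
                    builder.append_variant(Variant::from(input.value(i)));
                }
            }
        }
        dt => {
            return Err(ArrowError::NotYetImplemented(format!(
                "cast_to_variant for {dt}"
            )))
        }
    }
    Ok(builder.build())
}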
I spent some time on this, and I found it was not as simple as I would have liked. Thus I have filed a ticket to track the idea
I filed a ticket here
I will pull that code out into its own PR for easier review as well
Note I couldn't quite figure out how to reuse that kernel code in this PR
That's unfortunate. Maybe a separate issue for that?
I thought more about this last night -- the reason I think we'll need special code other than `cast_to_variant` is that `cast_to_variant` does the downcast to the typed array once, and then loops with the downcast array, whereas this code needs to downcast once and then convert 🤔 maybe someone can figure out something clever with macros
I had come up with some ideas before I remembered the `downcast_to_primitive!` macro, but I think that would be the approach to take. Basically, a macro that just does the array casting and allows the caller to inject whatever they want. This site here would inject a scalar lookup, while the cast kernel would inject a for-loop.
Things get a lot trickier if we want to eliminate references to data types, tho, as the `downcast_to_primitive!` experiment in the other PR showed. We need a lot of homogeneity, or at least to know very clearly what data types are not covered so we can add specific match arms for them.
Basically, a macro that just does the array casting and allows the caller to inject whatever they want. This site here would inject a scalar lookup, while the cast kernel would inject a for-loop.
I think this is a good idea. I am not sure we can use `downcast_to_primitive` as it dispatches based on the underlying primitive type (e.g. you get an `i128` for a decimal, but don't get the precision and scale). Maybe someone else can figure out how to make it work.
We could maybe make our own variant-specific macro though 🤔
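For example, a hypothetical variant-specific macro along those lines might look like this (arm list abbreviated; nothing here is from the PR):

use arrow::array::AsArray;
use arrow::datatypes::{DataType, Int32Type, Int64Type};
use parquet_variant::Variant;

// The macro owns the DataType match and downcast; the caller injects the
// per-array expression (a scalar lookup here, a full loop in a cast kernel).
// Decimals and similar types would need dedicated arms because they carry
// extra parameters (precision/scale) beyond the native type.
macro_rules! with_typed_value {
    ($array:expr, |$typed:ident| $body:expr, $otherwise:expr) => {
        match $array.data_type() {
            DataType::Int32 => {
                let $typed = $array.as_primitive::<Int32Type>();
                $body
            }
            DataType::Int64 => {
                let $typed = $array.as_primitive::<Int64Type>();
                $body
            }
            _ => $otherwise,
        }
    };
}

/// Example of the "scalar lookup" injection this thread talks about.
fn typed_value_at(typed_value: &dyn arrow::array::Array, index: usize) -> Option<Variant> {
    with_typed_value!(
        typed_value,
        |arr| Some(Variant::from(arr.value(index))),
        None
    )
}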
// if the typed value is null, decode the variant and extract the value
if typed_value.is_null(i) {
    // todo follow path
I have one concern with this. I think the pathing should be done independently of the output builder. Paths are only valid on objects/lists inside the variant. I don't see why every type's output builder needs to be concerned with the path.
Even for shredded objects, we can bring the nested value/typed_value fields up and pass them along like a top-level shredded variant array.
What do you think? Let me know if I missed something (I wasn't actively following the discussions around this).
I think you are probably right that the path handling should be handled elsewhere. I haven't worked out exactly how extracting paths would be implemented. Maybe that would be a good follow-on PR.
Why would there be a path at all for this primitive output builder, sorry?
Seems like that only matters for array and object builders?
Ah, I think I see -- for a strongly-typed builder, the caller could do all the pathing and just pass in the correct leaf column here. But this is variant, so we'd actually need to extract the variant leaf value (using a path) on a per-row basis. We'd either have to do all the pathing here (recursively), or the caller would have to extract the struct/array once and recurse on its fields. And the latter would require a row-oriented builder, where this builder seems to be column-oriented.
We'd either have to do all the pathing here (recursively), or the caller would have to extract the struct/array once and recurse on its fields.
Yes, I was imagining that this would do the pathing here on each row -- I am not sure what you mean by extract the struct array (as I don't think we can extract variants as structs in general, as the structures vary row by row)
We'd either have to do all the pathing here (recursively), or the caller would have to extract the struct/array once and recurse on its fields.
Yes, I was imagining that this would do the pathing here on each row -- I am not sure what you mean by extract the struct array (as I don't think we can extract variants as structs in general, as the structures vary row by row)
Right, once we get into "true" (unshredded) variant data, we are stuck going row by row. But we can handle the pathing for the (possibly empty) prefix of perfectly shredded columns on a column-wise basis. Identify them when constructing the builders, and then just "snap" them into place when finalizing the builder. No per-row shenanigans needed. If a path isn't perfectly shredded, then the builder needs to take the variant column as input and finish the pathing row by row.
For example, suppose that the user requested path `v:a.b.c.d::INT`, and that `a` and `b` were shredded but not `c`. Then the corresponding builder would take `v.typed_value.a.typed_value.b.value` as its input column, with `c.d::INT` as the path to extract on a per-row basis. And then its finalizer would return the resulting (now shredded) `ArrayRef` for the parent builder for `v:a.b` to snap into place.
Meanwhile, if the user also requested `v:a.b.x.y::INT` which happens to be perfectly shredded, then the corresponding "builder" would take `v.typed_value.a.typed_value.b.typed_value.x.typed_value.y.typed_value` as its input column. It would not need to extract anything (so its per-row operation is a no-op) and its finalizer just needs to return the unchanged input `ArrayRef` so the parent builder for `v:a.b` can "snap" it into the correct place while assembling the `StructArray` that represents `v:a.b`.
Sounds good -- I will be honest, I didn't follow the entire thing but it sounds like it is on the right track to me.
Given the length and size of this PR and the comments, I plan to merge it now and I think we'll end up working these questions out as we implement object shredding, under the aegis of
Whoop, I somehow missed the notify. I'll try to take a look in the next day or two!
No worries -- I'll address @carpecodeum and @Samyak2's comments in the meantime
Co-authored-by: Samyak Sarnayak <[email protected]>
Ping or request a review when ready?
Force-pushed from 222eb62 to 924155c
I think this is ready for review - it has not substantially changed
To be clear, my ideal outcome would be:
Running out of time for today, so will flush what little I have so far.
Tomorrow is another day.
let Some(metadata) = metadata_field.as_binary_view_opt() else {
    return Err(ArrowError::NotYetImplemented(format!(
aside: Out of curiosity, do we have any style/preference guidelines for
let Some(foo) = foo_opt else {
return Err(...);
};
vs.
let foo = foo_opt.ok_or_else(|| {
    ...
})?;
?
The former seems both more direct and less verbose, at least in this case?
Maybe `ok_or_else` is mostly (only?) useful as part of a bigger monadic chain?
I personally prefer this pattern for the reasons you suggest. However, I don't know of anything formal we have discussed or agreed on
let Some(foo) = foo_opt else {
return Err(...);
};
match &self.shredding_state {
    ShreddingState::Unshredded { metadata, value } => {
        Variant::new(metadata.value(index), value.value(index))
    }
    ShreddingState::FullyShredded {
        metadata: _,
        typed_value,
    } => {
        if typed_value.is_null(index) {
            Variant::Null
        } else {
            typed_value_to_variant(typed_value, index)
        }
    }
    ShreddingState::PartiallyShredded {
        metadata,
        value,
        typed_value,
    } => {
        if typed_value.is_null(index) {
            Variant::new(metadata.value(index), value.value(index))
        } else {
            typed_value_to_variant(typed_value, index)
        }
    }
}
Part of me wonders whether this "shredding state" enum is actually helping, vs. just storing `value` and `typed_value` as options? Especially since the shredding spec seems to suggest that the none-none case is a valid combination?

required group measurement (VARIANT) {
  required binary metadata;
  optional binary value;
  optional int64 typed_value;
}

and

(the file should be able to omit both physical columns, in the case of a perfectly shredded all-null logical column)
Suggested change:
// Always prefer typed_value over value (if present).
match self.typed_value.as_ref().and_then(|tv| typed_value_to_variant(tv, index)) {
    Some(typed_value) => typed_value,
    None => self.value.as_ref().map_or(
        Variant::Null,
        |v| Variant::new(self.metadata.value(index), v.value(index)),
    ),
}
(this assumes that `typed_value_to_variant` includes the null check and returns `Option<Variant>`)
Part of me wonders whether this "shredding state" enum is actually helping, vs. just storing `value` and `typed_value` as options?

The reason I first introduced the enum was that in `variant_get` I needed to dispatch to different methods based on the type of shredding, and I found myself really wanting to have names for them rather than just the combinations of value and typed value.
Another reason was that I ended up with redundant error checking that was in `VariantArray::try_new` (though if perfectly null arrays are allowed, maybe this becomes irrelevant)
Especially since the shredding spec seems to suggest that the none-none case is a valid combination?
I personally think having names for the different shredding types makes working with the code easier, even if all 4 combinations are valid.
I think you are right, however, that the shredding spec allows both to be None. I double checked the Arrow Proposal as it also implies both being null is permitted:
- An optional field named value that is binary, large_binary, or binary_view
- An optional field named typed_value which can be any primitive type or be a
My proposal is to leave the ShreddingState enum in and I will file a follow on ticket to support the none-none case
The reason I first introduced the enum was that in `variant_get` I needed to dispatch to different methods based on the type of shredding, and I found myself really wanting to have names for them rather than just the combinations of value and typed value.
I guess... but it seems like there's so much code just managing the resulting enum variants that I question whether it's a net win. Like the getters for metadata, value, etc. or the above code that shrinks from 25 lines to seven while handling more cases.
Clearly either way can be made correct, tho, so I guess it's a matter of preference.
yeah, I agree it is a matter of preference. If someone else feels strongly, I am not opposed to changing it
self.shredding_state.metadata_field()
}

fn find_value_field(array: &StructArray) -> Option<ArrayRef> {
    array.column_by_name("value").cloned()
/// Return a reference to the value field of the `StructArray`
pub fn value_field(&self) -> Option<&BinaryViewArray> {
    self.shredding_state.value_field()
}

/// Return a reference to the metadata field of the [`StructArray`]
pub fn metadata_field(&self) -> &ArrayRef {
    // spec says fields order is not guaranteed, so we search by name
    &self.metadata_ref
/// Return a reference to the typed_value field of the `StructArray`, if present
pub fn typed_value_field(&self) -> Option<&ArrayRef> {
    self.shredding_state.typed_value_field()
These become trivial if we ditch the `ShreddingState` enum
}

/// returns the non-null element at index as a Variant
fn typed_value_to_variant(typed_value: &ArrayRef, index: usize) -> Variant {
Suggest to make this return `Option<Variant>` so callers don't have to check for null themselves.
Suggested change:
fn typed_value_to_variant(typed_value: &ArrayRef, index: usize) -> Variant {
    if typed_value.is_null(index) {
        return None;
    }
other Arrow Array apis don't return `Option`, they instead return the value directly, requiring you to check `is_null` instead -- see https://docs.rs/arrow/latest/arrow/array/struct.PrimitiveArray.html#method.value for example
/// Consistently with other Arrow arrays types, this API requires you to
/// check for nulls first using [`Self::is_valid`].
I think the reason is to allow a better chance at LLVM vectorizing the code, which I don't think is likely relevant here.
We could make the Variant API deviate from this pattern (it is already different in several other ways) and return `Option<Variant>`.
other Arrow Array apis don't return `Option`, they instead return the value directly, requiring you to check `is_null` instead
For public API -- 100% agree. This is a private internal API tho, so it seems like we have room to do what we think makes the code simple/maintainable. Pulling important work inside the method instead of requiring all callers to remember it seems like a good example of that.
We can always change it if we discover it hurts performance or readability.
I think the reason is to allow a better chance at LLVM vectorizing the code
AFAIK, arrow normally requires all array entries to be physically valid, even when logically null. That way, one can perform columnar operations blindly and then just use the null masks to ignore the unwanted values after the fact. Here, we're accessing `is_null` and `value` both inside the loop -- and the latter conditionally -- so I'd be very surprised if LLVM was willing to inject vectorization code that requires touching values the code said not to touch.
For public API -- 100% agree. This is a private internal API tho, so it seems like we have room to do what we think makes the code simple/maintainable. Pulling important work inside the method instead of requiring all callers to remember it seems like a good example of that.
Sorry -- I wasn't clear -- the reason I was talking about `VariantArray::value` is that it is the only caller of `typed_value_to_variant`, so if we return `Option` from this function, we would just be stuck at the same place 🤔
Here, we're accessing is_null and value both inside the loop -- and the latter conditionally -- so I'd be very surprised if LLVM was willing to inject vectorization code that requires touching values the code said not to touch.
Yeah I agree I don't think it will make any difference for performance with Variants. The primary reason in my mind is to be consistent with other APIs.
I think in an earlier version of this PR I actually had changed `value()` to return `Option`. Maybe changing the signature is a good idea 🤔
match typed_value.data_type() {
    DataType::Int32 => {
        let typed_value = typed_value.as_primitive::<Int32Type>();
        Variant::from(typed_value.value(index))
Good use for downcast_primitive_array macro?
Suggested change:
downcast_primitive_array! {
    typed_value => Some(Variant::from(typed_value.value(index))),
I couldn't make that work as it needs more information than just the native type (e.g. the precision/scale for decimals).
@superserious-dev's `cast_conversion` macro I think can be adapted eventually to handle this case
(_metadata_field, None, None) => Err(ArrowError::InvalidArgumentError(String::from(
    "VariantArray has neither value nor typed_value field",
I'm pretty sure the shredding spec allows this case. It corresponds to a perfectly shredded all-null column where the writer chose to omit both of them as unnecessary.
(see other comment above)
Makes sense -- I'll file a follow on ticket to add support
Co-authored-by: Ryan Johnson <[email protected]>
looks great!
Dropped a bunch of thoughts about nesting and pathing. Hopefully helpful.
&self.value_ref
/// Variant arrays can be shredded in one of three states, encoded here
#[derive(Debug)]
pub enum ShreddingState {
For a future `FullyShreddedAllNull` variant (neither `value` nor `typed_value` present), would we still need to store the `metadata` even tho it's never actually used? 🤔
I am not sure
I filed a ticket to track adding AllNull:
// Dispatch based on the shredding state of the input variant array
match variant_array.shredding_state() {
One (big) complexity here -- the requested type could differ wildly from how the input data are currently shredded.
Pathing-wise, we'd need to drill as deeply as possible into the variant array's physical shredding schema, following the requested path. In the happy path, the requested path was shredded and we just have to deal with the corresponding `(value, typed_value)` pair of columns.
Example
Suppose we're trying to extract `v:a.b::INT` from a `VariantArray` with the following physical layout:
v: VARIANT {
metadata: BINARY,
value: BINARY,
typed_value: STRUCT {
a: SHREDDED_VARIANT {
value: BINARY,
typed_value: STRUCT {
b: SHREDDED_VARIANT {
value: BINARY,
typed_value: INT,
},
},
},
},
}
Then we're actually doing `v.typed_value.a.typed_value.b::INT` -- the builder only has to worry about reconciling `v.typed_value.a.typed_value.b.value` with `v.typed_value.a.typed_value.b.typed_value` (which can be passed in directly at construction time)
This case has an even easier example, if the caller requested `v:a` (an untyped variant extract) -- in that case we simply combine `v.typed_value.a` with `v.metadata` into a new VariantArray -- no casting needed because `VariantArray` can capture the full structure of `v:a` no matter how it was physically shredded.
In a less-happy path, we will hit a partially shredded object where the requested path element is not available in `typed_value`, and we must go row-oriented from there (with the remaining path extraction performed on a per-row basis):
Example
Suppose we're trying to extract `v:a.b::INT` from a `VariantArray` with the following physical layout:
v: VARIANT {
metadata: BINARY,
value: BINARY,
typed_value: STRUCT {
a: SHREDDED_VARIANT {
value: BINARY,
typed_value: STRUCT {
x: SHREDDED_VARIANT {
value: BINARY,
typed_value: INT,
},
},
},
},
}
Then we're actually doing `v.typed_value.a:b::INT` -- we have to pass `v.typed_value.a.value` into the builder, which has to extract field `b` from every row.
In an unhappy path, the partial shredding of the input disagrees with the output's requested shredding, and we have to build up a struct from both shredded and unshredded paths.
Example
Suppose the requested schema asks for both `v:a.b::INT` and `v:a.x::INT`, when the VariantArray's layout is:
v: VARIANT {
metadata: BINARY,
value: BINARY,
typed_value: STRUCT {
a: SHREDDED_VARIANT {
value: BINARY,
typed_value: STRUCT {
b: SHREDDED_VARIANT {
value: BINARY,
typed_value: INT,
},
},
},
},
}
Now we have to actually build the struct `v.a` with two fields by unioning shredded bits (`v.typed_value.a.typed_value.b::INT`) and unshredded bits (`v.typed_value.a.value:x::INT`).
In a really unhappy case, the caller requested a shredded variant type that disagrees with the input data's shredding. So some fields have to be shredded and others unshredded.
Example
Suppose we're trying to extract both `v:a` as a partially shredded variant with `x` shredded and `b` not shredded... from a `VariantArray` where `b` is shredded and `x` is not shredded:
v: VARIANT {
metadata: BINARY,
value: BINARY,
typed_value: STRUCT {
a: SHREDDED_VARIANT {
value: BINARY,
typed_value: STRUCT {
b: SHREDDED_VARIANT {
value: BINARY,
typed_value: INT,
},
},
},
},
}
Then not only would we need to path into `v.typed_value.a.typed_value.b::VARIANT` (which requires unshredding it and which produces an `o.typed_value.a.value_b` column), we'd also have to path into `v.typed_value.a.value::INT` (which requires shredding it and which contributes to both `o.typed_value.a.value_x` and `o.typed_value.a.typed_value` columns). And then we have to variant-union the `o.typed_value.a.value_b` and `o.typed_value.a.value_x` columns into a single `o.typed_value.a.value` column, and pack everything into a new `VariantArray`.
Yeah, exactly, this I think is the key challenge of this code -- each different type of output needs to handle all the possible input shredding types as well.
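To make the first example layout above concrete in Arrow terms, the corresponding DataType could be assembled roughly like this (a sketch; the BinaryView choices and field nullability here are assumptions based on the Arrow shredding proposal mentioned earlier, not code from this PR):

use arrow::datatypes::{DataType, Field, Fields};

/// value + optional typed_value pair used at each shredding level
fn shredded_pair(typed_value: Option<DataType>) -> Vec<Field> {
    let mut fields = vec![Field::new("value", DataType::BinaryView, true)];
    if let Some(dt) = typed_value {
        fields.push(Field::new("typed_value", dt, true));
    }
    fields
}

/// v { metadata, value, typed_value: { a: { value, typed_value: { b: { value, typed_value: Int32 } } } } }
fn example_layout() -> DataType {
    let b = DataType::Struct(Fields::from(shredded_pair(Some(DataType::Int32))));
    let a_typed = DataType::Struct(Fields::from(vec![Field::new("b", b, true)]));
    let a = DataType::Struct(Fields::from(shredded_pair(Some(a_typed))));
    let v_typed = DataType::Struct(Fields::from(vec![Field::new("a", a, true)]));
    let mut v = vec![Field::new("metadata", DataType::BinaryView, false)];
    v.extend(shredded_pair(Some(v_typed)));
    DataType::Struct(Fields::from(v))
}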
/// if `as_type` is None, the returned array will itself be a VariantArray.
///
/// if `as_type` is `Some(type)` the field is returned as the specified type.
pub as_type: Option<FieldRef>,
Presumably `None` and `Some(VariantType)` are equivalent? If so, should we just default-initialize `as_type: FieldRef` to `VariantType`?
Oh... but variant is not a single type... every possible shredding (or lack thereof) is a different physical `DataType`. So by passing `None`, we request to keep whatever structure the underlying path already has; by passing `Some(... variant type ...)` we're requesting to [un|re]shred as necessary to make it fit the new shredding schema.
/// Construct options to get the specified path as a variant.
pub fn new_with_path(path: VariantPath<'a>) -> Self {
I guess path is optional (defaults to empty) because a leaf extract of a perfectly shredded variant doesn't need any internal pathing -- whoever constructs the builder just passes the leaf columns and the builder only needs to cast them appropriately.
Currently, the output builder seems to be fully column-oriented -- it assumes that all values for each leaf column are extracted in a tight loop. This can work for primitive builders, but nested builders will quickly run into pathing and efficiency problems.
I think we'll need to do something similar to the JSON builder, with a row-oriented approach where each level of a nested builder receives an already-constructed `Variant` for the current row and does a field extract for each child builder; child builders can then cast the result directly or recurse further as needed (based on their own type). And then the top-level builder call would construct a `Variant` for each row to kick-start the process.
But see the other comment -- to the extent that the shredding aligns nicely, we can hoist a subset of this per-row pathing of the append method up to columnar pathing of the builder's constructor and finalizer.
I think this aligns with my thinking and is nicely described -- I will use it as the base for the next set of tickets perhaps
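A minimal sketch of that row-oriented shape, just to make the idea concrete (the trait name and signatures are hypothetical, not part of this PR):

use arrow::array::ArrayRef;
use arrow::error::ArrowError;
use parquet_variant::Variant;

/// Each level of a nested output receives the Variant already navigated to its
/// level for the current row; nested implementations extract a field/element
/// per child builder and recurse, while leaves cast to their target type.
trait RowOrientedBuilder {
    /// Append one row's value (None when the row is null at this level)
    fn append_row(&mut self, value: Option<&Variant>) -> Result<(), ArrowError>;

    /// Finish and return the built column for this level
    fn finish(&mut self) -> Result<ArrayRef, ArrowError>;
}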
Co-authored-by: Ryan Johnson <[email protected]>
…rrow-rs into alamb/variant_shredding_strawman
I have gone through the comments in this PR -- thank you @carpecodeum @Samyak2 and @scovich. I would very much like to unblock other people working on various parts of variant_get in parallel. Thus, my plans now are:
// todo for other primitive types
// todo for other primitive types
Ok, I filed a bunch more tickets now
if typed_value.data_type() == self.as_type.data_type() {
    Ok(typed_value.clone())
} else {
    // TODO: try to cast the typed_value to the desired type?
Suggested change:
// TODO: try to cast the typed_value to the desired type?
// https://github.com/apache/arrow-rs/issues/8086
}
}
dt => {
    return Err(ArrowError::NotYetImplemented(format!(
- Filed [Variant] Casting errors behavior support in `variant_get` #8086 for the remaining types
Suggested change:
// https://github.com/apache/arrow-rs/issues/8086
return Err(ArrowError::NotYetImplemented(format!(
/// [Parquet Variant Shredding Spec]: https://github.com/apache/parquet-format/blob/master/VariantShredding.md#value-shredding
#[derive(Debug)]
pub enum ShreddingState {
    // TODO: add missing state where there is neither value nor typed_value
Suggested change:
// TODO: add missing state where there is neither value nor typed_value
// https://github.com/apache/arrow-rs/issues/8088
This implementation is far from complete, but I don't see any objections to the fundamental structure. So, let's merge this one in and work on improving the situation in follow on PRs
Whoops, I forgot to add some links. I will make a new PR for that
# Which issue does this PR close?
- Follow on to #8021

# Rationale for this change
Let's add links to the relevant tickets in the code so future readers who encounter it can find the relevant context

# What changes are included in this PR?
Add comments with links to tickets

# Are these changes tested?
N/A (just comments)

# Are there any user-facing changes?
No
# Which issue does this PR close?
We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax.
- Related to #8021

# Rationale for this change
As part of the review in #8021, @scovich and I were discussing how `VariantArray::value` should behave in the presence of nulls: #8021 (comment)

> Suggest to make this return Option<Variant> so callers don't have to check for null themselves.

I realized it might not be 100% clear that the existing convention in this crate was that `value()` methods did not check for nulls / return `Option`. I think we should document it better

# What changes are included in this PR?
Explicitly document that `value` methods do not check for nulls and explain what happens when they are used on null values

# Are these changes tested?
Yes, by CI

# Are there any user-facing changes?
Additional documentation. No behavior changes

---------

Co-authored-by: Kyle Barron <[email protected]>
Which issue does this PR close?
- `variant_get` kernel for shredded variants #7941

Rationale for this change
This has a proposal for how to structure shredded `VariantArray`s and the `variant_get` kernel. If people like the basic idea I will file some more tickets to track additional follow on work.
It is based on ideas from @carpecodeum in #7946 and @scovich in #7915
I basically took the tests from #7965 and the conversation with @scovich recorded from #7941 (comment) and I bashed out how this might look

What changes are included in this PR?
- `VariantArray` to represent shredding
- `variant_get` to support extracting paths as both variants and typed fields

Note there are many things that are NOT in this PR that I envision doing as follow on PRs:
- `Path`s
- `StringArray`, `StringViewArray`, etc

Are these changes tested?
Yes

Are there any user-facing changes?
New feature