[Variant] Add `variant_get` and Shredded VariantArray #8021
Conversation
Force-pushed from cd55763 to adf552c
Changed the title: `variant_get` for shredded variants → `variant_get` and Shredded variants
shredding_state: ShreddingState,
}

/// Variant arrays can be shredded in one of three states, encoded here
With this enum it is easier to know what the state of any particular array is
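For orientation, a rough sketch of what such an enum could look like, pieced together from the match arms quoted later in this review (the field types are assumptions, not the PR's exact definition):

use arrow::array::{ArrayRef, BinaryViewArray};

// Rough sketch only: variant names mirror the match arms quoted later in this
// review; the exact field types in the PR may differ.
#[derive(Debug)]
pub enum ShreddingState {
    /// metadata + value only: a classic, unshredded variant column
    Unshredded {
        metadata: BinaryViewArray,
        value: BinaryViewArray,
    },
    /// metadata + typed_value only: every row was shredded into a typed array
    FullyShredded {
        metadata: BinaryViewArray,
        typed_value: ArrayRef,
    },
    /// metadata + value + typed_value: rows that failed to shred fall back to value
    PartiallyShredded {
        metadata: BinaryViewArray,
        value: BinaryViewArray,
        typed_value: ArrayRef,
    },
}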
let metadata = self.metadata_field().as_binary_view().value(index);
let value = self.value_field().as_binary_view().value(index);
Variant::new(metadata, value)
pub fn value(&self, index: usize) -> Option<Variant> {
This is the first part -- accessing a single Variant value from a potentially shredded VariantArray
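As a usage sketch (assuming `VariantArray` implements the `Array` trait and `value` behaves as debated later in this thread; this is illustrative, not the PR's test code):

use arrow::array::Array;
use parquet_variant_compute::VariantArray;

/// Print each row of a possibly shredded VariantArray.
/// Like other Arrow arrays, callers should check nulls before reading a value.
fn print_rows(array: &VariantArray) {
    for i in 0..array.len() {
        if array.is_null(i) {
            println!("row {i}: null");
        } else {
            println!("row {i}: {:?}", array.value(i));
        }
    }
}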
@@ -1,180 +0,0 @@
// Licensed to the Apache Software Foundation (ASF) under one
I actually left most of this file the same but renamed it to variant_get/mod.rs, despite how github is rendering it
)
})?;

// Create the output writer based on the specified output options
The core design for get is here: different potential builders depending on the type of the output array, which gives us a place to put special code for each output array type. This is the OutputBuilder trait -- perhaps someone can come up with a better name.
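A hedged sketch of the OutputBuilder idea -- the method names and signatures below are guesses assembled from the snippets quoted in this review, not the PR's final trait definition:

use arrow::array::{ArrayRef, BinaryViewArray};
use arrow::error::Result;

pub trait OutputBuilder {
    /// input had only metadata + value (unshredded)
    fn unshredded(&self, metadata: &BinaryViewArray, value: &BinaryViewArray) -> Result<ArrayRef>;

    /// input had only metadata + typed_value (perfectly shredded)
    fn perfectly_shredded(&self, metadata: &BinaryViewArray, typed_value: &ArrayRef)
        -> Result<ArrayRef>;

    /// input had metadata + value + typed_value (imperfectly shredded)
    fn partially_shredded(
        &self,
        metadata: &BinaryViewArray,
        value: &BinaryViewArray,
        typed_value: &ArrayRef,
    ) -> Result<ArrayRef>;
}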
);
}

/// Shredding: extract a value as a VariantArray
here are the new shredded tests
/// Trait for Arrow primitive types that can be used in the output builder
///
/// This just exists to add a generic way to convert from Variant to the primitive type
pub(super) trait ArrowPrimitiveVariant: ArrowPrimitiveType {
This one is a bit complicated due to generics, but I think it will work for all primitive types (ints, floats, etc). It handles extracting strongly typed values from variants.
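Only the trait name and the ArrowPrimitiveType supertrait come from the quoted code; the method below is a guess at what "convert from Variant to the primitive type" could look like:

use arrow::datatypes::{ArrowPrimitiveType, Int32Type, Int64Type};
use parquet_variant::Variant;

trait ArrowPrimitiveVariant: ArrowPrimitiveType {
    /// Extract this type's native value from a Variant, if it holds one
    fn from_variant(variant: &Variant) -> Option<Self::Native>;
}

impl ArrowPrimitiveVariant for Int32Type {
    fn from_variant(variant: &Variant) -> Option<i32> {
        match variant {
            Variant::Int32(v) => Some(*v),
            _ => None,
        }
    }
}

impl ArrowPrimitiveVariant for Int64Type {
    fn from_variant(variant: &Variant) -> Option<i64> {
        match variant {
            Variant::Int64(v) => Some(*v),
            _ => None,
        }
    }
}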
_value_field: &BinaryViewArray,
typed_value: &ArrayRef,
) -> arrow::error::Result<ArrayRef> {
    // build up the output array element by element
This is the code that converts a shredded variant into a typed output -- It can probably be made quite a bit faster with some more deliberate vectorization, but I think it is relatively simple and functional
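A minimal sketch of that element-by-element loop for an Int32 output, with assumed argument names (the PR's generic version goes through the primitive trait above):

use std::sync::Arc;
use arrow::array::{Array, ArrayRef, AsArray, BinaryViewArray, Int32Builder};
use arrow::datatypes::Int32Type;
use arrow::error::Result;
use parquet_variant::Variant;

/// Build an Int32 output from an (im)perfectly shredded variant column:
/// prefer the shredded typed_value, otherwise decode the binary variant.
fn shredded_to_int32(
    metadata: &BinaryViewArray,
    value: &BinaryViewArray,
    typed_value: &ArrayRef,
) -> Result<ArrayRef> {
    let typed = typed_value.as_primitive::<Int32Type>();
    let mut builder = Int32Builder::with_capacity(typed.len());
    for i in 0..typed.len() {
        if !typed.is_null(i) {
            // row was shredded: copy the strongly typed value through
            builder.append_value(typed.value(i));
        } else {
            // fall back to the binary variant encoding for this row
            match Variant::new(metadata.value(i), value.value(i)) {
                Variant::Int32(v) => builder.append_value(v),
                // non-Int32 rows: error/null behavior is still being decided in the PR
                _ => builder.append_null(),
            }
        }
    }
    Ok(Arc::new(builder.finish()))
}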
It took a minute to realize that "partially shredded" here does not refer to the variant shredding spec definition:

An object is partially shredded when the `value` is an object and the `typed_value` is a shredded object. Writers must not produce data where both `value` and `typed_value` are non-null, unless the Variant value is an object.

and

The `value` column of a partially shredded object must never contain fields represented by the Parquet columns in `typed_value` (shredded fields). Readers may always assume that data is written correctly and that shredded fields in `typed_value` are not present in `value`.

This function seems to actually refer to "imperfectly shredded" columns, where we need a `value` column as a fallback for input values that fail to shred? As opposed to "perfectly shredded" columns that only need a `typed_value` column (or where the `value` column is all-null).
This is an excellent call -- I have updated the names to match the shredding specification
typed_value: &ArrayRef,
) -> arrow::error::Result<ArrayRef> {
    // if the types match exactly, we can just return the typed_value
    if typed_value.data_type() == self.as_type.data_type() {
here is the fast path if we have a perfectly shredded value
// in this case dispatch on the typed_value and
// TODO macro'ize this using downcast! to handle all other primitive types
// TODO(perf): avoid builders entirely (and write the raw variant directly as we know the metadata is the same)
let mut array_builder = VariantArrayBuilder::new(variant_array.len());
This is the logic to reconstruct `Variant`s from typed values
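Roughly, for an Int32 typed_value column the reconstruction loop looks like this (the builder method names are assumptions based on how VariantArrayBuilder is used elsewhere in this PR):

use arrow::array::{Array, ArrayRef, AsArray};
use arrow::datatypes::Int32Type;
use parquet_variant::Variant;
use parquet_variant_compute::VariantArrayBuilder;

/// Re-encode a shredded Int32 column back into variant values.
fn int32_to_variants(typed_value: &ArrayRef) -> VariantArrayBuilder {
    let ints = typed_value.as_primitive::<Int32Type>();
    let mut builder = VariantArrayBuilder::new(ints.len());
    for i in 0..ints.len() {
        if ints.is_null(i) {
            builder.append_null();
        } else {
            // the metadata for a primitive variant is trivial, so we can
            // simply re-encode the typed value as a Variant
            builder.append_variant(Variant::from(ints.value(i)));
        }
    }
    builder
}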
_metadata: &BinaryViewArray,
_value_field: &BinaryViewArray,
) -> arrow::error::Result<ArrayRef> {
    let mut builder = VariantArrayBuilder::new(variant_array.len());
This is the case that is handled on main
Force-pushed from adf552c to 3ba9fe0
Changed the title: `variant_get` and Shredded variants → `variant_get` and Shredded VariantArray
@scovich @carpecodeum and @Samyak2 I think this PR is ready for review
WIP review
let met = self.metadata_ref.slice(offset, length);
let val = self.value_ref.slice(offset, length);
let inner = self.inner.slice(offset, length);
let shredding_state = self.shredding_state.slice(offset, length);
Arc::new(Self {
can we avoid unnecessary allocations for common cases, like -
let shredding_state = match (&self.shredding_state, offset, length) {
// Fast path: no slice needed for full array
(state, 0, len) if len == self.len() => state.clone(),
// Fast path: uniform shredding doesn't need slicing
(ShreddingState::None, _, _) => ShreddingState::None,
(ShreddingState::All, _, _) => ShreddingState::All,
// Only slice for mixed case
(ShreddingState::Mixed(bitmap), offset, length) => {
ShreddingState::Mixed(bitmap.slice(offset, length))
}
};
We could definitely add a fast path for slicing the whole array.
I am not quite sure what ShreddingState::All and ShreddingState::Mixed are supposed to represent
Out of curiosity, would we actually expect a redundant `slice` call to be notably more expensive than `clone`? Seems like they'd do the same thing?
For example, `BooleanArray::slice` is:
pub fn slice(&self, offset: usize, length: usize) -> Self {
Self {
values: self.values.slice(offset, length),
nulls: self.nulls.as_ref().map(|n| n.slice(offset, length)),
}
}
with `BooleanBuffer::slice`:
pub fn slice(&self, offset: usize, len: usize) -> Self {
assert!(
offset.saturating_add(len) <= self.len,
"the length + offset of the sliced BooleanBuffer cannot exceed the existing length"
);
Self {
buffer: self.buffer.clone(),
offset: self.offset + offset,
len,
}
}
Assuming the compiler inlines things as aggressively as it usually does, it seems like the net difference would be the bounds check and other offset arithmetic.
I agree I don't expect any measurable performance improvement
This is nice!
I only had one concern around paths. Other comments are nitpicks.
/// returns the non-null element at index as a Variant
fn typed_value_to_variant(typed_value: &ArrayRef, index: usize) -> Variant {
    match typed_value.data_type() {
        DataType::Int32 => {
            let typed_value = typed_value.as_primitive::<Int32Type>();
            Variant::from(typed_value.value(index))
        }
        // todo other types here
        _ => {
            todo!(); // Unsupported typed_value type
        }
    }
}
This could be useful as a (public) function on its own. For example, this could be used to cast columns to variant like in some_column::variant
This is a great idea -- I am working to pull it into its own cast_to_variant function
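A hypothetical sketch of what that kernel could look like -- the name cast_to_variant comes from the comment above, but the signature and the builder method names are assumptions, not code from this PR:

use arrow::array::{Array, AsArray};
use arrow::datatypes::{DataType, Int32Type};
use arrow::error::ArrowError;
use parquet_variant::Variant;
use parquet_variant_compute::{VariantArray, VariantArrayBuilder};

/// Convert a typed Arrow array into a VariantArray, row by row.
/// Only Int32 is shown; a real kernel would cover all supported types.
fn cast_to_variant(input: &dyn Array) -> Result<VariantArray, ArrowError> {
    let mut builder = VariantArrayBuilder::new(input.len());
    match input.data_type() {
        DataType::Int32 => {
            let input = input.as_primitive::<Int32Type>();
            for i in 0..input.len() {
                if input.is_null(i) {
                    builder.append_null();
                } else {
                    builder.append_variant(Variant::from(input.value(i)));
                }
            }
        }
        dt => {
            return Err(ArrowError::NotYetImplemented(format!(
                "cast_to_variant for {dt}"
            )))
        }
    }
    Ok(builder.build())
}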
I spent some time on this, and I found it was not as simple as I would have liked. Thus I have filed a ticket to track the idea
I filed a ticket here
I will pull that code out into its own PR for easier review as well
Note I couldn't quite figure out how to reuse that kernel code in this PR
That's unfortunate. Maybe a separate issue for that?
I thought more about this last night -- the reason I think we'll need special code other than `cast_to_variant` is that `cast_to_variant` does the downcast to the typed array once, and then loops with the downcast array, whereas this code needs to downcast once and then convert 🤔 maybe someone can figure out something clever with macros
I had come up with some ideas before I remembered the `downcast_to_primitive!` macro, but I think that would be the approach to take. Basically, a macro that just does the array casting and allows the caller to inject whatever they want. This site here would inject a scalar lookup, while the cast kernel would inject a for-loop.
Things get a lot trickier if we want to eliminate references to data types, tho, as the `downcast_to_primitive!` experiment in the other PR showed. We need a lot of homogeneity, or at least to know very clearly what data types are not covered so we can add specific match arms for them.
Basically, a macro that just does the array casting and allows the caller to inject whatever they want. This site here would inject a scalar lookup, while the cast kernel would inject a for-loop.
I think this is a good idea. I am not sure we can use `downcast_to_primitive` as it dispatches based on the underlying primitive type (e.g. you get an `i128` for a decimal, but don't get the precision and scale). Maybe someone else can figure out how to make it work.
We could maybe make our own variant-specific macro though 🤔
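For example, a hypothetical variant-specific macro along those lines might look like this (arm list abbreviated; nothing here is from the PR):

use arrow::array::AsArray;
use arrow::datatypes::{DataType, Int32Type, Int64Type};
use parquet_variant::Variant;

// The macro owns the DataType match and downcast; the caller injects the
// per-array expression (a scalar lookup here, a full loop in a cast kernel).
// Decimals and similar types would need dedicated arms because they carry
// extra parameters (precision/scale) beyond the native type.
macro_rules! with_typed_value {
    ($array:expr, |$typed:ident| $body:expr, $otherwise:expr) => {
        match $array.data_type() {
            DataType::Int32 => {
                let $typed = $array.as_primitive::<Int32Type>();
                $body
            }
            DataType::Int64 => {
                let $typed = $array.as_primitive::<Int64Type>();
                $body
            }
            _ => $otherwise,
        }
    };
}

/// Example of the "scalar lookup" injection this thread talks about.
fn typed_value_at(typed_value: &dyn arrow::array::Array, index: usize) -> Option<Variant> {
    with_typed_value!(
        typed_value,
        |arr| Some(Variant::from(arr.value(index))),
        None
    )
}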
// if the typed value is null, decode the variant and extract the value
if typed_value.is_null(i) {
    // todo follow path
I have one concern with this. I think the pathing should be done independently of the output builder. Paths are only valid on objects/lists inside the variant. I don't see why every type's output builder needs to be concerned with the path.
Even for shredded objects, we can bring the nested value/typed_value fields up and pass them along like a top-level shredded variant array.
What do you think? Let me know if I missed something (I wasn't actively following the discussions around this).
I think you are probably right that the path handling should be handled elsewhere. I haven't worked out exactly how extracting paths would be implemented. Maybe that would be a good follow-on PR.
Why would there be a path at all for this primitive output builder, sorry?
Seems like that only matters for array and object builders?
Ah, I think I see -- for a strongly-typed builder, the caller could do all the pathing and just pass in the correct leaf column here. But this is variant, so we'd actually need to extract the variant leaf value (using a path) on a per-row basis. We'd either have to do all the pathing here (recursively), or the caller would have to extract the struct/array once and recurse on its fields. And the latter would require a row-oriented builder, where this builder seems to be column-oriented.
We'd either have to do all the pathing here (recursively), or the caller would have to extract the struct/array once and recurse on its fields.
Yes, I was imagining that this would do the pathing here on each row -- I am not sure what you mean by extract the struct array (as I don't think we can extract variants as structs in general, as the structures vary row by row)
We'd either have to do all the pathing here (recursively), or the caller would have to extract the struct/array once and recurse on its fields.
Yes, I was imagining that this would do the pathing here on each row -- I am not sure what you mean by extract the struct array (as I don't think we can extract variants as structs in general, as the structures vary row by row)
Right, once we get into "true" (unshredded) variant data, we are stuck going row by row. But we can handle the pathing for the (possibly empty) prefix of perfectly shredded columns on a column-wise basis. Identify them when constructing the builders, and then just "snap" them into place when finalizing the builder. No per-row shenanigans needed. If a path isn't perfectly shredded, then the builder needs to take the variant column as input and finish the pathing row by row.
For example, suppose that the user requested path `v:a.b.c.d::INT`, and that `a` and `b` were shredded but not `c`. Then the corresponding builder would take `v.typed_value.a.typed_value.b.value` as its input column, with `c.d::INT` as the path to extract on a per-row basis. And then its finalizer would return the resulting (now shredded) `ArrayRef` for the parent builder for `v:a.b` to snap into place.
Meanwhile, if the user also requested `v:a.b.x.y::INT` which happens to be perfectly shredded, then the corresponding "builder" would take `v.typed_value.a.typed_value.b.typed_value.x.typed_value.y.typed_value` as its input column. It would not need to extract anything (so its per-row operation is a no-op) and its finalizer just needs to return the unchanged input `ArrayRef` so the parent builder for `v:a.b` can "snap" it into the correct place while assembling the `StructArray` that represents `v:a.b`.
Sounds good -- I will be honest, I didn't follow the entire thing but it sounds like it is on the right track to me.
Given the length and size of this PR and the comments, I plan to merge it now and I think we'll end up working these questions out as we implement object shredding, under the aegis of
Whoop, I somehow missed the notify. I'll try to take a look in the next day or two!
No worries -- I'll address @carpecodeum and @Samyak2's comments in the meantime
Co-authored-by: Samyak Sarnayak <[email protected]>
Ping or request a review when ready?
Force-pushed from 222eb62 to 924155c
I think this is ready for review - it has not substantially changed
To be clear, my ideal outcome would be:
Running out of time for today, so will flush what little I have so far.
Tomorrow is another day.
let Some(metadata) = metadata_field.as_binary_view_opt() else {
    return Err(ArrowError::NotYetImplemented(format!(
aside: Out of curiosity, do we have any style/preference guidelines for
let Some(foo) = foo_opt else {
return Err(...);
};
vs.
let foo = foo_opt.ok_or_else(|| {
    ...
})?;
?
The former seems both more direct and less verbose, at least in this case?
Maybe `ok_or_else` is mostly (only?) useful as part of a bigger monadic chain?
I personally prefer this pattern for the reasons you suggest. However, I don't know of anything formal we have discussed or agreed on
let Some(foo) = foo_opt else {
return Err(...);
};
match &self.shredding_state {
    ShreddingState::Unshredded { metadata, value } => {
        Variant::new(metadata.value(index), value.value(index))
    }
    ShreddingState::FullyShredded {
        metadata: _,
        typed_value,
    } => {
        if typed_value.is_null(index) {
            Variant::Null
        } else {
            typed_value_to_variant(typed_value, index)
        }
    }
    ShreddingState::PartiallyShredded {
        metadata,
        value,
        typed_value,
    } => {
        if typed_value.is_null(index) {
            Variant::new(metadata.value(index), value.value(index))
        } else {
            typed_value_to_variant(typed_value, index)
        }
    }
}
Part of me wonders whether this "shredding state" enum is actually helping, vs. just storing `value` and `typed_value` as options? Especially since the shredding spec seems to suggest that the none-none case is a valid combination?

required group measurement (VARIANT) {
  required binary metadata;
  optional binary value;
  optional int64 typed_value;
}

and

(the file should be able to omit both physical columns, in the case of a perfectly shredded all-null logical column)
Suggested change:
// Always prefer typed_value over value (if present).
match self.typed_value.as_ref().and_then(|tv| typed_value_to_variant(tv, index)) {
    Some(typed_value) => typed_value,
    None => self.value.as_ref().map_or(
        Variant::Null,
        |v| Variant::new(self.metadata.value(index), v.value(index)),
    ),
}
(this assumes that `typed_value_to_variant` includes the null check and returns `Option<Variant>`)
Part of me wonders whether this "shredding state" enum is actually helping, vs. just storing `value` and `typed_value` as options?

The reason I first introduced the enum was that in `variant_get` I needed to dispatch to different methods based on the type of shredding, and I found myself really wanting to have names for them rather than just the combinations of value and typed value.
Another reason was that I ended up with redundant error checking that was in `VariantArray::try_new` (though if perfectly null arrays are allowed, maybe this becomes irrelevant)
Especially since the shredding spec seems to suggest that the none-none case is a valid combination?
I personally think having names for the different shredding types makes working with the code easier, even if all 4 combinations are valid.
I think you are right, however, that the shredding spec allows both to be None. I double checked the Arrow Proposal as it also implies both being null is permitted:
- An optional field named value that is binary, large_binary, or binary_view
- An optional field named typed_value which can be any primitive type or be a
My proposal is to leave the ShreddingState enum in and I will file a follow on ticket to support the none-none case
The reason I first introduced the enum was that in `variant_get` I needed to dispatch to different methods based on the type of shredding, and I found myself really wanting to have names for them rather than just the combinations of value and typed value.
I guess... but it seems like there's so much code just managing the resulting enum variants that I question whether it's a net win. Like the getters for metadata, value, etc. or the above code that shrinks from 25 lines to seven while handling more cases.
Clearly either way can be made correct, tho, so I guess it's a matter of preference.
yeah, I agree it is a matter of preference. If someone else feels strongly, I am not opposed to changing it
self.shredding_state.metadata_field()
}

fn find_value_field(array: &StructArray) -> Option<ArrayRef> {
    array.column_by_name("value").cloned()
/// Return a reference to the value field of the `StructArray`
pub fn value_field(&self) -> Option<&BinaryViewArray> {
    self.shredding_state.value_field()
}

/// Return a reference to the metadata field of the [`StructArray`]
pub fn metadata_field(&self) -> &ArrayRef {
    // spec says fields order is not guaranteed, so we search by name
    &self.metadata_ref
/// Return a reference to the typed_value field of the `StructArray`, if present
pub fn typed_value_field(&self) -> Option<&ArrayRef> {
    self.shredding_state.typed_value_field()
These become trivial if we ditch the `ShreddingState` enum
}

/// returns the non-null element at index as a Variant
fn typed_value_to_variant(typed_value: &ArrayRef, index: usize) -> Variant {
Suggest to make this return `Option<Variant>` so callers don't have to check for null themselves.
Suggested change:
fn typed_value_to_variant(typed_value: &ArrayRef, index: usize) -> Variant {
    if typed_value.is_null(index) {
        return None;
    }
other Arrow Array apis don't return `Option`, they instead return the value directly, requiring you to check `is_null` instead -- see https://docs.rs/arrow/latest/arrow/array/struct.PrimitiveArray.html#method.value for example
/// Consistently with other Arrow arrays types, this API requires you to
/// check for nulls first using [`Self::is_valid`].
I think the reason is to allow a better chance at LLVM vectorizing the code, which I don't think is likely relevant here.
We could make the Variant API deviate from this pattern (it is already different in several other ways) and return `Option<Variant>`.
other Arrow Array apis don't return `Option`, they instead return the value directly, requiring you to check `is_null` instead
For public API -- 100% agree. This is a private internal API tho, so it seems like we have room to do what we think makes the code simple/maintainable. Pulling important work inside the method instead of requiring all callers to remember it seems like a good example of that.
We can always change it if we discover it hurts performance or readability.
I think the reason is to allow a better chance at LLVM vectorizing the code
AFAIK, arrow normally requires all array entries to be physically valid, even when logically null. That way, one can perform columnar operations blindly and then just use the null masks to ignore the unwanted values after the fact. Here, we're accessing `is_null` and `value` both inside the loop -- and the latter conditionally -- so I'd be very surprised if LLVM was willing to inject vectorization code that requires touching values the code said not to touch.
For public API -- 100% agree. This is a private internal API tho, so it seems like we have room to do what we think makes the code simple/maintainable. Pulling important work inside the method instead of requiring all callers to remember it seems like a good example of that.
Sorry -- I wasn't clear -- the reason I was talking about `VariantArray::value` is that it is the only caller of `typed_value_to_variant`, so if we return `Option` from this function, we would just be stuck at the same place 🤔
Here, we're accessing is_null and value both inside the loop -- and the latter conditionally -- so I'd be very surprised if LLVM was willing to inject vectorization code that requires touching values the code said not to touch.
Yeah I agree I don't think it will make any difference for performance with Variants. The primary reason in my mind is to be consistent with other APIs.
I think in an earlier version of this PR I actually had changed `value()` to return `Option`. Maybe changing the signature is a good idea 🤔
match typed_value.data_type() {
    DataType::Int32 => {
        let typed_value = typed_value.as_primitive::<Int32Type>();
        Variant::from(typed_value.value(index))
Good use for downcast_primitive_array macro?
Suggested change:
downcast_primitive_array! {
    typed_value => Some(Variant::from(typed_value.value(index))),
I couldn't make that work as it needs more information than just the native type (e.g. the precision/scale for decimals).
@superserious-dev's `cast_conversion` macro I think can be adapted eventually to handle this case
(_metadata_field, None, None) => Err(ArrowError::InvalidArgumentError(String::from(
    "VariantArray has neither value nor typed_value field",
I'm pretty sure the shredding spec allows this case. It corresponds to a perfectly shredded all-null column where the writer chose to omit both of them as unnecessary.
(see other comment above)
Makes sense -- I'll file a follow on ticket to add support
Co-authored-by: Ryan Johnson <[email protected]>
looks great!
Dropped a bunch of thoughts about nesting and pathing. Hopefully helpful.
&self.value_ref
/// Variant arrays can be shredded in one of three states, encoded here
#[derive(Debug)]
pub enum ShreddingState {
For a future `FullyShreddedAllNull` variant (neither `value` nor `typed_value` present), would we still need to store the `metadata` even tho it's never actually used? 🤔
I am not sure
I filed a ticket to track adding AllNull:
// Dispatch based on the shredding state of the input variant array
match variant_array.shredding_state() {
One (big) complexity here -- the requested type could differ wildly from how the input data are currently shredded.
Pathing-wise, we'd need to drill as deeply as possible into the variant array's physical shredding schema, following the requested path. In the happy path, the requested path was shredded and we just have to deal with the corresponding `(value, typed_value)` pair of columns.
Example
Suppose we're trying to extract `v:a.b::INT` from a `VariantArray` with the following physical layout:
v: VARIANT {
metadata: BINARY,
value: BINARY,
typed_value: STRUCT {
a: SHREDDED_VARIANT {
value: BINARY,
typed_value: STRUCT {
b: SHREDDED_VARIANT {
value: BINARY,
typed_value: INT,
},
},
},
},
}
Then we're actually doing `v.typed_value.a.typed_value.b::INT` -- the builder only has to worry about reconciling `v.typed_value.a.typed_value.b.value` with `v.typed_value.a.typed_value.b.typed_value` (which can be passed in directly at construction time)
This case has an even easier example, if the caller requested `v:a` (an untyped variant extract) -- in that case we simply combine `v.typed_value.a` with `v.metadata` into a new VariantArray -- no casting needed because `VariantArray` can capture the full structure of `v:a` no matter how it was physically shredded.
In a less-happy path, we will hit a partially shredded object where the requested path element is not available in `typed_value`, and we must go row-oriented from there (with the remaining path extraction performed on a per-row basis):
Example
Suppose we're trying to extract `v:a.b::INT` from a `VariantArray` with the following physical layout:
v: VARIANT {
metadata: BINARY,
value: BINARY,
typed_value: STRUCT {
a: SHREDDED_VARIANT {
value: BINARY,
typed_value: STRUCT {
x: SHREDDED_VARIANT {
value: BINARY,
typed_value: INT,
},
},
},
},
}
Then we're actually doing `v.typed_value.a:b::INT` -- we have to pass `v.typed_value.a.value` into the builder, which has to extract field `b` from every row.
In an unhappy path, the partial shredding of the input disagrees with the output's requested shredding, and we have to build up a struct from both shredded and unshredded paths.
Example
Suppose the requested schema asks for both `v:a.b::INT` and `v:a.x::INT`, when the VariantArray's layout is:
v: VARIANT {
metadata: BINARY,
value: BINARY,
typed_value: STRUCT {
a: SHREDDED_VARIANT {
value: BINARY,
typed_value: STRUCT {
b: SHREDDED_VARIANT {
value: BINARY,
typed_value: INT,
},
},
},
},
}
Now we have to actually build the struct `v.a` with two fields by unioning shredded bits (`v.typed_value.a.typed_value.b::INT`) and unshredded bits (`v.typed_value.a.value:x::INT`).
In a really unhappy case, the caller requested a shredded variant type that disagrees with the input data's shredding. So some fields have to be shredded and others unshredded.
Example
Suppose we're trying to extract both `v:a` as a partially shredded variant with `x` shredded and `b` not shredded... from a `VariantArray` where `b` is shredded and `x` is not shredded:
v: VARIANT {
metadata: BINARY,
value: BINARY,
typed_value: STRUCT {
a: SHREDDED_VARIANT {
value: BINARY,
typed_value: STRUCT {
b: SHREDDED_VARIANT {
value: BINARY,
typed_value: INT,
},
},
},
},
}
Then not only would we need to path into `v.typed_value.a.typed_value.b::VARIANT` (which requires unshredding it and which produces an `o.typed_value.a.value_b` column), we'd also have to path into `v.typed_value.a.value::INT` (which requires shredding it and which contributes to both `o.typed_value.a.value_x` and `o.typed_value.a.typed_value` columns). And then we have to variant-union the `o.typed_value.a.value_b` and `o.typed_value.a.value_x` columns into a single `o.typed_value.a.value` column, and pack everything into a new `VariantArray`.
Yeah, exactly, this I think is the key challenge of this code -- each different type of output needs to handle all the possible input shredding types as well.
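To make the first example layout above concrete in Arrow terms, the corresponding DataType could be assembled roughly like this (a sketch; the BinaryView choices and field nullability here are assumptions based on the Arrow shredding proposal mentioned earlier, not code from this PR):

use arrow::datatypes::{DataType, Field, Fields};

/// value + optional typed_value pair used at each shredding level
fn shredded_pair(typed_value: Option<DataType>) -> Vec<Field> {
    let mut fields = vec![Field::new("value", DataType::BinaryView, true)];
    if let Some(dt) = typed_value {
        fields.push(Field::new("typed_value", dt, true));
    }
    fields
}

/// v { metadata, value, typed_value: { a: { value, typed_value: { b: { value, typed_value: Int32 } } } } }
fn example_layout() -> DataType {
    let b = DataType::Struct(Fields::from(shredded_pair(Some(DataType::Int32))));
    let a_typed = DataType::Struct(Fields::from(vec![Field::new("b", b, true)]));
    let a = DataType::Struct(Fields::from(shredded_pair(Some(a_typed))));
    let v_typed = DataType::Struct(Fields::from(vec![Field::new("a", a, true)]));
    let mut v = vec![Field::new("metadata", DataType::BinaryView, false)];
    v.extend(shredded_pair(Some(v_typed)));
    DataType::Struct(Fields::from(v))
}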
/// if `as_type` is None, the returned array will itself be a VariantArray.
///
/// if `as_type` is `Some(type)` the field is returned as the specified type.
pub as_type: Option<FieldRef>,
Presumably `None` and `Some(VariantType)` are equivalent? If so, should we just default-initialize `as_type: FieldRef` to `VariantType`?
Oh... but variant is not a single type... every possible shredding (or lack thereof) is a different physical `DataType`. So by passing `None`, we request to keep whatever structure the underlying path already has; by passing `Some(... variant type ...)` we're requesting to [un|re]shred as necessary to make it fit the new shredding schema.
/// Construct options to get the specified path as a variant.
pub fn new_with_path(path: VariantPath<'a>) -> Self {
I guess path is optional (defaults to empty) because a leaf extract of a perfectly shredded variant doesn't need any internal pathing -- whoever constructs the builder just passes the leaf columns and the builder only needs to cast them appropriately.
Currently, the output builder seems to be fully column-oriented -- it assumes that all values for each leaf column are extracted in a tight loop. This can work for primitive builders, but nested builders will quickly run into pathing and efficiency problems.
I think we'll need to do something similar to the JSON builder, with a row-oriented approach where each level of a nested builder receives an already-constructed `Variant` for the current row and does a field extract for each child builder; child builders can then cast the result directly or recurse further as needed (based on their own type). And then the top-level builder call would construct a `Variant` for each row to kick-start the process.
But see the other comment -- to the extent that the shredding aligns nicely, we can hoist a subset of this per-row pathing of the append method up to columnar pathing of the builder's constructor and finalizer.
I think this aligns with my thinking and is nicely described -- I will use it as the base for the next set of tickets perhaps
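A minimal sketch of that row-oriented shape, just to make the idea concrete (the trait name and signatures are hypothetical, not part of this PR):

use arrow::array::ArrayRef;
use arrow::error::ArrowError;
use parquet_variant::Variant;

/// Each level of a nested output receives the Variant already navigated to its
/// level for the current row; nested implementations extract a field/element
/// per child builder and recurse, while leaves cast to their target type.
trait RowOrientedBuilder {
    /// Append one row's value (None when the row is null at this level)
    fn append_row(&mut self, value: Option<&Variant>) -> Result<(), ArrowError>;

    /// Finish and return the built column for this level
    fn finish(&mut self) -> Result<ArrayRef, ArrowError>;
}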
Co-authored-by: Ryan Johnson <[email protected]>
…rrow-rs into alamb/variant_shredding_strawman
I have gone through the comments in this PR -- thank you @carpecodeum @Samyak2 and @scovich. I would very much like to unblock other people working on various parts of variant_get in parallel. Thus, my plans now are:
// todo for other primitive types
// todo for other primitive types
Ok, I filed a bunch more tickets now
if typed_value.data_type() == self.as_type.data_type() {
    Ok(typed_value.clone())
} else {
    // TODO: try to cast the typed_value to the desired type?
Suggested change:
// TODO: try to cast the typed_value to the desired type?
// https://github.com/apache/arrow-rs/issues/8086
}
}
dt => {
    return Err(ArrowError::NotYetImplemented(format!(
- Filed [Variant] Casting errors behavior support in `variant_get` #8086 for the remaining types
Suggested change:
// https://github.com/apache/arrow-rs/issues/8086
return Err(ArrowError::NotYetImplemented(format!(
/// [Parquet Variant Shredding Spec]: https://github.com/apache/parquet-format/blob/master/VariantShredding.md#value-shredding
#[derive(Debug)]
pub enum ShreddingState {
    // TODO: add missing state where there is neither value nor typed_value
Suggested change:
// TODO: add missing state where there is neither value nor typed_value
// https://github.com/apache/arrow-rs/issues/8088
This implementation is far from complete, but I don't see any objections to the fundamental structure. So, let's merge this one in and work on improving the situation in follow on PRs
Whoops, I forgot to add some links. I will make a new PR for that
# Which issue does this PR close?
- Follow on to #8021

# Rationale for this change
Let's add links to the relevant tickets in the code so future readers who encounter it can find the relevant context

# What changes are included in this PR?
Add comments with links to tickets

# Are these changes tested?
N/A (just comments)

# Are there any user-facing changes?
No
# Which issue does this PR close?
We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax.
- Related to #8021

# Rationale for this change
As part of the review in #8021, @scovich and I were discussing how `VariantArray::value` should behave in the presence of nulls: #8021 (comment)

> Suggest to make this return Option<Variant> so callers don't have to check for null themselves.

I realized it might not be 100% clear that the existing convention in this crate was that `value()` methods did not check for nulls / return `Option`. I think we should document it better

# What changes are included in this PR?
Explicitly document that `value` methods do not check for nulls and explain what happens when they are used on null values

# Are these changes tested?
Yes, by CI

# Are there any user-facing changes?
Additional documentation. No behavior changes

---------

Co-authored-by: Kyle Barron <[email protected]>
Which issue does this PR close?
- `variant_get` kernel for shredded variants #7941

Rationale for this change
This has a proposal for how to structure shredded `VariantArray`s and the `variant_get` kernel. If people like the basic idea I will file some more tickets to track additional follow on work.
It is based on ideas from @carpecodeum in #7946 and @scovich in #7915
I basically took the tests from #7965 and the conversation with @scovich recorded from #7941 (comment) and I bashed out how this might look

What changes are included in this PR?
- `VariantArray` to represent shredding
- `variant_get` to support extracting paths as both variants and typed fields

Note there are many things that are NOT in this PR that I envision doing as follow on PRs:
- `Path`s
- `StringArray`, `StringViewArray`, etc

Are these changes tested?
Yes

Are there any user-facing changes?
New feature