Conversation

alamb
Contributor

@alamb alamb commented Jul 29, 2025

Which issue does this PR close?

Rationale for this change

This PR has a proposal for how to structure shredded VariantArrays and the variant_get kernel.

If people like the basic idea, I will file some more tickets to track additional follow-on work.

It is based on ideas from @carpecodeum in #7946 and @scovich in #7915

I basically took the tests from #7965 and the conversation with @scovich recorded in #7941 (comment), and bashed out how this might look

What changes are included in this PR?

  1. Update VariantArray to represent shredding
  2. Add code to variant_get to support extracting paths as both variants and typed fields
  3. A pattern that I think can represent shredding and extraction
  4. Tests for same

Note there are many things that are NOT in this PR that I envision doing as follow-on PRs:

  1. Support for and implementation of Paths
  2. Support for shredded objects
  3. Support shredded lists
  4. Support nested objects / lists
  5. Full casting support
  6. Support for other output types: StringArray, StringViewArray, etc
  7. Many performance improvements

Are these changes tested?

Yes

Are there any user-facing changes?

New feature

@github-actions bot added the parquet (Changes to the parquet crate) label on Jul 29, 2025
@alamb force-pushed the alamb/variant_shredding_strawman branch from cd55763 to adf552c on August 2, 2025 16:07
@alamb alamb marked this pull request as draft August 2, 2025 16:08
@alamb changed the title from "WIP [Variant] Strawman variant_get for shredded variants" to "WIP [Variant] Strawman variant_get and Shredded variants" on Aug 2, 2025
shredding_state: ShreddingState,
}

/// Variant arrays can be shredded in one of three states, encoded here
Contributor Author

With this enum it is easier to know what the state of any particular array is
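For illustration, the three states might look roughly like this (a sketch reconstructed from the accessors and the match later in this thread; the exact field types are assumptions):

use arrow::array::{ArrayRef, BinaryViewArray};

// Sketch only: names follow this PR, field types are assumed
pub enum ShreddingState {
    /// not shredded: only `metadata` + `value` are present
    Unshredded { metadata: BinaryViewArray, value: BinaryViewArray },
    /// every row is shredded into `typed_value`; there is no `value` fallback
    FullyShredded { metadata: BinaryViewArray, typed_value: ArrayRef },
    /// `typed_value` holds shredded rows and `value` is the per-row fallback
    PartiallyShredded { metadata: BinaryViewArray, value: BinaryViewArray, typed_value: ArrayRef },
}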

let metadata = self.metadata_field().as_binary_view().value(index);
let value = self.value_field().as_binary_view().value(index);
Variant::new(metadata, value)
pub fn value(&self, index: usize) -> Option<Variant> {
Contributor Author

This is the first part -- accessing a single Variant value from a potentially shredded VariantArray
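A rough usage sketch, assuming the Option-returning signature shown in this hunk (the exact signature was still being discussed):

for i in 0..variant_array.len() {
    // each row comes back as a Variant whether it was stored shredded
    // (typed_value) or unshredded (metadata + value bytes)
    if let Some(v) = variant_array.value(i) {
        println!("row {i}: {v:?}");
    }
}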

@@ -1,180 +0,0 @@
// Licensed to the Apache Software Foundation (ASF) under one
Contributor Author

I actually left most of this file the same but renamed it to variant_get/mod.rs, despite how github is rendering it

)
})?;

// Create the output writer based on the specified output options
Contributor Author

The core design for get is here: different builders depending on the type of the output array, which gives us a place to put special code for each output array type. This is the OutputBuilder trait -- perhaps someone can come up with a better name.
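For illustration, the rough shape could be something like this (method names here are a sketch, not the exact trait in this PR):

use arrow::array::ArrayRef;
use arrow::error::Result;

// one implementation per supported output type (VariantArray, Int32, ...);
// each builder captures references to the input columns when constructed
trait OutputBuilder {
    /// append the row at `index` of the (possibly shredded) input
    fn append_row(&mut self, index: usize) -> Result<()>;
    /// produce the finished output array
    fn finish(self: Box<Self>) -> Result<ArrayRef>;
}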

);
}

/// Shredding: extract a value as a VariantArray
Contributor Author

here are the new shredded tests

/// Trait for Arrow primitive types that can be used in the output builder
///
/// This just exists to add a generic way to convert from Variant to the primitive type
pub(super) trait ArrowPrimitiveVariant: ArrowPrimitiveType {
Contributor Author

@alamb alamb Aug 2, 2025

this one is a bit complicated due to generics but I think it will work for all primitive types (ints, floats, etc)

It handles extracting strongly typed values from variants
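Roughly, I imagine something like this (the method name and the as_int32 accessor below are assumptions for the sketch):

use arrow::datatypes::{ArrowPrimitiveType, Int32Type};
use parquet_variant::Variant;

trait ArrowPrimitiveVariant: ArrowPrimitiveType {
    /// Try to extract this primitive's native value from a Variant,
    /// returning None if the variant holds an incompatible type
    fn from_variant(variant: &Variant<'_, '_>) -> Option<Self::Native>;
}

impl ArrowPrimitiveVariant for Int32Type {
    fn from_variant(variant: &Variant<'_, '_>) -> Option<i32> {
        variant.as_int32() // assumed accessor on Variant
    }
}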

_value_field: &BinaryViewArray,
typed_value: &ArrayRef,
) -> arrow::error::Result<ArrayRef> {
// build up the output array element by element
Contributor Author

This is the code that converts a shredded variant into a typed output -- It can probably be made quite a bit faster with some more deliberate vectorization, but I think it is relatively simple and functional
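Roughly, for a single concrete output type it does something like this (builder and accessor names are a sketch; the real code dispatches generically over primitive types):

use std::sync::Arc;
use arrow::array::{Array, ArrayRef, AsArray, BinaryViewArray, Int32Builder};
use arrow::datatypes::Int32Type;
use parquet_variant::Variant;

// Sketch: produce an Int32 output from a partially shredded input
fn shredded_to_int32(
    metadata: &BinaryViewArray,
    value: &BinaryViewArray,
    typed_value: &ArrayRef,
) -> ArrayRef {
    let typed_value = typed_value.as_primitive::<Int32Type>();
    let mut out = Int32Builder::with_capacity(typed_value.len());
    for i in 0..typed_value.len() {
        if !typed_value.is_null(i) {
            // row was shredded: copy the typed value straight through
            out.append_value(typed_value.value(i));
        } else {
            // fall back to decoding the variant bytes for this row
            let variant = Variant::new(metadata.value(i), value.value(i));
            out.append_option(variant.as_int32()); // as_int32 is an assumed accessor
        }
    }
    Arc::new(out.finish())
}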

Contributor

It took a minute to realize that "partially shredded" here does not refer to the variant shredding spec definition:

An object is partially shredded when the value is an object and the typed_value is a shredded object. Writers must not produce data where both value and typed_value are non-null, unless the Variant value is an object.

and

The value column of a partially shredded object must never contain fields represented by the Parquet columns in typed_value (shredded fields). Readers may always assume that data is written correctly and that shredded fields in typed_value are not present in value.

This function seems to actually refer to "imperfectly shredded" columns, where we need a value column as a fallback for input values that fail to shred? As opposed to "perfectly shredded" columns that only need a typed_value column (or where the value column is all-null).

Contributor Author

This is an excellent call -- I have updated the names to match the shredding specification

typed_value: &ArrayRef,
) -> arrow::error::Result<ArrayRef> {
// if the types match exactly, we can just return the typed_value
if typed_value.data_type() == self.as_type.data_type() {
Contributor Author

Here is the fast path when we have a perfectly shredded value

// in this case dispatch on the typed_value and
// TODO macro'ize this using downcast! to handle all other primitive types
// TODO(perf): avoid builders entirely (and write the raw variant directly as we know the metadata is the same)
let mut array_builder = VariantArrayBuilder::new(variant_array.len());
Contributor Author

This is the logic to reconstruct Variants from typed values
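Roughly (VariantArrayBuilder method names here are assumptions based on the snippet above, not the exact code):

// Sketch: rebuild a VariantArray when some rows were shredded into typed_value
let mut array_builder = VariantArrayBuilder::new(variant_array.len());
for i in 0..variant_array.len() {
    if typed_value.is_null(i) {
        array_builder.append_null(); // assumed builder method
    } else {
        // turn the strongly typed value back into a Variant for this row
        array_builder.append_variant(typed_value_to_variant(typed_value, i));
    }
}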

_metadata: &BinaryViewArray,
_value_field: &BinaryViewArray,
) -> arrow::error::Result<ArrayRef> {
let mut builder = VariantArrayBuilder::new(variant_array.len());
Contributor Author

This is the case that is handled on main

@alamb force-pushed the alamb/variant_shredding_strawman branch from adf552c to 3ba9fe0 on August 2, 2025 18:38
@alamb alamb marked this pull request as ready for review August 2, 2025 19:34
@alamb changed the title from "WIP [Variant] Strawman variant_get and Shredded variants" to "[Variant] Proposal for variant_get and Shredded VariantArray" on Aug 2, 2025
@alamb
Contributor Author

alamb commented Aug 2, 2025

@scovich @carpecodeum and @Samyak2 this PR I think is ready for review

Contributor

@carpecodeum carpecodeum left a comment

WIP review

let met = self.metadata_ref.slice(offset, length);
let val = self.value_ref.slice(offset, length);
let inner = self.inner.slice(offset, length);
let shredding_state = self.shredding_state.slice(offset, length);
Arc::new(Self {
Contributor

Can we avoid unnecessary allocations for common cases? Something like:

let shredding_state = match (&self.shredding_state, offset, length) {
    // Fast path: no slice needed for full array
    (state, 0, len) if len == self.len() => state.clone(),
    
    // Fast path: uniform shredding doesn't need slicing
    (ShreddingState::None, _, _) => ShreddingState::None,
    (ShreddingState::All, _, _) => ShreddingState::All,
    
    // Only slice for mixed case
    (ShreddingState::Mixed(bitmap), offset, length) => {
        ShreddingState::Mixed(bitmap.slice(offset, length))
    }
};

Contributor Author

We could definitely add a fast path for slicing the whole array.

I am not quite sure what ShreddingState::All and ShreddingState::Mixed are supposed to represent

Contributor

Out of curiosity, would we actually expect a redundant slice call to be notably more expensive than clone?
Seems like they'd do the same thing?

For example, BooleanArray::slice is:

    pub fn slice(&self, offset: usize, length: usize) -> Self {
        Self {
            values: self.values.slice(offset, length),
            nulls: self.nulls.as_ref().map(|n| n.slice(offset, length)),
        }
    }

with BooleanBuffer::slice:

    pub fn slice(&self, offset: usize, len: usize) -> Self {
        assert!(
            offset.saturating_add(len) <= self.len,
            "the length + offset of the sliced BooleanBuffer cannot exceed the existing length"
        );
        Self {
            buffer: self.buffer.clone(),
            offset: self.offset + offset,
            len,
        }
    }

Assuming the compiler inlines things as aggressively as it usually does, it seems like the net difference would be the bounds check and other offset arithmetic.

Contributor Author

I agree I don't expect any measurable performance improvement

Contributor

@Samyak2 Samyak2 left a comment

This is nice!
I only had one concern around paths. Other comments are nitpicks.

Comment on lines 311 to 323
/// returns the non-null element at index as a Variant
fn typed_value_to_variant(typed_value: &ArrayRef, index: usize) -> Variant {
    match typed_value.data_type() {
        DataType::Int32 => {
            let typed_value = typed_value.as_primitive::<Int32Type>();
            Variant::from(typed_value.value(index))
        }
        // todo other types here
        _ => {
            todo!(); // Unsupported typed_value type
        }
    }
}
Contributor

This could be useful as a (public) function on its own. For example, this could be used to cast columns to variant like in some_column::variant
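A hypothetical signature for that idea (not part of this PR; the name just mirrors the discussion below):

use arrow::array::Array;
use arrow::error::ArrowError;

/// Hypothetical kernel: convert any supported Arrow array into a VariantArray,
/// enabling casts like `some_column::variant`
pub fn cast_to_variant(input: &dyn Array) -> Result<VariantArray, ArrowError> {
    // one possible shape: dispatch on input.data_type() and append each row
    // to a VariantArrayBuilder via a typed_value_to_variant-style conversion
    todo!()
}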

Contributor Author

This is a great idea -- I am working to pull it into its own cast_to_variant function

Contributor Author

I spent some time on this, and I found it was not as simple as I would have liked. Thus I have filed a ticket to track the idea

I filed a ticket here

I will pull that code out into its own PR for easier review as well

Contributor Author

Note I couldn't quite figure out how to reuse that kernel code in this PR

Contributor

That's unfortunate. Maybe a separate issue for that?

Contributor Author

I thought more about this last night -- the reason I think we'll need special code other than cast_to_variant is that cast_to_variant does the downcast to the typed array once and then loops over the downcast array, whereas this code needs to downcast once and then convert a single element 🤔 maybe someone can figure out something clever with macros

Contributor

I had come up with some ideas before I remembered downcast_to_primitive! macro, but I think that would be the approach to take. Basically, a macro that just does the array casting and allows the caller to inject whatever they want. This site here would inject a scalar lookup, while the cast kernel would inject a for-loop.

Things get a lot trickier if we want to eliminate references to data types, tho, as the downcast_to_primitive! experiment in the other PR showed. We need a lot of homogeneity, or at least to know very clearly what data types are not covered so we can add specific match arms for them.

Contributor Author

Basically, a macro that just does the array casting and allows the caller to inject whatever they want. This site here would inject a scalar lookup, while the cast kernel would inject a for-loop.

I think this is a good idea. I am not sure we can use downcast_to_primitive as it dispatches based on the underlying primitive type (e.g. you get an i128 for a decimal, but don't get precision and scale). Maybe someone else can figure out how to make it work

We could maybe make our own variant-specific macro though 🤔
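One possible shape for such a macro (illustrative only; the name and arm list are made up for this sketch):

// the caller injects the per-array body: variant_get would inject a single
// row lookup, while a cast kernel could inject a loop over all rows
macro_rules! with_typed_value {
    ($array:expr, |$typed:ident| $body:expr) => {
        match $array.data_type() {
            DataType::Int32 => {
                let $typed = $array.as_primitive::<Int32Type>();
                $body
            }
            DataType::Int64 => {
                let $typed = $array.as_primitive::<Int64Type>();
                $body
            }
            dt => todo!("typed_value type {dt:?} not yet supported"),
        }
    };
}

// e.g. a scalar lookup at `index`:
// let v = with_typed_value!(typed_value, |arr| Variant::from(arr.value(index)));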

Contributor Author

That's unfortunate. Maybe a separate issue for that?


// if the typed value is null, decode the variant and extract the value
if typed_value.is_null(i) {
// todo follow path
Contributor

I have one concern with this. I think the path-ing should be done independent of the output builder. Paths are only valid on objects/lists inside the variant. I don't see why every type's output builder needs to be concerned with the path.

Even for shredded objects, we can bring the nested value/typed_value fields up and pass it along like a top-level shredded variant array.

What do you think? Let me know if I missed something (I wasn't actively following the discussions around this).

Contributor Author

I think you are probably right that the path handling should be handled elsewhere; I haven't worked out exactly how extracting paths would be implemented. Maybe that would be a good follow-on PR

Contributor

Why would there be a path at all for this primitive output builder, sorry?
Seems like that only matters for array and object builders?

Contributor

Ah, I think I see -- for a strongly-typed builder, the caller could do all the pathing and just pass in the correct leaf column here. But this is variant, so we'd actually need to extract the variant leaf value (using a path) on a per-row basis. We'd either have to do all the pathing here (recursively), or the caller would have to extract the struct/array once and recurse on its fields. And the latter would require a row-oriented builder where this builder seems to be column-oriented.

Contributor Author

We'd either have to do all the pathing here (recursively), or the caller would have to extract the struct/array once and recurse on its fields.

Yes, I was imagining that this would do the pathing here on each row -- I am not sure what you mean by extracting the struct/array (as I don't think we can extract variants as structs in general, as the structures vary row by row)

Contributor

We'd either have to do all the pathing here (recursively), or the caller would have to extract the struct/array once and recurse on its fields.

Yes, I was imagining that this would do the pathing here on each row -- I am not sure what you mean by extracting the struct/array (as I don't think we can extract variants as structs in general, as the structures vary row by row)

Right, once we get into "true" (unshredded) variant data, we are stuck going row by row. But we can handle the pathing for the (possibly empty) prefix of perfectly shredded columns on a column-wise basis. Identify them when constructing the builders, and then just "snap" them into place when finalizing the builder. No per-row shenanigans needed. If a path isn't perfectly shredded, then the builder needs to take the variant column as input and finish the pathing row by row.

Contributor

@scovich scovich Aug 7, 2025

For example, suppose that the user requested path v:a.b.c.d::INT, and that a and b were shredded but not c. Then the corresponding builder would take v.typed_value.a.typed_value.b.value as its input column, with c.d::INT as the path to extract on a per-row basis. And then its finalizer would return the resulting (now shredded) ArrayRef for the parent builder for v:a.b to snap into place.

Meanwhile, if the user also requested v:a.b.x.y::INT which happens to be perfectly shredded, then the corresponding "builder" would take v.typed_value.a.typed_value.b.typed_value.x.typed_value.y.typed_value as its input column. It would not need to extract anything (so its per-row operation is a no-op) and its finalizer just needs to return the unchanged input ArrayRef so the parent builder for v:a.b can "snap" it into the correct place while assembling the StructArray that represents v:a.b.

Contributor Author

Sounds good -- I will be honest, I didn't follow the entire thing, but it sounds like it is on the right track to me

Given the length and size of this PR and the comments, I plan to merge it now and I think we'll end up working these questions out as we implement object shredding, under the aegis of

@scovich
Contributor

scovich commented Aug 5, 2025

@scovich @carpecodeum and @Samyak2 this PR I think is ready for review

Whoop, I somehow missed the notify. I'll try to take a look in the next day or two!

@alamb
Contributor Author

alamb commented Aug 5, 2025

@scovich @carpecodeum and @Samyak2 this PR I think is ready for review

Whoop, I somehow missed the notify. I'll try to take a look in the next day or two!

No worries -- I'll address @carpecodeum's and @Samyak2's comments in the meantime

@scovich
Contributor

scovich commented Aug 5, 2025

@scovich @carpecodeum and @Samyak2 this PR I think is ready for review

Whoop, I somehow missed the notify. I'll try to take a look in the next day or two!

No worries -- I'll address @carpecodeum's and @Samyak2's comments in the meantime

Ping or request a review when ready?

@alamb force-pushed the alamb/variant_shredding_strawman branch from 222eb62 to 924155c on August 5, 2025 15:43
@alamb
Contributor Author

alamb commented Aug 5, 2025

Ping or request a review when ready?

I think this is ready for review - it has not substantially changed

@alamb
Contributor Author

alamb commented Aug 5, 2025

To be clear, my ideal outcome would be:

  1. We agree that the structure of this PR is ok (we can also change it in future PRs, but I want to make sure it looks ok before merging)
  2. I will file a bunch of follow on tickets to break down the work of filling out the variant feature set

Contributor

@scovich scovich left a comment

Running out of time for today, so will flush what little I have so far.

Tomorrow is another day.

Comment on lines +100 to 101
let Some(metadata) = metadata_field.as_binary_view_opt() else {
return Err(ArrowError::NotYetImplemented(format!(
Contributor

aside: Out of curiosity, do we have any style/preference guidelines for

let Some(foo) = foo_opt else {
    return Err(...);
};

vs.

let foo = foo_opt.ok_or_else(|| {
    ...
})?;

?

The former seems both more direct and less verbose, at least in this case?
Maybe ok_or_else is mostly (only?) useful as part of a bigger monadic chain?

Contributor Author

I personally prefer this pattern for the reasons you suggest. However, I don't know of anything formal we have discussed or agreed on

let Some(foo) = foo_opt else {
    return Err(...);
};

Comment on lines 169 to 194
match &self.shredding_state {
    ShreddingState::Unshredded { metadata, value } => {
        Variant::new(metadata.value(index), value.value(index))
    }
    ShreddingState::FullyShredded {
        metadata: _,
        typed_value,
    } => {
        if typed_value.is_null(index) {
            Variant::Null
        } else {
            typed_value_to_variant(typed_value, index)
        }
    }
    ShreddingState::PartiallyShredded {
        metadata,
        value,
        typed_value,
    } => {
        if typed_value.is_null(index) {
            Variant::new(metadata.value(index), value.value(index))
        } else {
            typed_value_to_variant(typed_value, index)
        }
    }
}
Contributor

Part of me wonders whether this "shredding state" enum is actually helping, vs. just storing value and typed_value as options? Especially since the shredding spec seems to suggest that the none-none case is a valid combination?

required group measurement (VARIANT) {
  required binary metadata;
  optional binary value;
  optional int64 typed_value;
}

and

[image from the Variant shredding spec]

(the file should be able to omit both physical columns, in the case of a perfectly shredded all-null logical column)

Suggested change (replace the match above with):

// Always prefer typed_value over value (if present).
match self.typed_value.as_ref().and_then(|tv| typed_value_to_variant(tv, index)) {
    Some(typed_value) => typed_value,
    None => self.value.as_ref().map_or(Variant::Null, |v| {
        Variant::new(self.metadata.value(index), v.value(index))
    }),
}

(this assumes that typed_value_to_variant includes the null check and returns Option<Variant>)

Contributor Author

Part of me wonders whether this "shredding state" enum is actually helping, vs. just storing value and typed_value as options?

The reason I first introduced the enum was that in variant_get I needed to dispatch to different methods based on the type of shredding, and I found myself really wanting to have names for them rather than just the combinations of value and typed_value.

Another reason was that I ended up with redundant error checking in VariantArray::try_new (though if perfectly-null arrays are allowed, maybe this becomes irrelevant)

Especially since the shredding spec seems to suggest that none-none case is a valid combination?

I personally think having names for the different shredding types makes working with the code easier, even if all 4 combinations are valid.

I think you are right, however, that the shredding spec allows both to be None. I double checked the Arrow Proposal as it also implies both being null is permitted:

An optional field named value that is binary, large_binary, or binary_view
An optional field named typed_value which can be any primitive type or be a

Contributor Author

My proposal is to leave the ShreddingState enum in and I will file a follow on ticket to support the none-none case

Contributor

@scovich scovich Aug 6, 2025

The reason I first introduced the enum was that in variant_get I needed to dispatch to different methods based on the type of shredding, and I found it myself really wanting to have names for them rather than just the combinations of value and typed value.

I guess... but it seems like there's so much code just managing the resulting enum variants that I question whether it's a net win. Like the getters for metadata, value, etc. or the above code that shrinks from 25 lines to seven while handling more cases.

Clearly either way can be made correct, tho, so I guess it's a matter of preference.

Contributor Author

yeah, I agree it is a matter of preference. If someone else feels strongly, I am not opposed to changing it

Comment on lines +199 to +209
self.shredding_state.metadata_field()
}

fn find_value_field(array: &StructArray) -> Option<ArrayRef> {
array.column_by_name("value").cloned()
/// Return a reference to the value field of the `StructArray`
pub fn value_field(&self) -> Option<&BinaryViewArray> {
self.shredding_state.value_field()
}

/// Return a reference to the metadata field of the [`StructArray`]
pub fn metadata_field(&self) -> &ArrayRef {
// spec says fields order is not guaranteed, so we search by name
&self.metadata_ref
/// Return a reference to the typed_value field of the `StructArray`, if present
pub fn typed_value_field(&self) -> Option<&ArrayRef> {
self.shredding_state.typed_value_field()
Contributor

These become trivial if we ditch the ShreddingState enum

}

/// returns the non-null element at index as a Variant
fn typed_value_to_variant(typed_value: &ArrayRef, index: usize) -> Variant {
Contributor

Suggest to make this return Option<Variant> so callers don't have to check for null themselves.

Suggested change

fn typed_value_to_variant(typed_value: &ArrayRef, index: usize) -> Option<Variant> {
    if typed_value.is_null(index) {
        return None;
    }

Contributor Author

Other Arrow Array APIs don't return Option; they instead return the value directly, requiring you to check is_null first -- see https://docs.rs/arrow/latest/arrow/array/struct.PrimitiveArray.html#method.value for example

    /// Consistently with other Arrow arrays types, this API requires you to
    /// check for nulls first using [`Self::is_valid`].

I think the reason is to allow a better chance at LLVM vectorizing the code, which I don't think is likely relevant here.

We could make the Variant API deviate from this pattern (it is already different in several other ways) and return Option<Variant>.

Contributor

@scovich scovich Aug 6, 2025

other Arrow Array apis don't return Option, they instead return the value directly requiring you to check is_null instead

For public API -- 100% agree. This is a private internal API tho, so it seems like we have room to do what we think makes the code simple/maintainable. Pulling important work inside the method instead of requiring all callers to remember it seems like a good example of that.

We can always change it if we discover it hurts performance or readability.

I think the reason it to allow a better chance at LLVM vectorizing the code

AFAIK, arrow normally requires all array entries to be physically valid, even when logically null. That way, one can perform columnar operations blindly and then just use the null masks to ignore the unwanted values after the fact. Here, we're accessing is_null and value both inside the loop -- and the latter conditionally -- so I'd be very surprised if LLVM was willing to inject vectorization code that requires touching values the code said not to touch.

Contributor Author

@alamb alamb Aug 6, 2025

For public API -- 100% agree. This is a private internal API tho, so it seems like we have room to do what we think makes the code simple/maintainable. Pulling important work inside the method instead of requiring all callers to remember it seems like a good example of that.

Sorry -- I wasn't clear -- the reason I was talking about VariantArray::value is that it is the only caller of typed_value_to_variant, so if we return Option from it, we would just be stuck at the same place 🤔

Here, we're accessing is_null and value both inside the loop -- and the latter conditionally -- so I'd be very surprised if LLVM was willing to inject vectorization code that requires touching values the code said not to touch.

Yeah I agree I don't think it will make any difference for performance with Variants. The primary reason in my mind is to be consistent with other APIs.

I think in an earlier version of this PR I actually had changed value() to return Option. Maybe changing the signature is a good idea 🤔

Comment on lines +317 to +320
match typed_value.data_type() {
    DataType::Int32 => {
        let typed_value = typed_value.as_primitive::<Int32Type>();
        Variant::from(typed_value.value(index))
Contributor

Good use for downcast_primitive_array macro?

Suggested change

downcast_primitive_array! {
    typed_value => Some(Variant::from(typed_value.value(index))),

Contributor Author

I couldn't make that work as it needs more information than just the native type (e.g. the precision/scale for decimals).

I think @superserious-dev's cast_conversion macro can be adapted eventually to handle this case

https://github.com/apache/arrow-rs/pull/8074/files#diff-753713bd9a1945a5fc998fab6018eac0b08ebb4d04d3258b2b615498152a9303R46

Comment on lines +255 to +256
(_metadata_field, None, None) => Err(ArrowError::InvalidArgumentError(String::from(
"VariantArray has neither value nor typed_value field",
Contributor

I'm pretty sure the shredding spec allows this case. It corresponds to a perfectly shredded all-null column where the writer chose to omit both of them as unnecessary.

(see other comment above)

Contributor Author

Makes sense -- I'll file a follow on ticket to add support

@carpecodeum
Contributor

looks great!

Contributor

@scovich scovich left a comment

Dropped a bunch of thoughts about nesting and pathing. Hopefully helpful.

&self.value_ref
/// Variant arrays can be shredded in one of three states, encoded here
#[derive(Debug)]
pub enum ShreddingState {
Contributor

For a future FullyShreddedAllNull variant (neither value nor typed_value present), would we still need to store the metadata even tho it's never actually used? 🤔

Contributor Author

I am not sure

I filed a ticket to track adding AllNull:

Comment on lines +47 to +48
// Dispatch based on the shredding state of the input variant array
match variant_array.shredding_state() {
Contributor

One (big) complexity here -- the requested type could differ wildly from how the input data are currently shredded.

Pathing-wise, we'd need to drill as deeply as possible into the variant array's physical shredding schema, following the requested path. In the happy path, the requested path was shredded and we just have to deal with the corresponding (value, typed_value) pair of columns.

Example

Suppose we're trying to extract v:a.b::INT from a VariantArray with the following physical layout:

v: VARIANT {
    metadata: BINARY,
    value: BINARY,
    typed_value: STRUCT {
        a: SHREDDED_VARIANT {
            value: BINARY,
            typed_value: STRUCT {
                b: SHREDDED_VARIANT {
                    value: BINARY,
                    typed_value: INT,
                },
            },
        },
    },
}

Then we're actually doing v.typed_value.a.typed_value.b::INT -- the builder only has to worry about reconciling v.typed_value.a.typed_value.b.value with v.typed_value.a.typed_value.b.typed_value (which can be passed in directly at construction time)

This case has an even easier example, if the caller requested v:a (an untyped variant extract) -- in that case we simply combine v.typed_value.a with v.metadata into a new VariantArray -- no casting needed because VariantArray can capture the full structure v:a no matter how it was physically shredded.

In a less-happy path, we will hit a partially shredded object where the requested path element is not available in typed_value, and we must go row-oriented from there (with the remaining path extraction performed on a per-row basis):

Example

Suppose we're trying to extract v:a.b::INT from a VariantArray with the following physical layout:

v: VARIANT {
    metadata: BINARY,
    value: BINARY,
    typed_value: STRUCT {
        a: SHREDDED_VARIANT {
            value: BINARY,
            typed_value: STRUCT {
                x: SHREDDED_VARIANT {
                    value: BINARY,
                    typed_value: INT,
                },
            },
        },
    },
}

Then we're actually doing v.typed_value.a:b::INT -- we have to pass v.typed_value.a.value into the builder, which has to extract field b from every row.

In an unhappy path, the partial shredding of the input disagrees with the output's requested shredding, and we have to build up a struct from both shredded and unshredded paths.

Example

Suppose the requested schema asks for both v:a.b::INT and v:a.x::INT, when the VariantArray's layout is:

v: VARIANT {
    metadata: BINARY,
    value: BINARY,
    typed_value: STRUCT {
        a: SHREDDED_VARIANT {
            value: BINARY,
            typed_value: STRUCT {
                b: SHREDDED_VARIANT {
                    value: BINARY,
                    typed_value: INT,
                },
            },
        },
    },
}

Now we have to actually build the struct v.a with two fields by unioning shredded bits (v.typed_value.a.typed_value.b::INT) and unshredded bits (v.typed_value.a.value:x::INT).

In a really unhappy case, the caller requested a shredded variant type that disagrees with the input data's shredding. So some fields have to be shredded and others unshredded.

Example

Suppose we're trying to extract v:a as a partially shredded variant with x shredded and b not shredded... from a VariantArray where b is shredded and x is not shredded:

v: VARIANT {
    metadata: BINARY,
    value: BINARY,
    typed_value: STRUCT {
        a: SHREDDED_VARIANT {
            value: BINARY,
            typed_value: STRUCT {
                b: SHREDDED_VARIANT {
                    value: BINARY,
                    typed_value: INT,
                },
            },
        },
    },
}

Then not only would we need to path into v.typed_value.a.typed_value.b::VARIANT (which requires unshredding it and which produces an o.typed_value.a.value_b column), we'd also have to path into v.typed_value.a.value::INT (which requires shredding it and which contributes to both o.typed_value.a.value_x and o.typed_value.a.typed_value columns). And then we have to variant-union the o.typed_value.a.value_b and o.typed_value.a.value_x columns into a single o.typed_value.a.value column, and pack everything into a new VariantArray.

Contributor Author

Yeah, exactly, this I think is the key challenge of this code -- each different type of output needs to handle all the possible input shredding types as well.

/// if `as_type` is None, the returned array will itself be a VariantArray.
///
/// if `as_type` is `Some(type)` the field is returned as the specified type.
pub as_type: Option<FieldRef>,
Contributor

Presumably None and Some(VariantType) are equivalent?

If so, should we just default-initialize as_type: FieldRef to VariantType?

Contributor

Oh... but variant is not a single type... every possible shredding (or lack thereof) is a different physical DataType. So by passing None, we request to keep whatever structure the underlying path already has; by passing Some(... variant type...) we're requesting to [un|re]shred as necessary to make it fit the new shredding schema.
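A hedged usage sketch of that distinction (constructor and field names follow the snippets in this thread and may not match the final API):

// keep whatever structure the stored path already has (result is a VariantArray)
let as_variant = GetOptions::new_with_path(path.clone());

// request the same path as a concrete Arrow type instead, [un|re]shredding as needed
// (assumes GetOptions has exactly the `path` and `as_type` fields shown above)
let as_int32 = GetOptions {
    path,
    as_type: Some(Arc::new(Field::new("result", DataType::Int32, true))),
};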

Comment on lines +83 to +84
/// Construct options to get the specified path as a variant.
pub fn new_with_path(path: VariantPath<'a>) -> Self {
Contributor

I guess path is optional (defaults to empty) because a leaf extract of a perfectly shredded variant doesn't need any internal pathing -- whoever constructs the builder just passes the leaf columns and the builder only needs to cast them appropriately.

Contributor

Currently, the output builder seems to be fully column-oriented -- it assumes that all values for each leaf column are extracted in a tight loop. This can work for primitive builders, but nested builders will quickly run into pathing and efficiency problems.

I think we'll need to do something similar to the JSON builder, with a row-oriented approach where each level of a nested builder receives an already-constructed Variant for the current row and does a field extract for each child builder; child builders can then cast the result directly or recurse further as needed (based on their own type). And then the top-level builder call would construct a Variant for each row to kick-start the process.

But see the other comment -- to the extent that the shredding aligns nicely, we can hoist a subset of this per-row pathing of the append method up to columnar pathing of the builder's constructor and finalizer.

Contributor Author

I think this aligns with my thinking and is nicely described -- I will use it as the base for the next set of tickets perhaps

@alamb
Contributor Author

alamb commented Aug 7, 2025

I have gone through the comments in this PR -- thank you @carpecodeum @Samyak2 and @scovich

I would very much like to unblock other people working on various parts of variant_get in parallel.

Thus, my plans now are:

  1. Ensure the names for shredding align with the Variant shredding spec
  2. Merge this PR in
  3. File a bunch of follow-on tickets -- several will be quite substantial, like reading shredded objects, lists, and pathing. However, I think we can work on much of it in parallel

Comment on lines +165 to +166

// todo for other primitive types
Contributor Author

Suggested change (remove this line):

// todo for other primitive types

Contributor Author

@alamb alamb left a comment

Ok, I filed a bunch more tickets now


if typed_value.data_type() == self.as_type.data_type() {
Ok(typed_value.clone())
} else {
// TODO: try to cast the typed_value to the desired type?
Contributor Author

Suggested change

// TODO: try to cast the typed_value to the desired type?
// https://github.com/apache/arrow-rs/issues/8086

}
}
dt => {
return Err(ArrowError::NotYetImplemented(format!(
Contributor Author

Suggested change

// https://github.com/apache/arrow-rs/issues/8086
return Err(ArrowError::NotYetImplemented(format!(

/// [Parquet Variant Shredding Spec]: https://github.com/apache/parquet-format/blob/master/VariantShredding.md#value-shredding
#[derive(Debug)]
pub enum ShreddingState {
// TODO: add missing state where there is neither value nor typed_value
Contributor Author

Suggested change

// TODO: add missing state where there is neither value nor typed_value
// https://github.com/apache/arrow-rs/issues/8088

@alamb changed the title from "[Variant] Proposal for variant_get and Shredded VariantArray" to "[Variant] Add variant_get and Shredded VariantArray" on Aug 7, 2025
@alamb
Contributor Author

alamb commented Aug 7, 2025

This implementation is far from complete, but I don't see any objections to the fundamental structure.

So, let's merge this one in and work on improving the situation in follow on PRs

@alamb alamb merged commit c561acb into apache:main Aug 7, 2025
14 checks passed
@alamb
Contributor Author

alamb commented Aug 7, 2025

Whoops, I forgot to add some links. I will make a new PR for that

@alamb alamb deleted the alamb/variant_shredding_strawman branch August 7, 2025 19:40
alamb added a commit that referenced this pull request Aug 13, 2025
# Which issue does this PR close?


- Follow on to #8021

# Rationale for this change

Let's add links to the relevant tickets in the code so future readers
who encounter it can find the relevant context

# What changes are included in this PR?

Add comments with links to tickets

# Are these changes tested?

N/A (just comments)

# Are there any user-facing changes?

No
alamb added a commit that referenced this pull request Aug 18, 2025
# Which issue does this PR close?

We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax.

- Related to #8021

# Rationale for this change

As part of the review in #8021,
@scovich and I were discussing how `VariantArray::value` should behave
in the presence of nulls:
#8021 (comment)

> Suggest to make this return Option<Variant> so callers don't have to
check for null themselves.

I realized it might not be 100% clear that the existing convention in
this crate was that `value()` methods did not check for nulls / return
`Option`. I think we should document it better

# What changes are included in this PR?

Explicitly document that `value` methods do not check for nulls and
explain what happens when they are used on null values

# Are these changes tested?

Yes, by CI

# Are there any user-facing changes?

Additional documentation. No behavior changes

---------

Co-authored-by: Kyle Barron <[email protected]>
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Variant] Support variant_get kernel for shredded variants
4 participants