Conversation

scovich
Contributor

@scovich scovich commented Aug 12, 2025

A bit of pathfinding relating to variant_get and its connection to variant shredding.

Intended as a conversation-starter, not targeting merge (too big, too incomplete, etc).

@github-actions github-actions bot added the parquet-variant parquet-variant* crates label Aug 12, 2025
Contributor Author

@scovich scovich left a comment

Dropping some notes and observations I took away from this exercise.

Comment on lines -105 to -119
// Find the value field, if present
let value = inner
.column_by_name("value")
.map(|v| {
v.as_binary_view_opt().ok_or_else(|| {
ArrowError::NotYetImplemented(format!(
"VariantArray 'value' field must be BinaryView, got {}",
v.data_type()
))
})
})
.transpose()?;

// Find the typed_value field, if present
let typed_value = inner.column_by_name("typed_value");
Contributor Author

Factored out into the new VariantArray::from_parts function below.

inner: StructArray,

/// The metadata column of this variant
metadata: BinaryViewArray,
Contributor Author

In this pathfinding, ShreddingState -- already used for referring to both top-level variant columns (fields: metadata, value, typed_value) -- also proved useful for referring to nested shredded variant columns (fields: value, typed_value).

So I hoisted out the metadata column to unlock that new use case.

// This would be a lot simpler if ShreddingState were just a pair of Option... we already
// have everything we need.
let inner = builder.build();
let shredding_state = ShreddingState::try_new(&inner).unwrap(); // valid by construction
Contributor Author

Technically not quite "valid by construction" today, because the enum can't handle the (None, None) case yet.
But it will, soon enough.
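The "pair of Option" framing can be made concrete with a toy model. This is a hedged sketch only: plain `String`s stand in for Arrow columns, and the enum merely mirrors the variant names discussed in this PR, not the actual API.

```rust
// Toy model of ShreddingState as "just a pair of Option" (illustrative only).
#[derive(Debug, PartialEq)]
enum ShreddingState {
    Unshredded { value: String },
    PerfectlyShredded { typed_value: String },
    PartiallyShredded { value: String, typed_value: String },
}

fn classify(value: Option<String>, typed_value: Option<String>) -> Option<ShreddingState> {
    match (value, typed_value) {
        (Some(value), None) => Some(ShreddingState::Unshredded { value }),
        (None, Some(typed_value)) => Some(ShreddingState::PerfectlyShredded { typed_value }),
        (Some(value), Some(typed_value)) => {
            Some(ShreddingState::PartiallyShredded { value, typed_value })
        }
        // The (None, None) case the enum cannot represent yet, per the note above.
        (None, None) => None,
    }
}
```

The `(None, None)` arm returning `None` is exactly the gap the comment above describes: until the enum grows a variant for it, construction from two absent columns cannot be "valid by construction".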

Variant::new(self.metadata.value(index), value.value(index))
}
-            ShreddingState::Typed { typed_value, .. } => {
+            ShreddingState::PerfectlyShredded { typed_value, .. } => {
Contributor Author

See commentary below for an explanation of these renamed enum variants.

Comment on lines +219 to +253
/// One shredded field of a partially or perfectly shredded variant. For example, suppose the
/// shredding schema for variant `v` treats it as an object with a single field `a`, where `a` is
/// itself a struct with the single field `b` of type INT. Then the physical layout of the column
/// is:
///
/// ```text
/// v: VARIANT {
/// metadata: BINARY,
/// value: BINARY,
/// typed_value: STRUCT {
/// a: SHREDDED_VARIANT_FIELD {
/// value: BINARY,
/// typed_value: STRUCT {
/// a: SHREDDED_VARIANT_FIELD {
/// value: BINARY,
/// typed_value: INT,
/// },
/// },
/// },
/// },
/// }
/// ```
///
/// In the above, each row of `v.value` is either a variant value (shredding failed, `v` was not an
/// object at all) or a variant object (partial shredding, `v` was an object but included unexpected
/// fields other than `a`), or is NULL (perfect shredding, `v` was an object containing only the
/// single expected field `a`).
///
/// A similar story unfolds for each `v.typed_value.a.value` -- a variant value if shredding failed
/// (`v:a` was not an object at all), or a variant object (`v:a` was an object with unexpected
/// additional fields), or NULL (`v:a` was an object containing only the single expected field `b`).
///
/// Finally, `v.typed_value.a.typed_value.b.value` is either NULL (`v:a.b` was an integer) or else a
/// variant value.
pub struct ShreddedVariantFieldArray {
Contributor Author

This is a marginally useful new Array that represents a shredded variant field. Unlike the top-level variant, it has only (value?, typed_value?) fields -- no metadata. The variant shredding spec mandates this, and I hesitate to "just" reuse VariantArray with its metadata field. Writing that out to a parquet file would violate the spec, which seems like too big of a footgun and a large departure from typical arrow arrays that physically match the structure of the underlying parquet.

I say "marginally useful" because:

  • Without a metadata column, one cannot actually define a value method for it.
  • The pathfinding code in this PR only uses it as a thin container -- all the code is built around the shared ShreddingState concept.
  • There is only one use site, in follow_shredded_path_element, so we could work directly with StructArray and just validate it manually.

Contributor Author

From @alamb --

If we could somehow figure out how to make this be VariantArray rather than ShreddedVariantFieldArray I think that would be the most elegant / understandable

That was my initial reaction as well. But the metadata field becomes problematic. Do we make it optional? Require it anyway and risk writing out invalid parquet if somebody forgets to strip it out or ignore it? etc

Contributor Author

@scovich scovich Aug 13, 2025

Another thought: The variant shredding spec doesn't name the type of the group that represents shredded object fields, but it's definitely NOT VARIANT. It's some other thing that must never have a metadata column and which must always nest inside a VARIANT type that provides the metadata column it depends on. And it's required where VARIANT is optional. And it allows both value and typed_value to be NULL/missing where VARIANT forbids that combination.

In other words the two are really not the same thing -- other than they happen to contain two fields with the same name -- and I worry that if we try to conflate the two we're just going to cause a lot of trouble for ourselves.

Edit: Tabular form might be easier to digest:

|            | VARIANT   | Object field | Array element |
|------------|-----------|--------------|---------------|
| metadata   | required  | forbidden    | forbidden     |
| group type | optional  | required     | required      |
| NULL/NULL  | forbidden | allowed      | forbidden     |

Contributor Author

NOTE: Array elements have almost the same funky properties as object fields: required and no metadata, but they forbid the NULL/NULL combo that object fields allow.
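The rules above can be encoded as a small sketch. This is illustrative only, assuming the table's three contexts; the `Context` enum and helper names are invented here, not taken from the spec or the PR.

```rust
// Validity rules per context, following the table above (hypothetical names).
#[derive(Clone, Copy)]
enum Context {
    TopLevelVariant,
    ObjectField,
    ArrayElement,
}

/// Does this context carry its own `metadata` column?
/// (Required for VARIANT, forbidden for object fields and array elements.)
fn metadata_allowed(ctx: Context) -> bool {
    matches!(ctx, Context::TopLevelVariant)
}

/// Must the group itself be non-nullable? (VARIANT is optional; the others required.)
fn group_required(ctx: Context) -> bool {
    !matches!(ctx, Context::TopLevelVariant)
}

/// May a row have both `value` and `typed_value` NULL?
/// (Only object fields allow this; it means "field absent on this row".)
fn null_null_allowed(ctx: Context) -> bool {
    matches!(ctx, Context::ObjectField)
}
```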


/// Builder that extracts a struct. Casting failures produce NULL or error according to options.
#[allow(unused)]
struct StructVariantShreddingRowBuilder {
Contributor Author

Terrible name, but I couldn't think of a better one. Even worse name below for partially shredded structs...


/// Used for actual shredding of binary variant values into shredded variant values
#[allow(unused)]
struct ShreddedVariantRowBuilder {
Contributor Author

This implementation is very incomplete, but hopefully has enough of a sketch to give an idea of what the implementation should look like? Solving it requires a way of throwing around variant bytes, so e.g. we can iterate over a stream of variant fields and filter out the unwanted ones to build a new variant object.

Contributor

would this also be used to shred unshredded variant values to shredded arrays?

Contributor Author

That's exactly what it would be used for. If the user passed a shredded variant type to variant_get and the path ends inside a binary variant, we have no choice but to shred accordingly (**) (***).

If the user just passed a concrete type to variant_get, and the path ends inside a binary variant, we would instead use e.g. PrimitiveVariantShreddingRowBuilder or StructVariantShreddingRowBuilder.

(**) Things get complicated fast, if the user passed a shredded variant type to variant_get and the path ends inside a shredded variant with a different schema. This pathfinding PR didn't even touch that case, but it would probably look vaguely similar to the struct specialization, where we try to follow the common/compatible paths of the two shredding schemas (in hopes of returning unmodified columns directly) and fall back to actual builders only where the paths diverge.

(***) If the user passed a binary variant schema and the underlying data were shredded, then we have to use a BinaryVariantRowBuilder to insert the variant bytes produced by e.g. VariantArray::value.
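The cases in this reply amount to a small decision table. The sketch below is a hypothetical summary, not PR code: the enum names are invented, and the builder names in the comments only echo the ones mentioned above.

```rust
// Which conversion is needed for a given (requested, underlying) pair?
#[derive(Clone, Copy, PartialEq, Debug)]
enum Repr {
    BinaryVariant,
    Shredded,
    Concrete, // e.g. Int32
}

#[derive(PartialEq, Debug)]
enum Action {
    Shred,        // e.g. via a ShreddedVariantRowBuilder
    Unshred,      // e.g. via a BinaryVariantRowBuilder
    Reshred,      // differing shredding schemas; follow compatible paths, rebuild the rest
    CastRowByRow, // e.g. via a PrimitiveVariantShreddingRowBuilder
    PassThrough,  // underlying columns can be returned unchanged
}

fn plan(requested: Repr, underlying: Repr, same_schema: bool) -> Action {
    match (requested, underlying) {
        // Shredded type requested, path ends inside a binary variant: must shred.
        (Repr::Shredded, Repr::BinaryVariant) => Action::Shred,
        // Binary variant requested, data is shredded: rebuild variant bytes.
        (Repr::BinaryVariant, Repr::Shredded) => Action::Unshred,
        // Both shredded but with different schemas: the hard (**) case.
        (Repr::Shredded, Repr::Shredded) if !same_schema => Action::Reshred,
        // Concrete type requested: cast each row.
        (Repr::Concrete, _) => Action::CastRowByRow,
        _ => Action::PassThrough,
    }
}
```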

Option<&'a mut dyn VariantShreddingRowBuilder>,
),
> {
std::iter::empty()
Contributor Author

Suggested change
std::iter::empty()
std::iter::empty() // TODO

struct VariantShreddingStructRowBuilder {
metadata: arrow::array::BinaryViewArray,
nulls: NullBufferBuilder,
value_builder: BinaryVariantRowBuilder,
Contributor Author

This is a placeholder... I'm not quite sure what form the value builder (and its corresponding null masks) should actually take?

}
}

impl<'a> From<&[VariantPathElement<'a>]> for VariantPath<'a> {
Contributor Author

IntoIterator (below) requires values, and iterators to slice references return references.

@alamb
Contributor

alamb commented Aug 12, 2025

I added a link to #8083 to the description of this PR

@alamb
Contributor

alamb commented Aug 12, 2025

I plan to review this first thing tomorrow morning

Contributor

@alamb alamb left a comment

Thank you @scovich

TLDR is that this PR looks like it is on the right track to me. Maybe we can start breaking it down into pieces to make it easier to review and test

Some thoughts of things that might make their own good PRs

  1. Add tests -- for example, perhaps we can explicitly re-create the data from the spec here https://github.com/apache/parquet-format/blob/master/VariantShredding.md#objects as a test
  2. Add the CopyingVariantBuilder / a way to copy parts of variants around
  3. Rename the variants of ShreddingState
  4. Create set of traits for VariantAsPrimitive, etc type traits for use converting to the arrow types

It seems like this PR has both sketches of reading shredded variant values as well as writing shredded variant which is quite cool; I know they are somewhat related, but maybe we can work on them one at a time

I am also happy to schedule a call to discuss next steps face to face -- feel free to drop me an email at [email protected] perhaps. I could also schedule something face to face for any contributor who might like to join us

/// }
/// ```
///
/// In the above, each row of `v.value` is either a variant value (shredding failed, `v` was not an
Contributor

In going through this explanation, I think it might be helpful to make some specific examples of valid values. Maybe we can add them to the docs too -- something like

Fields could be shredded

{
  "a" : {
    "b": 42 <-- stored as an integer in typed_value field
  }
}

Field could not be shredded

{
  "a" : {
    "b": "foo" <-- stored as a variant
  }
}

Extra non shredded fields

{
  "a" : {
    "b": 42, <-- stored as an integer in typed_value field
    "z": "foo" <-- stored as a variant in sub field?
  }
}

But I am not sure about the last example, to be honest

Contributor Author

Those examples all look correct, yes.

/// Partially shredded:
/// * value is an object
/// * typed_value is a shredded object.
PerfectlyShredded { typed_value: ArrayRef },
Contributor

My understanding was that if typed_value was non null, it contains the value and an implementation doesn't even have to check the value field

If typed_value is null, then we have to check the value field and produce a null in the output when needed

///
/// TODO: move to arrow crate
#[derive(Debug, Default, Clone)]
pub struct StructArrayBuilder {
Contributor

Yeah, there is already a class called StructBuilder which helps build struct arrays row by row.

https://docs.rs/arrow/latest/arrow/array/struct.StructBuilder.html

Maybe we could add StructArrayBuilder to the same location. With sufficient documentation examples, I think it would be straightforward to add

Is this something it would help if I looked into?


mod output;

pub(crate) enum ShreddedPathStep<'a> {
Contributor

yes, this makes lots of sense

/// deeper. For a `VariantPathElement::Field`, the step fails if there is no `typed_value` at this
/// level, or if `typed_value` is not a struct, or if the requested field name does not exist.
///
/// TODO: Support `VariantPathElement::Index`? It wouldn't be easy, and maybe not even possible.
Contributor

Yeah, I think we would likely have to copy the relevant bytes to a new array


let field = field
.as_any()
.downcast_ref::<ShreddedVariantFieldArray>()
Contributor

If we could somehow figure out how to make this be VariantArray rather than ShreddedVariantFieldArray I think that would be the most elegant / understandable

Comment on lines +205 to +208
/// TODO: How would a caller request a struct or list type where the fields/elements can be any
/// variant? Caller can pass None as the requested type to fetch a specific path, but it would
/// quickly become annoying (and inefficient) to call `variant_get` for each leaf value in a struct or
/// list and then try to assemble the results.
Contributor

Maybe one way to handle this case would be to extract the fields your code wants to work with as individual arrays, rather than trying to get back a single StructArray with the relevant fields

We don't really have a great story at the moment even for casting one struct array to another when the fields don't match up. See, for example, the discussion in

Therefore I think maybe we can actually punt on casting the result to a proper struct array recursively (we can certainly add a TODO item for a follow on PR)

}

fn append_value(&mut self, value: &Variant<'_, '_>) -> Result<bool> {
if let Some(v) = value.as_primitive() {
Contributor

this is very clever

Comment on lines +134 to +139
// We need a way to convert a Variant directly to bytes. In particular, we want to just copy
// across the underlying value byte slice of a `Variant::Object` or `Variant::List`, without
// any interaction with a `VariantMetadata` (because we will just reuse the existing one).
//
// One could _probably_ emulate this with parquet_variant::VariantBuilder, but it would do a
// lot of unnecessary work and would also create a new metadata column we don't need.
Contributor

Yes, I agree that we are missing some way to copy a variant or some subpart of the variant to a new Variant.

I wonder if we could make something that implements VariantBuilderExt that is designed for this usecase

https://github.com/apache/arrow-rs/blob/e845411dbf26a10da072af772b7cd98f9f05d0b5/parquet-variant/src/builder.rs#L1542-L1541

Something like

let (metadata, value) = get_variant(); // existing variant and metadata
let variant = Variant::new(metadata, value);

// Create a new variant builder for copying variants without recreating the metadata
let mut builder = CopyingVariantBuilder::new(existing_metadata, capacity);
// copying this value just copies the bytes, does not re-create the metadata
// errors if the metadata pointer doesn't match maybe?
builder.append_value(&variant)?;

🤔



@scovich
Contributor Author

scovich commented Aug 13, 2025

It seems like this PR has both sketches of reading shredded variant values as well as writing shredded variant which is quite cool; I know they are somewhat related, but maybe we can work on them one at a time

IMO they're more than just "somewhat related" -- if you squint, shredding is precisely a variant_get call where the input is a plain binary variant, the path is empty, and the requested type is the desired shredding schema. Even if the actual shredding implementation did something more clever in order to skip the intermediate binary variant representation, variant_get still has to handle that and more. For example, variant_get also has to unshred, in case the caller requested plain binary variant but the input was shredded. And also has to reshred, in case the caller requested a different shredding schema than the input has.

@alamb
Contributor

alamb commented Aug 13, 2025

IMO they're more than just "somewhat related" -- if you squint, shredding is precisely a variant_get call where the input is a plain binary variant, the path is empty, and the requested type is the desired shredding schema.

I see -- that is a fascinating duality and I agree it is the same

I still think we will be better off if we can somehow split this up into smaller pieces, so I was trying to find points where we could carve out parts

@scovich
Contributor Author

scovich commented Aug 13, 2025

I still think we will be better off if we can somehow split this up into smaller pieces, so I was trying to find points where we could carve out parts

Oh, for sure. This pathfinding PR covers way too much territory to actually merge as a single PR.

Contributor Author

@scovich scovich left a comment

@alamb -- Thanks for the insightful comments, I replied to several below.

/// a: SHREDDED_VARIANT_FIELD {
/// value: BINARY,
/// typed_value: STRUCT {
/// a: SHREDDED_VARIANT_FIELD {
Contributor Author

Suggested change
/// a: SHREDDED_VARIANT_FIELD {
/// b: SHREDDED_VARIANT_FIELD {


/// Partially shredded:
/// * value is an object
/// * typed_value is a shredded object.
PerfectlyShredded { typed_value: ArrayRef },
Contributor Author

@scovich scovich Aug 13, 2025

My understanding was that if typed_value was non null, it contains the value and an implementation doesn't even have to check the value field

For primitive values and arrays, yes -- (NULL, NULL) is an invalid case and (Variant::Null, NULL) translates to Variant::Null (which casts to NULL of any other type).

Objects have a multi-step process to access field f:

  1. Is the typed_value NULL?
    • Yes => Not even an object => have to use value (or just fail, since nothing casts to struct)
      • It is illegal for value to also be NULL, because this is the top-level
    • No => [partially] shredded object => continue (ignore value)
  2. Does the column typed_value.f exist?
    • No => f was not in the shredding schema => have to check value which may or may not be a variant object
    • Yes => f was in the shredding schema => continue
  3. Is typed_value.f.typed_value NULL?
    • Yes => This row does not contain f, or f is the wrong type => fall back to typed_value.f.value
      • If that is also NULL then f is NULL; otherwise, f could be any variant object (including Variant::Null)
    • No => This row contains f and it is shredded (ignore typed_value.f.value)
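The three-step lookup above can be modeled with plain `Option`s standing in for Arrow columns. Every type and function name here is hypothetical, invented to illustrate the decision procedure, not taken from the PR.

```rust
use std::collections::HashMap;

// What accessing field `f` on one row resolves to (illustrative names).
#[derive(Debug, PartialEq)]
enum FieldAccess<'a> {
    NotAnObject(&'a str),        // step 1: typed_value NULL, must use value
    CheckValue(Option<&'a str>), // step 2: f not in shredding schema; value may hold it
    FromValue(Option<&'a str>),  // step 3: fall back to typed_value.f.value
    Shredded(i64),               // step 3: f is shredded on this row
}

struct ShreddedField<'a> {
    value: Option<&'a str>,
    typed_value: Option<i64>,
}

struct ShreddedObject<'a> {
    value: Option<&'a str>,
    typed_value: Option<HashMap<&'a str, ShreddedField<'a>>>,
}

fn get_field<'a>(obj: &'a ShreddedObject<'a>, f: &str) -> FieldAccess<'a> {
    // Step 1: is typed_value NULL? Then this row is not even an object.
    let Some(fields) = &obj.typed_value else {
        // At top level, value may not also be NULL.
        return FieldAccess::NotAnObject(obj.value.expect("value and typed_value both NULL"));
    };
    // Step 2: does the column typed_value.f exist in the shredding schema?
    let Some(field) = fields.get(f) else {
        return FieldAccess::CheckValue(obj.value);
    };
    // Step 3: is typed_value.f.typed_value NULL?
    match field.typed_value {
        Some(t) => FieldAccess::Shredded(t),         // ignore typed_value.f.value
        None => FieldAccess::FromValue(field.value), // NULL here means f is NULL
    }
}
```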

///
/// TODO: move to arrow crate
#[derive(Debug, Default, Clone)]
pub struct StructArrayBuilder {
Contributor Author

I think it depends on whether we think this is a good builder idiom to introduce?
At least, I haven't seen anything like it elsewhere in arrow-rs?

One nice property of this builder is it eliminates the risk of mismatch between field type and array type. But a public version of this API would still need to have a fallible alternative, because array lengths could still mismatch and cause a panic.
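A minimal self-contained sketch of the fallible alternative mentioned here, with `Vec<i64>` standing in for arbitrary Arrow arrays. Nothing below is the actual arrow-rs API; it only illustrates returning an error (instead of panicking) when column lengths mismatch.

```rust
// Toy StructArrayBuilder: pairs field names with columns; try_build is fallible.
struct StructArrayBuilder {
    fields: Vec<(String, Vec<i64>)>,
}

impl StructArrayBuilder {
    fn new() -> Self {
        Self { fields: Vec::new() }
    }

    fn with_field(mut self, name: &str, column: Vec<i64>) -> Self {
        self.fields.push((name.to_string(), column));
        self
    }

    /// Fallible build: errors instead of panicking if column lengths mismatch.
    fn try_build(self) -> Result<Vec<(String, Vec<i64>)>, String> {
        let mut lens = self.fields.iter().map(|(_, c)| c.len());
        if let Some(first) = lens.next() {
            if lens.any(|len| len != first) {
                return Err("column length mismatch".to_string());
            }
        }
        Ok(self.fields)
    }
}
```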


let field = field
.as_any()
.downcast_ref::<ShreddedVariantFieldArray>()
Contributor Author

@scovich scovich Aug 13, 2025

Should we continue this discussion in the other comment thread #8122 (comment), where I outlined the pros and cons of this struct?

Comment on lines +205 to +208
/// TODO: How would a caller request a struct or list type where the fields/elements can be any
/// variant? Caller can pass None as the requested type to fetch a specific path, but it would
/// quickly become annoying (and inefficient) to call `variant_get` for each leaf value in a struct or
/// list and then try to assemble the results.
Contributor Author

Maybe one way to handle this case would be to extract the fields your code wants to work with as individual arrays, rather than trying to get back a single StructArray with the relevant fields

Agreed, but that would be pretty inefficient if each individual leaf column needs a separate variant_get call that has to repeat the work of pathing through common path prefixes. I suppose the caller could take on the work of memoizing common path prefixes, but then they're creating a lot of intermediate arrays. Not just annoying, but also slow because it forces columnar access for row-wise pathing.

We could potentially define a bulk API like variant_get_many, but then we would need to somehow identify the common paths, deduplicate them, and then recursively process them... and that internal recursion would still need a way to say "just give me whatever is at the end of each path". If we found an elegant way to communicate that internally, we could just make that solution the public API and be done with it.

Comment on lines +205 to +208
/// TODO: How would a caller request a struct or list type where the fields/elements can be any
/// variant? Caller can pass None as the requested type to fetch a specific path, but it would
/// quickly become annoying (and inefficient) to call `variant_get` for each leaf value in a struct or
/// list and then try to assemble the results.
Contributor Author

I think maybe we can actually punt on casting the result to a proper struct array recursively

Assuming the caller did, in fact, ask for a fully strongly typed struct (no variant leaf fields), I'm pretty sure this PR already covers the solution? Even if they ask for a variant leaf field, they just pay a performance penalty to shred/unshred/reshred each field if the underlying shredding doesn't match the requested shredding.

Comment on lines +205 to +208
/// TODO: How would a caller request a struct or list type where the fields/elements can be any
/// variant? Caller can pass None as the requested type to fetch a specific path, but it would
/// quickly become annoying (and inefficient) to call `variant_get` for each leaf value in a struct or
/// list and then try to assemble the results.
Contributor Author

Actually, an idea!

What if we introduce a companion utility, variant_get_schema, which instead of returning an ArrayRef returns a DataType?

That way, a caller who wants to extract one or more variant fields as part of a larger struct can fetch the actual variant shredding schema for each such field and splice it into the data type they pass to variant_get.

For example, suppose I want to work with the following logical schema:

STRUCT {
    a: STRUCT {
        id: INT,
        payload: VARIANT,
    },
    b: STRUCT {
        key: STRING,
        payload: VARIANT,
    },
}

Then I could make two calls to variant_get_schema followed by a (now precisely targeted) call to variant_get:

// hand-waving the path argument for simplicity...
let a_payload_type = variant_get_schema(input, &["a", "payload"])?;
let b_payload_type = variant_get_schema(input, &["b", "payload"])?;
let a_type = DataType::Struct(Fields::from([
    Field::new("id", ...),
    Field::new("payload", a_payload_type, ...),
]));
let b_type = DataType::Struct(Fields::from([
    Field::new("key", ...),
    Field::new("payload", b_payload_type, ...),
]));
let data_type = DataType::Struct(Fields::from([
    Field::new("a", a_type, ...),
    Field::new("b", b_type, ...),
]));
let data = variant_get(input, &[], Some(data_type))?;

If the path ended on a shredded variant, variant_get_schema would return the corresponding shredding schema, allowing variant_get to return the corresponding sub-arrays unchanged. Otherwise, it would just return a binary variant schema, since variant_get would anyway have to extract the bytes row by row.

Comment on lines +134 to +139
// We need a way to convert a Variant directly to bytes. In particular, we want to just copy
// across the underlying value byte slice of a `Variant::Object` or `Variant::List`, without
// any interaction with a `VariantMetadata` (because we will just reuse the existing one).
//
// One could _probably_ emulate this with parquet_variant::VariantBuilder, but it would do a
// lot of unnecessary work and would also create a new metadata column we don't need.
Contributor Author

Interesting thought! I had forgotten about VariantBuilderExt. It almost solves the issue with building up (or filtering down) partially shredded objects -- its new_object method could return a builder used to add the desired fields. Except (a) ObjectBuilder is a concrete type, not a trait, so we can't customize its behavior; and (b) that still needs to mess with string field names and metadata dictionaries. We'd really want to define a raw VariantObjectField -- basically a (field_id, bytes) pair -- and have a method for appending that to the builder.
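A toy sketch of the raw `(field_id, bytes)` idea: the builder copies value bytes verbatim and never touches a metadata dictionary. `VariantObjectField` and `RawObjectBuilder` are hypothetical names coined here, not parquet-variant APIs.

```rust
/// A raw object field: an id into an existing metadata dictionary, plus the
/// already-encoded variant value bytes for that field (hypothetical type).
struct VariantObjectField<'a> {
    field_id: u32,
    bytes: &'a [u8],
}

/// Toy builder that appends raw fields without re-encoding metadata.
struct RawObjectBuilder {
    fields: Vec<(u32, Vec<u8>)>,
}

impl RawObjectBuilder {
    fn new() -> Self {
        Self { fields: Vec::new() }
    }

    /// Copy the field's value bytes verbatim; the field id indexes into an
    /// existing metadata dictionary that this builder never modifies.
    fn append_field(&mut self, f: VariantObjectField<'_>) {
        self.fields.push((f.field_id, f.bytes.to_vec()));
    }

    fn num_fields(&self) -> usize {
        self.fields.len()
    }
}
```

This is the shape that would let a stream of variant fields be filtered and re-emitted (e.g. dropping shredded fields from a partially shredded object) without string lookups or metadata rebuilding.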



}
}

/// Builder that extracts a struct. Casting failures produce NULL or error according to options.
Contributor Author

DataType::Struct -> StructArray

/// variant? Caller can pass None as the requested type to fetch a specific path, but it would
/// quickly become annoying (and inefficient) to call `variant_get` for each leaf value in a struct or
/// list and then try to assemble the results.
pub fn variant_get(input: &ArrayRef, options: GetOptions) -> Result<ArrayRef> {
Contributor Author

@scovich scovich Aug 14, 2025

From a call, ideas to make progress:

  1. Support variant_get for any input (shredded or otherwise), any depth of object field path steps, and casting to e.g. Some(DataType::Int32). This should close the loop and is potentially a good candidate for first PR.
  2. Support variant_get for any path and no data type. This would extract whatever variant sub-value is found at the path's end
    • Needs value byte copy capability
  3. Support variant_get for Some(DataType::Struct) (nested shredding)
    • Needs read-only metadata builder capability
  4. Support variant_get for Some(DataType::VARIANT)
    • Needs to build out the rest of the VariantShreddingRowBuilder implementations
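Idea 2 ("extract whatever variant sub-value is found at the path's end") amounts to walking object field path steps like "a.b.x". A self-contained toy model of that walk, using an illustrative enum rather than the crate's actual `Variant` type:

```rust
use std::collections::HashMap;

/// Toy stand-in for a variant value; the real crate's `Variant` is far richer.
#[derive(Clone, Debug, PartialEq)]
enum Toy {
    Int(i32),
    Object(HashMap<String, Toy>),
}

/// Follow a dotted field path like "a.x", returning None if any step is missing
/// or the path descends into a non-object.
fn get_path<'a>(mut v: &'a Toy, path: &str) -> Option<&'a Toy> {
    for step in path.split('.') {
        match v {
            Toy::Object(fields) => v = fields.get(step)?,
            _ => return None, // path continues into a non-object
        }
    }
    Some(v)
}

fn main() {
    let a = Toy::Object(HashMap::from([("x".to_string(), Toy::Int(55))]));
    let root = Toy::Object(HashMap::from([("a".to_string(), a)]));
    assert_eq!(get_path(&root, "a.x"), Some(&Toy::Int(55)));
    assert_eq!(get_path(&root, "a.y"), None);
    println!("ok");
}
```

The real implementation walks shredded columns (not an in-memory tree) where possible, but the missing-path semantics are the same: a dead end yields NULL, not an error.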

Contributor
Thank you 🙏 -- my plan is to try and turn these into tickets tomorrow morning

alamb added a commit that referenced this pull request Sep 8, 2025
# Which issue does this PR close?

- Closes #8150
- closes #8166

# Rationale for this change

Add support for extracting fields from both shredded and non-shredded
variant arrays at any depth (like "x", "a.x", "a.b.x") and casting them
to Int32 with proper NULL handling for type mismatches.
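The "NULL handling for type mismatches" behavior can be sketched in isolation: under safe casting a mismatched row becomes NULL, while strict casting raises an error. This is an illustrative stand-alone sketch, not the PR's actual code; `Leaf` and `cast_to_i32` are hypothetical names.

```rust
/// Illustrative stand-in for a variant leaf value.
#[derive(Debug, Clone, PartialEq)]
enum Leaf {
    Int32(i32),
    Str(String),
}

/// Safe casting maps a type mismatch to None (NULL); strict casting errors.
fn cast_to_i32(v: &Leaf, safe: bool) -> Result<Option<i32>, String> {
    match v {
        Leaf::Int32(i) => Ok(Some(*i)),
        _ if safe => Ok(None), // mismatch becomes NULL under safe casting
        other => Err(format!("cannot cast {other:?} to Int32")),
    }
}

fn main() {
    assert_eq!(cast_to_i32(&Leaf::Int32(7), true).unwrap(), Some(7));
    assert_eq!(cast_to_i32(&Leaf::Str("x".into()), true).unwrap(), None);
    assert!(cast_to_i32(&Leaf::Str("x".into()), false).is_err());
    println!("ok");
}
```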

NOTE: This is a second attempt at  
* #8166 
which suffered strong logical conflicts with
* #8179
and which itself grew out of
* #8122

See the other two PR for the vast majority of review commentary relating
to this change.

I started from the original PR commits (first three), performed the
merge, and fixed up a bunch of issues.

Manually diffing the before
([76b75ee..882aa4d](https://github.com/apache/arrow-rs/compare/76b75eebc..882aa4d69))
and after
([0ba91ae..f6fd915](https://github.com/apache/arrow-rs/compare/0ba91aed9..f6fd91583))
diffs gives the following non-trivial differences vs. the original PR
* Ran `cargo fmt`
* `typed_value_to_variant` now supports all primitive numeric types
(previously only int16)
* cast options plumbed through and respected
* Fix a null buffer bug in `shredded_get_path` -- the original code was
wrongly unioning in the null buffer from `typed_value` column:
    ```patch
      // Path exhausted! Create a new `VariantArray` for the location we landed on.
    - // Also union nulls from the final typed_value field we landed on
    - if let Some(typed_value) = shredding_state.typed_value_field() {
    -     accumulated_nulls = arrow::buffer::NullBuffer::union(
    -         accumulated_nulls.as_ref(),
    -         typed_value.nulls(),
    -     );
    - }
      let target = make_target_variant(
          shredding_state.value_field().cloned(),
          shredding_state.typed_value_field().cloned(),
          accumulated_nulls,
      );
    ```
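The bug matters because a null `typed_value` at some row only means that row was not shredded to the typed column; the row's value may still be present in the binary `value` column. A self-contained sketch of the intended per-row null semantics (`row_is_valid` is a hypothetical helper, and the example bytes are arbitrary, not real variant encoding):

```rust
/// Per-row presence for a shredded variant column: a row is non-null if
/// EITHER the binary `value` or the `typed_value` column has data for it.
fn row_is_valid(value: Option<&[u8]>, typed_value: Option<i32>) -> bool {
    value.is_some() || typed_value.is_some()
}

fn main() {
    // Row shredded into the typed column: valid.
    assert!(row_is_valid(None, Some(42)));
    // Row left in binary form (typed_value null): still valid --
    // unioning typed_value's nulls in would wrongly drop this row.
    assert!(row_is_valid(Some([0x0c, 0x2a].as_slice()), None));
    // Row with neither column populated: truly missing.
    assert!(!row_is_valid(None, None));
    println!("ok");
}
```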
* Remove the `get_variant_perfectly_shredded_int32_as_variant` test
case, because #8179 introduced a
battery of unit tests that cover the same functionality.
* Remove now-unnecessary `.unwrap()` calls from object builder `finish`
calls in unit tests
* Fixed broken test code in `create_depth_1_shredded_test_data_working`,
which captured the return value of a nested builder's `finish` (`()`)
instead of the return value of the top-level builder. I'm not quite sure
what this code was trying to do, but I changed it to just create a
nested builder instead of a second top-level builder:
    ```patch
      fn create_depth_1_shredded_test_data_working() -> ArrayRef {
          // Create metadata following the working pattern from shredded_object_with_x_field_variant_array
          let (metadata, _) = {
    -         let a_variant = {
    -             let mut a_builder = parquet_variant::VariantBuilder::new();
    -             let mut a_obj = a_builder.new_object();
    -             a_obj.insert("x", Variant::Int32(55)); // "a.x" field (shredded when possible)
    -             a_obj.finish().unwrap()
    -         };
              let mut builder = parquet_variant::VariantBuilder::new();
              let mut obj = builder.new_object();
    -         obj.insert("a", a_variant);
    +
    +         // Create the nested "a" object
    +         let mut a_obj = obj.new_object("a");
    +         a_obj.insert("x", Variant::Int32(55));
    +         a_obj.finish();
    +
              obj.finish().unwrap();
              builder.finish()
          };
    ```
* Similar fix (twice, `a_variant` and `b_variant`) for
`create_depth_2_shredded_test_data_working`
* `make_shredding_row_builder` now supports signed int and float types
(unsigned int not supported yet)
* A new `get_type_name` helper in row_builder.rs that gives
human-readable data type names. I'm not convinced it's necessary (and
the code is in the wrong spot, jammed in the middle of
`VariantAsPrimitive` code).
* `impl VariantAsPrimitive` for all signed int and float types
* `PrimitiveVariantShreddingRowBuilder` now has a lifetime param because
it takes a reference to cast options (it now respects unsafe vs. safe
casting)

# What changes are included in this PR?

Everything in the original PR, plus merge in the main branch, fix
logical conflicts and fix various broken tests.

# Are these changes tested?

All unit tests now pass.

# Are there any user-facing changes?

No (variant is not public yet)

---------

Co-authored-by: carpecodeum <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>