[VARIANT] Path-based Field Extraction for VariantArray #7946

carpecodeum · 2025-07-16T21:51:40Z

Which issue does this PR close?

This PR implements efficient path-based field extraction and manipulation capabilities for VariantArray, enabling direct access to nested fields without expensive unshredding operations.
Follow-up on #7919

Closes [Variant] Support variant_get kernel for shredded variants #7941

Rationale for this change

This work builds directly on the path navigation concepts introduced in #7919, sharing the fundamental VariantPathElement design with Field and Index variants. While PR #7919 provided a compute kernel approach with a variant_get function, this PR provides instance-based methods directly on VariantArray, with a builder API that uses owned strings rather than PR #7919's vector-based approach.

This is still a draft: the changes for #7919 were merged today and I still have to incorporate them. I am looking forward to reviews and suggestions.

This PR is complementary to #7921, which implements schema-driven shredding during array construction. This PR provides runtime path-based access to both shredded and unshredded data, creating a complete solution for both efficient construction and efficient access of variant data.

Big thanks to @mprammer and @PinkCrow007 for their continued support throughout my Variant exploration.

What changes are included in this PR?

Field removal operations through methods like remove_field and remove_fields enable removal of specific fields from variant data, crucial for shredding operations where temporary or debug fields need to be stripped. field_operations.rs provides direct binary manipulation through functions like get_path_bytes, extract_field_bytes, and remove_field_bytes that operate on raw binary format without constructing intermediate objects. variant_parser.rs supports all variant types with parsers for 17 different primitive types, providing the foundation for efficient binary navigation.

The performance-critical byte operations could serve as the underlying implementation for PR #7919's compute kernel, potentially providing better performance for batch operations by avoiding object construction overhead. The field removal capabilities could extend PR #7919's functionality beyond extraction to comprehensive field manipulation. The instance-based approach provides different ergonomics that complement PR #7919's compute kernel approach.

This PR focuses on runtime access and manipulation rather than construction-time optimization, leaving build-time schema-driven shredding to PR #7921. Future work includes integrating with PR #7919's compute kernel approach, potentially using this PR's byte-level operations as the underlying implementation.

Are these changes tested?

Yes, tests are added

Are there any user-facing changes?

Not yet

carpecodeum · 2025-07-16T21:53:19Z

CC - @alamb @Samyak2 @friendlymatthew @scovich

alamb

Thank you for this PR @carpecodeum

This is very cool

I think there is already a variant_get implementation in https://github.com/apache/arrow-rs/blob/d809f19bc0fe2c3c1968f5111b6afa785d2e8bcd/parquet-variant-compute/src/variant_get.rs#L35-L34 contributed by @Samyak2

To take the next steps and implement shredding I think we will need two things:

A way to create shredded variants
A way to represent shredded variants

The idea of removing fields from Variants is interesting, though I wonder if that is an operation we would ever want to do on a single Variant instance -- it seems like removing fields for shredding will require copying the underlying bytes anyway, so I was thinking we might just want to create an output variant array entirely

Something like

fn variant_shred(input: VariantArray, output: VariantArray, schema: SchemaRef)

Maybe it is worth looking at how the Java or Go implementations work

parquet-variant-compute/.cargo/config.toml

parquet-variant-compute/examples/field_removal.rs

alamb · 2025-07-17T16:17:41Z

parquet-variant-compute/src/field_operations.rs

+use parquet_variant::VariantMetadata;
+use std::collections::HashSet;
+
+/// Represents a path element in a variant path


I think if you merge up from main this code will no longer be required

I agree, a lot of this becomes redundant, but get_field_bytes is able to get field bytes from an object at the byte level. Will this no longer be required too?

I am not quite sure what you are asking.

I think it might help to move shredded variant forward by writing the tests / examples of how variant_get should work with shredded arrays

I tried to work up a simple example here:

[Variant] WIP Tests for variant_get of shredded variants #7965

I saw your PR, it looks great. I'm trying to work up a few examples myself.

One idea would be to try and create the other examples from https://docs.google.com/document/d/1pw0AWoMQY3SjD7R4LgbPvMjG_xSCtXp3rZHkVp9jpZ4/ in code

parquet-variant-compute/src/variant_array.rs

carpecodeum · 2025-07-18T19:32:48Z

> Thank you for this PR @carpecodeum
>
> This is very cool
>
> I think there is already a variant_get implementation in https://github.com/apache/arrow-rs/blob/d809f19bc0fe2c3c1968f5111b6afa785d2e8bcd/parquet-variant-compute/src/variant_get.rs#L35-L34 contributed by @Samyak2
>
> To take the next steps and implement shredding I think we will need two things:
>
> A way to create shredded variants
>
> A way to represent shredded variants
>
> The idea of removing fields from Variants is interesting, though I wonder if that is an operation we would ever want to do on a single Variant instance -- it seems like removing fields for shredding will require copying the underlying bytes anyway, so I was thinking we might just want to create an output variant array entirely
>
> Something like
>
> fn variant_shred(input: VariantArray, output: VariantArray, schema: SchemaRef)
>
> Maybe it is worth looking at how the Java or Go implementations work

Is there any issue for implementing this? I would love to work on it

alamb · 2025-07-18T19:35:50Z

> Is there any issue for implementing this? I would love to work on it

I think we are discussing reading shredded variants on

[Variant] Support variant_get kernel for shredded variants #7941

We are discussing creating shredded variants on

[Variant] API to construct Shredded Variant Arrays #7895

I don't think we have enough of an idea of how this will work to break them down into finer grained tasks yet.

scovich

Partial review.

The overriding takeaway for me (which got in the way of reviewing the actual PR functionality) is that this PR has a dangerously high level of code duplication vs. existing code in the parquet-variant crate. Dangerous because it includes many bugs and inefficiencies that the other code already solved (which is ~inevitable with such a high amount of redundancy).

Can we please take the time to harmonize with existing code so reviewers can focus on the really important and exciting net new functionality instead of debugging a reinvented wheel?

parquet-variant-compute/src/variant_array.rs

scovich · 2025-07-21T12:19:28Z

parquet-variant-compute/src/variant_array.rs

+        if let Some(obj) = variant.as_object() {
+            let mut field_names = Vec::new();
+            for i in 0..obj.len() {
+                if let Some(field_name) = obj.field_name(i) {


Seems like this should be an unwrap, since the only way to get None here is by an out of bounds index?
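
As a minimal sketch of that suggestion: when the index comes from `0..len()`, a `None` from an index-based lookup can only mean a broken invariant, so surfacing it with `expect` is arguably clearer than silently skipping the entry. `ToyObject` below is a hypothetical stand-in, not the `parquet-variant` object type, and its `field_name` only imitates the shape of the call in the diff.

```rust
/// Hypothetical stand-in for an object with index-based field lookup.
struct ToyObject {
    names: Vec<String>,
}

impl ToyObject {
    fn len(&self) -> usize {
        self.names.len()
    }

    /// Returns `None` only when `i` is out of bounds.
    fn field_name(&self, i: usize) -> Option<&str> {
        self.names.get(i).map(String::as_str)
    }
}

/// Collects every field name; the index is always in bounds here,
/// so a `None` would indicate a bug rather than a case to skip.
fn field_names(obj: &ToyObject) -> Vec<&str> {
    (0..obj.len())
        .map(|i| obj.field_name(i).expect("index is within 0..len"))
        .collect()
}
```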

parquet-variant-compute/src/variant_array.rs

parquet-variant-compute/src/variant_parser.rs

parquet-variant-compute/src/field_operations.rs

carpecodeum · 2025-07-21T13:04:14Z

> Partial review.
>
> The overriding takeaway for me (which got in the way of reviewing the actual PR functionality) is that this PR has a dangerously high level of code duplication vs. existing code in the parquet-variant crate. Dangerous because it includes many bugs and inefficiencies that the other code already solved (which is ~inevitable with such a high amount of redundancy).
>
> Can we please take the time to harmonize with existing code so reviewers can focus on the really important and exciting net new functionality instead of debugging a reinvented wheel?

Hi @scovich, yes, this is still in progress: I am still working on removing the redundancy. It's just taking a bit more time than I expected because I had to focus on some other things over the past few days. I have kept it in draft to make sure people don't mistake it for a ready-to-go PR. Thanks for this review, it's very helpful.

scovich · 2025-07-21T19:52:25Z

> Hi @scovich, yes, this is still in progress: I am still working on removing the redundancy. It's just taking a bit more time than I expected because I had to focus on some other things over the past few days. I have kept it in draft to make sure people don't mistake it for a ready-to-go PR. Thanks for this review, it's very helpful.

Ah, sorry if I was over-eager to review this PR. LMK if there are specific things that would be helpful to review, or if it's better to just wait for it to finish baking?

carpecodeum · 2025-07-23T00:52:23Z

After taking another look at both this PR and other PRs that have been opened or merged over the week, it seems like the only non-duplicate portions of this PR are the "get_field_names" and the "remove_field(s)" functionality. I've deduplicated this PR and cleaned things up to reflect this current status.
@scovich has mentioned (#7935) that efficient, in-place Variant mutators are a piece that's still missing. Field removal naturally fits into a PR focused on that functionality.
I can take "add_field(s)" if we'd like both append and removal functionality as part of the initial implementation. Separately, convenience functions to streamline interacting with Variants seem to be absent as well. Is there an interest in implementing traits for Variants like Index/IndexMut? I'm happy to take either of these efforts.

Thanks @scovich and @alamb for reviewing. Do we think this PR is both substantial and useful enough on its own? Or, do we want to close this PR and open one that's more targeted? I'm happy to follow through with either, favoring whatever the reviewers think would keep things clean.

scovich · 2025-07-23T02:33:24Z

> convenience functions to streamline interacting with Variants seem to be absent as well. Is there an interest in implementing traits for Variants like Index/IndexMut? I'm happy to take either of these efforts.

Problem is, those traits return references (which the compiler then transparently dereferences). AFAIK, it's impossible to implement those traits if the return value is "manufactured" by the function call itself, because that would require returning a reference to a temporary object that goes out of scope as soon as the function returns.

... which is why we have Index for VariantMetadata (which does return a reference to an underlying string entry), but not for Variant
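
A minimal sketch of that distinction, using hypothetical toy types rather than the real `VariantMetadata` and `Variant`: `Index` works when the result already lives inside `self`, while a value that is decoded on demand has to be returned by value from an ordinary method.

```rust
use std::ops::Index;

/// Hypothetical stand-in for a dictionary that owns its strings.
struct Dictionary {
    entries: Vec<String>,
}

impl Index<usize> for Dictionary {
    type Output = str;

    // Fine: the returned reference borrows from data stored in `self`.
    fn index(&self, i: usize) -> &str {
        &self.entries[i]
    }
}

/// Hypothetical stand-in for a value that is decoded on demand from raw bytes.
struct Encoded {
    bytes: Vec<u8>,
}

#[derive(Debug, PartialEq)]
struct Decoded(u32);

impl Encoded {
    // The result is built inside the call, so it is returned by value.
    // `Index::index` would have to return `&Decoded`, i.e. a reference to a
    // local that is dropped when the function returns, which does not compile.
    fn get(&self, i: usize) -> Option<Decoded> {
        let chunk = self.bytes.get(i * 4..i * 4 + 4)?;
        Some(Decoded(u32::from_le_bytes([
            chunk[0], chunk[1], chunk[2], chunk[3],
        ])))
    }
}
```

With these toy types, `dict[1]` yields a borrowed `str`, whereas `enc.get(1)` hands back an owned `Decoded`.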

alamb · 2025-07-24T22:00:10Z

👋 Hi @carpecodeum -- I was checking in on this PR and seeing if it is ready for another round of review

I am not sure if you have seen, but @rdblue gave us an early holiday 🎁 in the form of example shredded parquet variant files in parquet-testing: apache/parquet-testing#90

alamb · 2025-07-24T22:06:13Z

> @scovich has mentioned (#7935) that efficient, in-place Variant mutators are a piece that's still missing. Field removal naturally fits into a PR focused on that functionality.

I don't really understand why in-place mutations are needed to implement get for VariantArray. I think that would only be necessary (potentially) for writing shredded values 🤔

It seems to me that we already have

Variant::get_path to get some arbitrary path from a single Variant:

arrow-rs/parquet-variant/src/variant.rs

Lines 1076 to 1103 in 16794ab

    
    /// # Example
    /// ```
    /// # use parquet_variant::{Variant, VariantBuilder, VariantObject, VariantPath};
    /// # let mut builder = VariantBuilder::new();
    /// # let mut obj = builder.new_object();
    /// # let mut list = obj.new_list("foo");
    /// # list.append_value("bar");
    /// # list.append_value("baz");
    /// # list.finish();
    /// # obj.finish().unwrap();
    /// # let (metadata, value) = builder.finish();
    /// // given a variant like `{"foo": ["bar", "baz"]}`
    /// let variant = Variant::new(&metadata, &value);
    /// // Accessing a non existent path returns None
    /// assert_eq!(variant.get_path(&VariantPath::from("non_existent")), None);
    /// // Access obj["foo"]
    /// let path = VariantPath::from("foo");
    /// let foo = variant.get_path(&path).expect("field `foo` should exist");
    /// assert!(foo.as_list().is_some(), "field `foo` should be a list");
    /// // Access foo[0]
    /// let path = VariantPath::from(0);
    /// let bar = foo.get_path(&path).expect("element 0 should exist");
    /// // bar is a string
    /// assert_eq!(bar.as_string(), Some("bar"));
    /// // You can also access nested paths
    /// let path = VariantPath::from("foo").join(0);
    /// assert_eq!(variant.get_path(&path).unwrap(), bar);
    /// ```

The variant_get compute kernel to extract a column of variants from a VariantArray:

arrow-rs/parquet-variant-compute/src/variant_get.rs

Line 35 in 16794ab

pub fn variant_get(input: &ArrayRef, options: GetOptions) -> Result<ArrayRef> {

What we are missing on the read side, from what I can tell, is the ability to read shredded variants (aka VariantArrays where the underlying array has a typed_value field)
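
To make the read-side gap a little more concrete, here is a toy sketch of what reading one row of a shredded column could involve. Everything in it is an assumption for illustration: `ShreddedRow`, `RowRead`, and `read_row` are hypothetical names, the typed column is fixed to `i64`, and the rule "prefer `typed_value`, otherwise fall back to the encoded `value`" is a simplification that ignores partially shredded objects. It is not the arrow-rs API.

```rust
/// Toy model of one row of a shredded column (not the arrow-rs types).
struct ShreddedRow {
    /// Present when the row matched the shredded type.
    typed_value: Option<i64>,
    /// Variant-encoded bytes, present when the row did not match.
    value: Option<Vec<u8>>,
}

/// What a reader would hand back for a single row.
#[derive(Debug, PartialEq)]
enum RowRead<'a> {
    Typed(i64),
    Encoded(&'a [u8]),
    Missing,
}

/// Assumed precedence for this sketch: typed column first, then the
/// encoded fallback, otherwise the row is treated as missing.
fn read_row(row: &ShreddedRow) -> RowRead<'_> {
    match (row.typed_value, row.value.as_deref()) {
        (Some(v), _) => RowRead::Typed(v),
        (None, Some(bytes)) => RowRead::Encoded(bytes),
        (None, None) => RowRead::Missing,
    }
}
```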

What would you think about trying to implement the code to make the tests in #7965 work?

@carpecodeum

# Which issue does this PR close?

- Part of #6736
- Closes #7941
- Closes #7965

# Rationale for this change

This is a proposal for how to structure shredded `VariantArray`s and the `variant_get` kernel.

If people like the basic idea I will file some more tickets to track additional follow on work.

It is based on ideas from @carpecodeum in #7946 and @scovich in #7915.

I basically took the tests from #7965 and the conversation with @scovich recorded from #7941 (comment) and I bashed out how this might look.

# What changes are included in this PR?

1. Update `VariantArray` to represent shredding
2. Add code to `variant_get` to support extracting paths as both variants and typed fields
3. A pattern that I think can represent shredding and extraction
4. Tests for same

Note there are many things that are NOT in this PR that I envision doing as follow on PRs:

1. Support and implementing `Path`s
2. Support for shredded objects
3. Support shredded lists
4. Support nested objects / lists
5. Full casting support
6. Support for other output types: `StringArray`, `StringViewArray`, etc
7. Many performance improvements

# Are these changes tested?

Yes

# Are there any user-facing changes?

New feature

---------

Co-authored-by: Samyak Sarnayak
Co-authored-by: Ryan Johnson

carpecodeum force-pushed the variant-shredding branch from 18d88b0 to b1afed1 Compare July 16, 2025 21:56

alamb reviewed Jul 17, 2025

carpecodeum force-pushed the variant-shredding branch 2 times, most recently from 9a616b5 to c712747 Compare July 18, 2025 18:39

alamb reviewed Jul 18, 2025

scovich reviewed Jul 21, 2025

github-actions bot added and then removed the parquet label (Changes to the parquet crate) Jul 21, 2025

carpecodeum added 17 commits July 22, 2025 19:25

[ADD] Path-based field extraction for VariantArray

4c1d6f2

[FIX] sanitise variant_array file

5ac22a7

[ADD] add hybrid approach for field access

1ef8926

[FIX] fix variant_array implementation

d782197

[ADD] add support for path operations on different data types

948bb39

[FIX] minor fixes

e16af07

[FIX] fix formatting issues

3da46b8

[FIX] remove redundancy

7c03e21

[FIX] improve the tests

eb8bb69

[FIX] refactor code for modularity

397c717

[FIX] fix issues with the spec

dda30ea

remove redundancy with field_operations.rs and variant_parser.rs

32c55ea

[REMOVE] revert field_operations.rs

3b3c191

[REMOVE] remove extra lines in cargo.toml

01f0be7

[REMOVE] remove variant_parser.rs file as decoder.rs already has majo…

eb23834

…r functionalities

[FIX] make code modular

cc5e149

[FIX] clippy and lint issues

30e9cd2

[FIX] remove unsafe functions doing byte operations

7dd6c23

carpecodeum force-pushed the variant-shredding branch from 572af94 to 7dd6c23 Compare July 23, 2025 00:49

github-actions bot added the parquet Changes to the parquet crate label Jul 23, 2025

alamb mentioned this pull request Jul 29, 2025

[Variant] Add variant_get and Shredded VariantArray #8021

Merged
