Skip to content

Conversation

scovich
Copy link
Contributor

@scovich scovich commented Sep 9, 2025

Note to reviewers: This PR includes 1600+ LoC of new unit tests. The actual changes are half that big.

Which issue does this PR close?

We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax.

Rationale for this change

The original cast_to_variant code did columnar conversions of types to variant. For primitive types this worked ok, but for deeply nested types it means repeatedly creating new variants (and variant metadata), only to re-code them by copying the variant values to new arrays (with new metadata and field ids). Very expensive.

What changes are included in this PR?

Follow the example of #8280, and introduce a row builder concept that takes individual array values and writes them to an impl VariantBuilderExt. Row builders for complex types instantiate the appropriate list or object builder to pass to their children.

Are these changes tested?

Existing unit tests continue to pass. Extensive new unit tests added as well.

Are there any user-facing changes?

  • VariantBuilderExt has a new append_null method.
  • ObjectFieldBuilder moved to builder.rs and made public

@github-actions github-actions bot added the parquet-variant parquet-variant* crates label Sep 9, 2025
Copy link
Contributor Author

@scovich scovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few hopefully useful notes for reviewers

/// Convert the input array to a `VariantArray` row by row, using `method`
/// not requiring a generic type to downcast the generic array to a specific
/// array type and `cast_fn` to transform each element to a type compatible with Variant
#[allow(unused)]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These macros are fully unused now that #8280 has merged, but I wasn't sure if they are useful to keep around "just in case" ?

Comment on lines +206 to +207
/// Appending NULL to a variant array produces an actual NULL value
fn append_null(&mut self) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This new append_null capability was critical to making all this work:

  • Appending NULL to a variant array builder produces an actual NULL
  • Appending NULL to an object field builder is a no-op (field is missing, as per variant spec)
  • Otherwise, appending NULL to a variant value or variant list produces Variant::Null

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like it will overwrite any value that is currently set -- so if someone did

builder.append_value(..);
builder.append_null(); // will wipe out the value ??

I am not sure this is a problem or not, I just wanted to confirm my understanding

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is correct. The variant builder API is pretty broken in general if builder methods are invoked more than once, so IMO this isn't making it any worse.

Ok(())
}

struct ObjectFieldBuilder<'o, 'v, 's> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to builders.rs

TBH, I never quite understood why it was defined here in the first place.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file was mostly gutted (but unit tests left in place), with the real work now being done by row builders defined in arrow_to_variant.rs via make_arrow_to_variant_row_builder.

Comment on lines +69 to +71
// TODO do we need a cast_with_options to allow specifying conversion behavior,
// e.g. how to handle overflows, whether to convert to Variant::Null or return
// an error, etc. ?
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Neither changed nor moved... it's just diff weirdness.

Comment on lines +698 to +702
// WARNING: normalized_keys panics if values is empty
let normalized_keys = match values.len() {
0 => Vec::new(),
_ => dict_array.normalized_keys(),
};
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NOTE: I'm pretty sure the original code did not handle this corner case

Ok(())
}

fn set_run_for_index(&mut self, index: usize) -> Result<(), ArrowError> {
Copy link
Contributor Author

@scovich scovich Sep 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was easily the most annoying part of the whole columnar/row-builder transformation -- there's technically no guarantee that subsequent append_row calls use sequentially increasing index values, so we have to handle jumps both forward and backward while hopefully being efficient in the common case of sequential index values.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ANother reason that a batch API might be better

@scovich
Copy link
Contributor Author

scovich commented Sep 9, 2025

CC @alamb @carpecodeum

Copy link
Contributor Author

@scovich scovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Attn @codephage2020 -- this PR conflicts with

Most of the code that other PR touched has been deleted by this PR, so I did my best to capture the intent in the new framework.

While resolving conflicts, I also had to expand the define_row_builder! macro to handle fallible conversions, which increased its complexity a bit. Hopefully still readable. The merge commit 26ae5c0 actually makes it pretty easy to see what changed.

Comment on lines -49 to -60
/// Options for controlling the behavior of `cast_to_variant_with_options`.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct CastOptions {
/// If true, return error on conversion failure. If false, insert null for failed conversions.
pub strict: bool,
}

impl Default for CastOptions {
fn default() -> Self {
Self { strict: true }
}
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to type_conversions.rs

}};
/// Options for controlling the behavior of `cast_to_variant_with_options`.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct CastOptions {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved from cast_to_variant.rs

}

/// A [`VariantBuilderExt`] that inserts a new field into a variant object.
pub struct ObjectFieldBuilder<'o, 'v, 's> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved from from_json.rs

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @scovich . I went through this fairly carefully and it is very impressive. I like the API that hides converting an ArrowArray and writes to a VariantBuilderExt 👍

My biggest comment / suggestion is to consider making the API vectorized (convert the entire Arrow Array) but I think we can do that as a follow on PR

cc @liamzwbao @codephage2020 @carpecodeum and @klion26 for your reviewing pleasure.

Given the amount of code in this PR and likelihood of conflicts, I would like to merge it sooner rather than later. Please everyone give a shout out if they want more time to review, or if they are happy to review post merge

Comment on lines +206 to +207
/// Appending NULL to a variant array produces an actual NULL value
fn append_null(&mut self) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like it will overwrite any value that is currently set -- so if someone did

builder.append_value(..);
builder.append_null(); // will wipe out the value ??

I am not sure this is a problem or not, I just wanted to confirm my understanding

}

impl<'a> ArrowToVariantRowBuilder<'a> {
pub fn append_row(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there any reason in the future this couldn't be vectorized?

Something like

    pub fn append_all_rows(        &mut self,
        builder: &mut impl VariantBuilderExt,
    
...

as written here it is going to have a per-row type dispatch

I think we could do it as a follow on PR potentially if needed

}

/// Factory function to create the appropriate row builder for a given DataType
pub(crate) fn make_arrow_to_variant_row_builder<'a>(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😮

}
}

fn append_row(&self, builder: &mut impl VariantBuilderExt, index: usize) -> Result<(), ArrowError> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as mentioned above, it might eventually be a good idea to make a vectorized version of this method (that gets given something that can create VariantBuilderExt) but let's get this working first and then optimize later if/when needed


define_row_builder!(
struct Decimal32ArrowToVariantBuilder<'a> {
scale: i8,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is quite clever

Ok(())
}

fn set_run_for_index(&mut self, index: usize) -> Result<(), ArrowError> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ANother reason that a batch API might be better

Comment on lines 901 to 913
let mut variant_builder = array_builder.variant_builder();
row_builder.append_row(&mut variant_builder, 0).unwrap();
variant_builder.finish();

// Test null value
let mut variant_builder = array_builder.variant_builder();
row_builder.append_row(&mut variant_builder, 1).unwrap();
variant_builder.finish();

// Test second value
let mut variant_builder = array_builder.variant_builder();
row_builder.append_row(&mut variant_builder, 2).unwrap();
variant_builder.finish();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code segment is repeated many times. Maybe we could make a function like this

fn copy_all(input_array: &dyn Array) -> VariantArray {
  for i in 0..input_array.len() {
        let mut variant_builder = array_builder.variant_builder();
        row_builder.append_row(&mut variant_builder, 1).unwrap();
        variant_builder.finish();
   }
}

A more ideal version in my mind would convert the entire Array (not just a row at a time). Something like this (untested). Subsets of rows could be handled via slice()ing the Array

// make conevrter
        let mut row_builder =
            make_arrow_to_variant_row_builder(int_array.data_type(), &int_array, &options).unwrap();
// create output variant array
       let mut variant_builder = array_builder.variant_builder();
// convert all rows from the array        
        row_builder.convert(&mut array_builder).unwrap();

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing out the boilerplate explosion. It should be fixed in the latest commit (~400 LoC eliminated)

)));
}
// Process each row using the row builder
for i in 0..input.len() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that is certainly nice

@scovich
Copy link
Contributor Author

scovich commented Sep 9, 2025

Re

My biggest comment / suggestion is to consider making the API vectorized (convert the entire Arrow Array) but I think we can do that as a follow on PR

And #8299 (comment) -- that run-end encoding could be handled more easily in a vectorized API.

And #8299 (comment) that suggests an append_all_rows() method.

And #8299 (comment) that also wonders about vectorization.

I'll try to give one response that covers them all:

I think it's reasonable to consider adding a bulk append type API, but we have to be cognizant of the limitations and challenges it will face:

  • We will need a new trait that knows how to create (and finish!) variant builder instances
  • Variant building is inherently row-based, so any builder that ultimately needs to produce a variant array or variant object as its output will have a trivial append_all_rows that just calls append_row in a loop (like today), in order to recursively build up the fields/elements of the variant it creates.
  • The API would be very nice for converting primitive arrays to variant, because they don't need to recurse on anything. Also nice because we could potentially define a specialized impl just for VariantArrayBuilder, so we don't have to deal with that new variant builder create+finish trait.
  • Casting a list of primitive values is an interesting intermediate case, where one should be able to append all the elements of a given list in one shot. But that might require the new create+finish trait? Or maybe it just needs a second specialization for ListBuilder?
  • Maybe instead of a no-arg append_all_rows(), we should consider a ranged append_many_rows(start..end)? One could always pass .. to request encoding of all rows.

@klion26
Copy link
Member

klion26 commented Sep 10, 2025

Given the amount of code in this PR and likelihood of conflicts, I would like to merge it sooner rather than later. Please everyone give a shout out if they want more time to review, or if they are happy to review post merge

I'm happy to review post-merge. I'm following/understanding the code now, and don't have any blocking issue for now. The implementation is cool; I've learned a lot from this PR. I will post my comments if there are any, after the PR has been merged.

Copy link
Contributor

@codephage2020 codephage2020 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While resolving conflicts, I also had to expand the define_row_builder! macro to handle fallible conversions, which increased its complexity a bit. Hopefully still readable.

Really impressive work—Learned a lot, and no issues found. Everything looks solid to me! Thanks!

@scovich
Copy link
Contributor Author

scovich commented Sep 10, 2025

@alamb -- I have a few cosmetic cleanups to push (in a few minutes) and then this can merge whenever you're ready

@scovich
Copy link
Contributor Author

scovich commented Sep 10, 2025

Once this merges, I did some pathfinding on the bulk append and will push a couple (stacked) PR to start the discussion.

@alamb alamb merged commit dc9c3cd into apache:main Sep 10, 2025
13 checks passed
@alamb
Copy link
Contributor

alamb commented Sep 10, 2025

Awesome -- thanks @scovich and @codephage2020 -- this is really cool

}

#[test]
fn test_primitive_row_builder() {
Copy link
Member

@klion26 klion26 Sep 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These added tests are very nice tests!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

parquet-variant parquet-variant* crates

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Variant] Use row-oriented builders in cast_to_variant

4 participants