-
Notifications
You must be signed in to change notification settings - Fork 1k
[Variant] Implement row builders for cast_to_variant #8299
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few hopefully useful notes for reviewers
/// Convert the input array to a `VariantArray` row by row, using `method` | ||
/// not requiring a generic type to downcast the generic array to a specific | ||
/// array type and `cast_fn` to transform each element to a type compatible with Variant | ||
#[allow(unused)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These macros are fully unused now that #8280 has merged, but I wasn't sure if they are useful to keep around "just in case" ?
/// Appending NULL to a variant array produces an actual NULL value | ||
fn append_null(&mut self) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This new append_null
capability was critical to making all this work:
- Appending NULL to a variant array builder produces an actual NULL
- Appending NULL to an object field builder is a no-op (field is missing, as per variant spec)
- Otherwise, appending NULL to a variant value or variant list produces
Variant::Null
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like it will overwrite any value that is currently set -- so if someone did
builder.append_value(..);
builder.append_null(); // will wipe out the value ??
I am not sure this is a problem or not, I just wanted to confirm my understanding
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is correct. The variant builder API is pretty broken in general if builder methods are invoked more than once, so IMO this isn't making it any worse.
Ok(()) | ||
} | ||
|
||
struct ObjectFieldBuilder<'o, 'v, 's> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved to builders.rs
TBH, I never quite understood why it was defined here in the first place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file was mostly gutted (but unit tests left in place), with the real work now being done by row builders defined in arrow_to_variant.rs via make_arrow_to_variant_row_builder
.
// TODO do we need a cast_with_options to allow specifying conversion behavior, | ||
// e.g. how to handle overflows, whether to convert to Variant::Null or return | ||
// an error, etc. ? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Neither changed nor moved... it's just diff weirdness.
// WARNING: normalized_keys panics if values is empty | ||
let normalized_keys = match values.len() { | ||
0 => Vec::new(), | ||
_ => dict_array.normalized_keys(), | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NOTE: I'm pretty sure the original code did not handle this corner case
Ok(()) | ||
} | ||
|
||
fn set_run_for_index(&mut self, index: usize) -> Result<(), ArrowError> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was easily the most annoying part of the whole columnar/row-builder transformation -- there's technically no guarantee that subsequent append_row
calls use sequentially increasing index values, so we have to handle jumps both forward and backward while hopefully being efficient in the common case of sequential index values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ANother reason that a batch API might be better
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Attn @codephage2020 -- this PR conflicts with
Most of the code that other PR touched has been deleted by this PR, so I did my best to capture the intent in the new framework.
While resolving conflicts, I also had to expand the define_row_builder!
macro to handle fallible conversions, which increased its complexity a bit. Hopefully still readable. The merge commit 26ae5c0 actually makes it pretty easy to see what changed.
/// Options for controlling the behavior of `cast_to_variant_with_options`. | ||
#[derive(Debug, Clone, PartialEq, Eq)] | ||
pub struct CastOptions { | ||
/// If true, return error on conversion failure. If false, insert null for failed conversions. | ||
pub strict: bool, | ||
} | ||
|
||
impl Default for CastOptions { | ||
fn default() -> Self { | ||
Self { strict: true } | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved to type_conversions.rs
}}; | ||
/// Options for controlling the behavior of `cast_to_variant_with_options`. | ||
#[derive(Debug, Clone, PartialEq, Eq)] | ||
pub struct CastOptions { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved from cast_to_variant.rs
} | ||
|
||
/// A [`VariantBuilderExt`] that inserts a new field into a variant object. | ||
pub struct ObjectFieldBuilder<'o, 'v, 's> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved from from_json.rs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @scovich . I went through this fairly carefully and it is very impressive. I like the API that hides converting an ArrowArray and writes to a VariantBuilderExt
👍
My biggest comment / suggestion is to consider making the API vectorized (convert the entire Arrow Array) but I think we can do that as a follow on PR
cc @liamzwbao @codephage2020 @carpecodeum and @klion26 for your reviewing pleasure.
Given the amount of code in this PR and likelihood of conflicts, I would like to merge it sooner rather than later. Please everyone give a shout out if they want more time to review, or if they are happy to review post merge
/// Appending NULL to a variant array produces an actual NULL value | ||
fn append_null(&mut self) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like it will overwrite any value that is currently set -- so if someone did
builder.append_value(..);
builder.append_null(); // will wipe out the value ??
I am not sure this is a problem or not, I just wanted to confirm my understanding
} | ||
|
||
impl<'a> ArrowToVariantRowBuilder<'a> { | ||
pub fn append_row( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there any reason in the future this couldn't be vectorized?
Something like
pub fn append_all_rows( &mut self,
builder: &mut impl VariantBuilderExt,
...
as written here it is going to have a per-row type dispatch
I think we could do it as a follow on PR potentially if needed
} | ||
|
||
/// Factory function to create the appropriate row builder for a given DataType | ||
pub(crate) fn make_arrow_to_variant_row_builder<'a>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😮
} | ||
} | ||
|
||
fn append_row(&self, builder: &mut impl VariantBuilderExt, index: usize) -> Result<(), ArrowError> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as mentioned above, it might eventually be a good idea to make a vectorized version of this method (that gets given something that can create VariantBuilderExt) but let's get this working first and then optimize later if/when needed
|
||
define_row_builder!( | ||
struct Decimal32ArrowToVariantBuilder<'a> { | ||
scale: i8, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is quite clever
Ok(()) | ||
} | ||
|
||
fn set_run_for_index(&mut self, index: usize) -> Result<(), ArrowError> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ANother reason that a batch API might be better
let mut variant_builder = array_builder.variant_builder(); | ||
row_builder.append_row(&mut variant_builder, 0).unwrap(); | ||
variant_builder.finish(); | ||
|
||
// Test null value | ||
let mut variant_builder = array_builder.variant_builder(); | ||
row_builder.append_row(&mut variant_builder, 1).unwrap(); | ||
variant_builder.finish(); | ||
|
||
// Test second value | ||
let mut variant_builder = array_builder.variant_builder(); | ||
row_builder.append_row(&mut variant_builder, 2).unwrap(); | ||
variant_builder.finish(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code segment is repeated many times. Maybe we could make a function like this
fn copy_all(input_array: &dyn Array) -> VariantArray {
for i in 0..input_array.len() {
let mut variant_builder = array_builder.variant_builder();
row_builder.append_row(&mut variant_builder, 1).unwrap();
variant_builder.finish();
}
}
A more ideal version in my mind would convert the entire Array (not just a row at a time). Something like this (untested). Subsets of rows could be handled via slice()
ing the Array
// make conevrter
let mut row_builder =
make_arrow_to_variant_row_builder(int_array.data_type(), &int_array, &options).unwrap();
// create output variant array
let mut variant_builder = array_builder.variant_builder();
// convert all rows from the array
row_builder.convert(&mut array_builder).unwrap();
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pointing out the boilerplate explosion. It should be fixed in the latest commit (~400 LoC eliminated)
))); | ||
} | ||
// Process each row using the row builder | ||
for i in 0..input.len() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that is certainly nice
Re
And #8299 (comment) -- that run-end encoding could be handled more easily in a vectorized API. And #8299 (comment) that suggests an And #8299 (comment) that also wonders about vectorization. I'll try to give one response that covers them all: I think it's reasonable to consider adding a bulk append type API, but we have to be cognizant of the limitations and challenges it will face:
|
I'm happy to review post-merge. I'm following/understanding the code now, and don't have any blocking issue for now. The implementation is cool; I've learned a lot from this PR. I will post my comments if there are any, after the PR has been merged. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While resolving conflicts, I also had to expand the
define_row_builder!
macro to handle fallible conversions, which increased its complexity a bit. Hopefully still readable.
Really impressive work—Learned a lot, and no issues found. Everything looks solid to me! Thanks!
@alamb -- I have a few cosmetic cleanups to push (in a few minutes) and then this can merge whenever you're ready |
Once this merges, I did some pathfinding on the bulk append and will push a couple (stacked) PR to start the discussion. |
Awesome -- thanks @scovich and @codephage2020 -- this is really cool |
} | ||
|
||
#[test] | ||
fn test_primitive_row_builder() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These added tests are very nice tests!
Note to reviewers: This PR includes 1600+ LoC of new unit tests. The actual changes are half that big.
Which issue does this PR close?
We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax.
cast_to_variant
#8310Rationale for this change
The original
cast_to_variant
code did columnar conversions of types to variant. For primitive types this worked ok, but for deeply nested types it means repeatedly creating new variants (and variant metadata), only to re-code them by copying the variant values to new arrays (with new metadata and field ids). Very expensive.What changes are included in this PR?
Follow the example of #8280, and introduce a row builder concept that takes individual array values and writes them to an
impl VariantBuilderExt
. Row builders for complex types instantiate the appropriate list or object builder to pass to their children.Are these changes tested?
Existing unit tests continue to pass. Extensive new unit tests added as well.
Are there any user-facing changes?
VariantBuilderExt
has a newappend_null
method.ObjectFieldBuilder
moved tobuilder.rs
and made public