scovich
Contributor

@scovich scovich commented Aug 22, 2025

Which issue does this PR close?

Rationale for this change

When manipulating existing variant values (unshredding, removing fields, etc), the metadata column is already defined and already contains all necessary field ids. In fact, defining new/different field ids would require rewriting the bytes of those already-encoded variant values. We need a way to build variant values that rely on an existing metadata dictionary.

What changes are included in this PR?

  • MetadataBuilder is now a trait, and most methods that work with metadata builders now take &mut dyn MetadataBuilder instead of &mut MetadataBuilder.
  • The old MetadataBuilder struct is now BasicMetadataBuilder, which implements the MetadataBuilder trait.
  • Define a ReadOnlyMetadataBuilder that wraps a VariantMetadata and also implements MetadataBuilder.
  • Update the try_binary_search_range_by helper method to be more general, so we can define an efficient VariantMetadata::get_entry that returns the field id for a given field name.
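A rough sketch of the shape these changes describe, for readers skimming the thread. It is not the code from this PR: the method name `try_upsert_field_name`, the `String` error type, the struct fields, and the helper `field_ids` are all assumptions made for illustration, and the read-only builder borrows a sorted slice of names instead of a real `VariantMetadata`.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified stand-in for the trait described above.
trait MetadataBuilder {
    /// Returns the field id for `name`, or an error if no id can be provided.
    fn try_upsert_field_name(&mut self, name: &str) -> Result<u32, String>;
}

/// Writable builder: unknown names are appended and receive a fresh id.
#[derive(Default)]
struct BasicMetadataBuilder {
    ids: HashMap<String, u32>,
    names: Vec<String>,
}

impl MetadataBuilder for BasicMetadataBuilder {
    fn try_upsert_field_name(&mut self, name: &str) -> Result<u32, String> {
        if let Some(&id) = self.ids.get(name) {
            return Ok(id);
        }
        let id = self.names.len() as u32;
        self.names.push(name.to_string());
        self.ids.insert(name.to_string(), id);
        Ok(id)
    }
}

/// Read-only builder: looks names up in an existing dictionary and never adds to it.
/// Assumption: the dictionary is sorted, so the index doubles as the field id.
struct ReadOnlyMetadataBuilder<'m> {
    sorted_names: &'m [&'m str],
}

impl MetadataBuilder for ReadOnlyMetadataBuilder<'_> {
    fn try_upsert_field_name(&mut self, name: &str) -> Result<u32, String> {
        // Binary search by comparing each probed entry against the requested name.
        self.sorted_names
            .binary_search_by(|probe| (*probe).cmp(name))
            .map(|idx| idx as u32)
            .map_err(|_| format!("field name '{name}' is not in the existing metadata"))
    }
}

/// Value-building code only needs the trait object, not the concrete builder.
fn field_ids(builder: &mut dyn MetadataBuilder, names: &[&str]) -> Result<Vec<u32>, String> {
    names.iter().map(|n| builder.try_upsert_field_name(n)).collect()
}
```

With this shape, the same `field_ids` call works against either builder; the read-only one simply fails for a name that the existing dictionary does not contain.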

Are these changes tested?

Existing tests cover the basic metadata builder. New tests added to cover the read-only metadata builder.

Are there any user-facing changes?

The renamed BasicMetadataBuilder (breaking), the new MetadataBuilder trait (breaking), and the new ReadOnlyMetadataBuilder.

@github-actions github-actions bot added the parquet-variant parquet-variant* crates label Aug 22, 2025
@scovich
Contributor Author

scovich commented Aug 22, 2025

The docs error will self-resolve once #8206 merges:

error: public documentation for `MetadataBuilder` links to private item `ValueBuilder`
   --> parquet-variant/src/builder.rs:437:7
    |
437 | /// [`ValueBuilder`]. The trait provides methods for managing field names and their IDs, as well as
    |       ^^^^^^^^^^^^ this item is private
    |

@alamb
Contributor

alamb commented Aug 23, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.14.0-1014-gcp #15~24.04.1-Ubuntu SMP Fri Jul 25 23:26:08 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing metadata-builder-trait (1a4493b) to cec24a0 diff
BENCH_NAME=variant_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench variant_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=metadata-builder-trait
Results will be posted here when complete

@alamb
Contributor

alamb commented Aug 23, 2025

🤖: Benchmark completed

group                                                                main                                   metadata-builder-trait
-----                                                                ----                                   ----------------------
batch_json_string_to_variant json_list 8k string                     1.00     25.9±0.15ms        ? ?/sec    1.02     26.4±0.09ms        ? ?/sec
batch_json_string_to_variant random_json(2633 bytes per document)    1.00    298.8±6.38ms        ? ?/sec    1.02    306.3±6.90ms        ? ?/sec
batch_json_string_to_variant repeated_struct 8k string               1.00      7.7±0.04ms        ? ?/sec    1.02      7.9±0.16ms        ? ?/sec
variant_get_primitive                                                1.00   1077.9±3.16µs        ? ?/sec    1.02   1102.7±4.00µs        ? ?/sec

@alamb
Contributor

alamb commented Aug 23, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.14.0-1014-gcp #15~24.04.1-Ubuntu SMP Fri Jul 25 23:26:08 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing metadata-builder-trait (1a4493b) to cec24a0 diff
BENCH_NAME=variant_builder
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench variant_builder
BENCH_FILTER=
BENCH_BRANCH_NAME=metadata-builder-trait
Results will be posted here when complete

@alamb
Contributor

alamb commented Aug 23, 2025

🤖: Benchmark completed

group                                       main                                   metadata-builder-trait
-----                                       ----                                   ----------------------
bench_extend_metadata_builder               1.00     53.4±2.28ms        ? ?/sec    1.03     55.1±3.44ms        ? ?/sec
bench_object_field_names_reverse_order      1.00     19.6±0.46ms        ? ?/sec    1.04     20.3±0.48ms        ? ?/sec
bench_object_list_partially_same_schema     1.00  1218.7±15.41µs        ? ?/sec    1.04  1266.7±14.56µs        ? ?/sec
bench_object_list_same_schema               1.00     24.4±0.22ms        ? ?/sec    1.04     25.3±0.19ms        ? ?/sec
bench_object_list_unknown_schema            1.00     13.1±0.16ms        ? ?/sec    1.03     13.6±0.16ms        ? ?/sec
bench_object_partially_same_schema          1.00      3.2±0.01ms        ? ?/sec    1.02      3.3±0.01ms        ? ?/sec
bench_object_same_schema                    1.00     37.0±0.10ms        ? ?/sec    1.02     37.7±0.12ms        ? ?/sec
bench_object_unknown_schema                 1.00     15.9±0.02ms        ? ?/sec    1.01     16.1±0.05ms        ? ?/sec
iteration/unvalidated_fallible_iteration    1.00      2.6±0.01ms        ? ?/sec    1.01      2.7±0.01ms        ? ?/sec
iteration/validated_iteration               1.00     49.5±0.28µs        ? ?/sec    1.01     49.8±0.11µs        ? ?/sec
validation/unvalidated_construction         1.00      6.7±0.06µs        ? ?/sec    1.00      6.7±0.02µs        ? ?/sec
validation/validated_construction           1.00     60.6±0.14µs        ? ?/sec    1.02     61.5±0.21µs        ? ?/sec
validation/validation_cost                  1.00     53.8±0.11µs        ? ?/sec    1.05     56.5±0.12µs        ? ?/sec

@alamb
Contributor

alamb commented Aug 23, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.14.0-1014-gcp #15~24.04.1-Ubuntu SMP Fri Jul 25 23:26:08 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing metadata-builder-trait (1a4493b) to cec24a0 diff
BENCH_NAME=variant_validation
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench variant_validation
BENCH_FILTER=
BENCH_BRANCH_NAME=metadata-builder-trait
Results will be posted here when complete

@alamb
Contributor

alamb commented Aug 23, 2025

🤖: Benchmark completed

group                               main                                   metadata-builder-trait
-----                               ----                                   ----------------------
bench_validate_complex_object       1.00    228.3±1.68µs        ? ?/sec    1.00    229.3±0.69µs        ? ?/sec
bench_validate_large_nested_list    1.00     19.4±0.03ms        ? ?/sec    1.00     19.5±0.05ms        ? ?/sec
bench_validate_large_object         1.00     54.9±0.09ms        ? ?/sec    1.00     54.9±0.09ms        ? ?/sec

Contributor

@alamb alamb left a comment

Thank you @scovich -- this makes sense to me. I had a suggestion regarding naming, but overall it looks really nice 👌

cc @codephage2020 and @klion26, perhaps you would like to review this too

/// Builder for constructing metadata for [`Variant`] values.
///
/// This is used internally by the [`VariantBuilder`] to construct the metadata
///
/// You can use an existing `Vec<u8>` as the metadata buffer by using the `from` impl.
#[derive(Default, Debug)]
-struct MetadataBuilder {
+struct BasicMetadataBuilder {
Contributor

I was thinking that Basic is a bit generic, and it is somewhat unclear what the difference between a "ReadOnly" builder and a "Basic" builder is.

Here are some other ideas for names:

  • WriteableMetadataBuilder -- can update the metadata (this is my preference)
  • OwnedMetadataBuilder -- the builder owns the underlying structures (and thus can make changes)

Contributor Author

Went with WritableMetadataBuilder.

@@ -589,14 +675,14 @@ enum ParentState<'a> {
Variant {
value_builder: &'a mut ValueBuilder,
saved_value_builder_offset: usize,
-metadata_builder: &'a mut MetadataBuilder,
+metadata_builder: &'a mut dyn MetadataBuilder,
Contributor

using &dyn here I think means all accesses now need a level of indirection (and I think we can see this slowdown of a few percent in the benchmarks)

I wonder if (maybe as a follow on PR) we can consider making parent state generic and squeezing that performance back. It may also be premature optimization for now
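To make the trade-off concrete, here is a minimal sketch of the two options: a parent that holds `&mut dyn MetadataBuilder` versus one that is generic over the builder type. Everything in it (`CountingBuilder`, `DynParent`, `GenericParent`, `upsert_field_name`, `add_field`) is hypothetical and much smaller than the real `ParentState`; it only illustrates where the indirection comes from and why a type parameter spreads to every type that holds the parent.

```rust
/// Hypothetical, simplified trait; the real one has more methods.
trait MetadataBuilder {
    fn upsert_field_name(&mut self, name: &str) -> u32;
}

/// Toy builder that assigns ids in insertion order.
#[derive(Default)]
struct CountingBuilder {
    names: Vec<String>,
}

impl MetadataBuilder for CountingBuilder {
    fn upsert_field_name(&mut self, name: &str) -> u32 {
        if let Some(pos) = self.names.iter().position(|n| n.as_str() == name) {
            return pos as u32;
        }
        self.names.push(name.to_string());
        (self.names.len() - 1) as u32
    }
}

/// Dynamic dispatch: one non-generic type, but each call goes through a vtable.
struct DynParent<'a> {
    metadata_builder: &'a mut dyn MetadataBuilder,
}

impl DynParent<'_> {
    fn add_field(&mut self, name: &str) -> u32 {
        self.metadata_builder.upsert_field_name(name)
    }
}

/// Static dispatch: the compiler generates one copy per `M` and may inline the call,
/// but `M` must now appear on every type that stores or creates this parent.
struct GenericParent<'a, M: MetadataBuilder> {
    metadata_builder: &'a mut M,
}

impl<M: MetadataBuilder> GenericParent<'_, M> {
    fn add_field(&mut self, name: &str) -> u32 {
        self.metadata_builder.upsert_field_name(name)
    }
}
```

Both parents behave identically from the caller's point of view; the difference is only in how the call is dispatched and in how far the type parameter propagates through the API.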

Contributor Author

@scovich scovich Aug 23, 2025

Yeah I was trying to avoid the "pollution" that generics would cause, which is clearly visible in my first attempt:

It's not just ParentState, but also ListBuilder, ObjectBuilder and even the VariantBuilderExt trait. Plus it requires a bunch of methods that are only defined for specific combinations of types.

But we may need to go generic now, because VariantArrayVariantBuilder::finish needs access to BasicMetadataBuilder::finish in order to work correctly, and that's unavailable because the parent state has upcast it to &mut dyn MetadataBuilder.

I initially reached for std::any::Any (with an as_any-style downcast), but it turns out that types with non-static lifetimes cannot implement std::any::Any, which requires 'static, because the lifetime information would be lost.

In order to get the latest merge with upstream to compile, I had to define a new MetadataBuilder::finish trait method to expose the underlying BasicMetadataBuilder::finish, but that also forced a no-op ReadOnlyMetadataBuilder::finish.

Honestly, the generic mess is messy enough that I lean strongly toward keeping the dyn indirection unless we're really certain we need generics for performance or functionality reasons.
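A simplified sketch of the `finish` workaround described in this comment, assuming a toy encoding. The return type of `finish`, the struct fields, and the helper `finish_via_parent` are made up for illustration and are not the signatures from this PR. The point is that once the caller holds only `&mut dyn MetadataBuilder`, anything it needs from the concrete builder has to be a trait method, so the read-only builder ends up with a `finish` that does nothing.

```rust
/// Hypothetical, simplified trait; the real signatures may differ.
trait MetadataBuilder {
    fn upsert_field_name(&mut self, name: &str) -> u32;
    /// Flushes pending dictionary entries and returns the metadata size in bytes.
    fn finish(&mut self) -> usize;
}

#[derive(Default)]
struct BasicMetadataBuilder {
    names: Vec<String>,
    buffer: Vec<u8>,
}

impl MetadataBuilder for BasicMetadataBuilder {
    fn upsert_field_name(&mut self, name: &str) -> u32 {
        match self.names.iter().position(|n| n.as_str() == name) {
            Some(pos) => pos as u32,
            None => {
                self.names.push(name.to_string());
                (self.names.len() - 1) as u32
            }
        }
    }

    fn finish(&mut self) -> usize {
        // Toy encoding (newline-terminated names), not the Variant metadata format.
        for name in self.names.drain(..) {
            self.buffer.extend_from_slice(name.as_bytes());
            self.buffer.push(b'\n');
        }
        self.buffer.len()
    }
}

/// Borrows existing metadata. Because of the borrow, this type is not `'static`,
/// so it could not be recovered from a trait object via `std::any::Any`.
struct ReadOnlyMetadataBuilder<'m> {
    existing: &'m [u8],
    sorted_names: &'m [&'m str],
}

impl MetadataBuilder for ReadOnlyMetadataBuilder<'_> {
    fn upsert_field_name(&mut self, name: &str) -> u32 {
        // Toy behavior: panic on unknown names to keep the sketch short.
        self.sorted_names
            .binary_search_by(|probe| (*probe).cmp(name))
            .expect("unknown field name") as u32
    }

    fn finish(&mut self) -> usize {
        // No-op: the metadata bytes already exist, so there is nothing to write.
        self.existing.len()
    }
}

/// The caller sees only the trait object, so `finish` has to live on the trait.
fn finish_via_parent(builder: &mut dyn MetadataBuilder) -> usize {
    builder.finish()
}
```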

Contributor Author

@scovich scovich Aug 23, 2025

A related issue we'll hit shortly: VariantArrayBuilder is currently hard-wired to use a BasicMetadataBuilder that produces a new metadata column.

Once we need a variant array builder that reuses an existing metadata column (e.g. for unshredding), we'll be forced to decide whether we want to make VariantArrayBuilder generic over M: MetadataBuilder or just define a second builder class.

I suspect that choice will ultimately decide the generic-vs-not question.

Contributor Author

I suppose we could also avoid generics altogether by defining our own versions of Any and AsAny that preserve lifetime info... but they would become part of the public arrow-rs API which seems awkward.

Contributor

> Honestly, the generic mess is messy enough that I lean strongly toward keeping the dyn indirection unless we're really certain we need generics for performance or functionality reasons.

I think this is a wise strategy. Let's wait for some more "end to end" type benchmarks (like reading/writing JSON to arrays, or shredding variants) and see if we need to try and squeeze a few more percent out

@alamb
Contributor

alamb commented Aug 23, 2025

(this PR also needs an update to resolve some merge conflicts)

@alamb
Contributor

alamb commented Aug 25, 2025

Thanks again @scovich

@alamb alamb merged commit a620957 into apache:main Aug 25, 2025
12 checks passed
@klion26
Member

klion26 commented Aug 26, 2025

Sorry for the late reply; I just got back yesterday and only now saw the notifications. I have no further comments.

Successfully merging this pull request may close these issues.

[Variant] Support creating Variants with pre-existing Metadata
3 participants