Conversation

jecsand838 (Contributor)

Which issue does this PR close?

This work continues arrow-avro schema resolution support and aligns behavior with the Avro spec.

  • Related to: #4886 (Add Avro Support): ongoing work to round out the reader/decoder, including schema resolution and type promotion.

Rationale for this change

arrow-avro lacked end-to-end support for Avro unions and Arrow Union schemas. Many Avro datasets rely on unions (e.g., ["null","string"] or tagged unions of different records), and without schema-level resolution and JSON encoding the crate could not interoperate cleanly. This PR brings union schema resolution to parity with the Avro spec (duplicate-branch and nested-union checks), adds Arrow-to-Avro union schema conversion (with mode/type-id metadata), and lays the groundwork for data decoding in a follow-up.

What changes are included in this PR?

Schema resolution & codecs

  • Add Codec::Union(Arc<[AvroDataType]>, UnionFields, UnionMode) and map it to Arrow DataType::Union.
  • Introduce ResolvedUnion and extend ResolutionInfo with a Union(...) variant to capture the writer-to-reader branch mapping (preferring direct matches over promotions).
  • Support union defaults: permit null defaults for unions whose first branch is null; reject empty unions for defaults.
  • Enforce Avro spec constraints during parsing/resolution:
    • Disallow nested unions.
    • Disallow duplicate branch kinds (except distinct named record/enum/fixed).
  • Keep the writer's null ordering when resolving nullable 2-branch unions (e.g., ["null", "int"] vs. ["int", "null"]).
  • Provide stable union field names derived from the branch kind (e.g., int, string, map, ...) and construct dense UnionFields consistently; see the sketch below.
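
For illustration, here is a minimal sketch of the mapping described above, using only the public arrow-schema API (the child nullability and the helper name are illustrative assumptions, not the crate's exact construction): a writer union of ["int", "string"] becomes a dense Arrow union whose children are named after their branch kinds.

use arrow_schema::{DataType, Field, UnionFields, UnionMode};

fn example_union_datatype() -> DataType {
    // Ordered Arrow type ids paired with children named after the Avro branch kinds.
    let fields = UnionFields::new(
        vec![0, 1],
        vec![
            Field::new("int", DataType::Int32, true),
            Field::new("string", DataType::Utf8, true),
        ],
    );
    DataType::Union(fields, UnionMode::Dense)
}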

Arrow-to-Avro schema conversion

  • Implement conversion of Arrow DataType::Union to Avro union JSON:
    • Persist Arrow union layout via metadata keys:
      • "arrowUnionMode": "dense" or "sparse".
      • "arrowUnionTypeIds": ordered list of Arrow type IDs.
    • Attach union-level metadata to the first non-null branch object (Avro JSON can't carry attributes on the union array); the JSON shape is illustrated after this list.
  • Persist additional Arrow metadata in Avro JSON:
    • "arrowBinaryView" for BinaryView.
    • "arrowListView" / "arrowLargeList" for list view types.
  • Reject invalid output shapes (e.g., a union branch that is itself an Avro union).
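
For example, here is a hedged sketch (built with serde_json; the exact branch-object shape and the presence of a "null" branch are assumptions, though the metadata key names are the ones listed above) of the JSON a dense two-child union might serialize to, with the union-level attributes carried on the first non-null branch:

use serde_json::{json, Value};

fn example_union_schema_json() -> Value {
    // The union itself is a bare JSON array, so the Arrow layout metadata
    // rides on the first non-null branch, expanded into object form.
    json!([
        "null",
        {
            "type": "int",
            "arrowUnionMode": "dense",
            "arrowUnionTypeIds": [0, 1]
        },
        "string"
    ])
}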

Reader/decoder stub

  • Return a clear error for union value decoding in RecordDecoder (schema support first; decoding to follow).

Refactors & utilities

  • Expose make_full_name within the crate for union branch keying; derive Hash for PrimitiveType; add helpers for branch de-duplication (sketched below).
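
As a rough sketch of the branch-keying idea (simplified: the crate uses its PrimitiveType enum and make_full_name, and the real signatures differ), duplicates are detected by hashing a per-branch key in which named types compare by fully qualified name:

use std::collections::HashSet;

// Simplified stand-in for the crate's branch key: primitives, array, and map
// clash by kind, while record/enum/fixed clash only when their full names match.
#[derive(Hash, PartialEq, Eq)]
enum BranchKey {
    Primitive(&'static str),
    Array,
    Map,
    Named(String),
}

fn has_duplicate_branch(keys: &[BranchKey]) -> bool {
    let mut seen = HashSet::new();
    keys.iter().any(|key| !seen.insert(key))
}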

Are these changes tested?

Yes. New unit tests cover:

  • Resolution across writer/reader unions and non‑unions (direct vs promoted matches, partial coverage).
  • Nullable‑union semantics (writer null ordering preserved).
  • Arrow Union to Avro union JSON including mode/type‑id metadata and branch shapes.
  • Validation errors for duplicates and nested unions.

Are there any user-facing changes?

N/A

@github-actions bot added the arrow (Changes to the arrow crate) and arrow-avro (arrow-avro crate) labels on Sep 15, 2025
@jecsand838 force-pushed the avro-reader-union-support-codec branch from 1331619 to 4c1cbcd on September 15, 2025 06:21
nullable_union_variants(reader_variants),
) {
-   (Some((_, write_nonnull)), Some((read_nb, read_nonnull))) => {
+   (Some((write_nb, write_nonnull)), Some((_read_nb, read_nonnull))) => {
jecsand838 (Contributor Author):

While regression testing unions, I realized that using the writer's null ordering made more sense and reduced complexity in union decoding. It also does not deviate from the Avro specification.

@jecsand838 force-pushed the avro-reader-union-support-codec branch from 4c1cbcd to db0adee on September 15, 2025 06:44
Comment on lines -807 to -832
impl Codec {
/// Converts a string codec to use Utf8View if requested
///
/// The conversion only happens if both:
/// 1. `use_utf8view` is true
/// 2. The codec is currently `Utf8`
///
/// # Example
/// ```
/// # use arrow_avro::codec::Codec;
/// let utf8_codec1 = Codec::Utf8;
/// let utf8_codec2 = Codec::Utf8;
///
/// // Convert to Utf8View
/// let view_codec = utf8_codec1.with_utf8view(true);
/// assert!(matches!(view_codec, Codec::Utf8View));
///
/// // Don't convert if use_utf8view is false
/// let unchanged_codec = utf8_codec2.with_utf8view(false);
/// assert!(matches!(unchanged_codec, Codec::Utf8));
/// ```
pub fn with_utf8view(self, use_utf8view: bool) -> Self {
if use_utf8view && matches!(self, Self::Utf8) {
Self::Utf8View
} else {
self
jecsand838 (Contributor Author):

I just moved this code to the other Codec implementation block to consolidate.

…s in Avro union validation

- Expanded `arrow-avro` to support resolving Arrow unions to Avro unions including validation against duplicate branch types, nullability handling, and prohibition of nested unions.
- Refactored `Codec`, schema resolution, and added new utility functions for union branch management.
- Introduced dense union field mapping, Arrow metadata persistence, and Avro union branch validation.
- Extended tests to cover union scenarios including dense mode, metadata persistence, and type promotion.
@nathaniel-d-ef (Contributor) left a comment:

This looks great to me, just one question on the inlining from previous work.

}
}

#[inline]
@nathaniel-d-ef (Contributor):

Are we omitting these optimizations until we can prove them out?

jecsand838 (Contributor Author):

@nathaniel-d-ef That's fair enough. I took out the #[inline].

@jecsand838 force-pushed the avro-reader-union-support-codec branch from 894e497 to cb1c7e0 on September 17, 2025 15:37
@jecsand838 (Contributor Author):

@alamb Let me know if you have time to get to this one. After this there's just one more to go. 😃

@alamb (Contributor) left a comment:

Thanks @jecsand838 -- as always I found the PR easy to read and understand, though given my limited avro knowledge I will not be able to pick up subtle avro issues.

The only thing I didn't see in this PR was "end to end" tests -- namely reading union data into arrow UnionArrays

Ideally we would also have round trip tests where we wrote a UnionArray to avro and then read it back again, ensuring the resulting array was the same.

@jecsand838 (Contributor Author):

Thanks @jecsand838 -- as always I found the PR easy to read and understand, though given my limited avro knowledge I will not be able to pick up subtle avro issues.

The only thing I didn't see in this PR was "end to end" tests -- namely reading union data into arrow UnionArrays

Ideally we would also have round trip tests where we wrote a UnionArray to avro and then read it back again, ensuring the resulting array was the same.

@alamb Absolutely!

So the end-to-end tests are in the part 2 PR, #8349. This PR only covers the codec and schema changes. The part 2 PR has the decoder updates and end-to-end tests, along with a new test file containing all possible Union type scenarios. I just couldn't find a cleaner way to break the work up.

Also, the round-trip tests will be in @nathaniel-d-ef's upcoming PR for Union type support in the Writer.

Let me know if you're okay with this breakdown.

@alamb commented Sep 18, 2025:

Let me know if you're okay with this breakdown.

Sounds good to me 🚀

@alamb merged commit aed2f3b into apache:main on Sep 18, 2025
23 checks passed
@scovich (Contributor) left a comment:

Oops... my review took too long...

Comment on lines +287 to +291
Codec::Union(encodings, _, _) if !encodings.is_empty()
&& matches!(encodings[0].codec(), Codec::Null) =>
{
Ok(AvroLiteral::Null)
}
@scovich (Contributor):

aside: that is some funky formatting, but I guess it's what fmt produced?

jecsand838 (Contributor Author):

Agreed, but that's what came out of fmt.

Comment on lines +427 to +433
if encodings.is_empty() {
return Err(ArrowError::SchemaError(
"Union with no branches cannot have a default".to_string(),
));
}
encodings[0].parse_default_literal(default_json)?
}
@scovich (Contributor):

nit

Suggested change:
let Some(default_encoding) = encodings.first() else {
return Err(ArrowError::SchemaError(
"Union with no branches cannot have a default".to_string(),
));
};
default_encoding.parse_default_literal(default_json)?
}

Comment on lines +1055 to +1066
Schema::Complex(ComplexType::Record(r)) => {
let (full, _) = make_full_name(r.name, r.namespace, enclosing_ns);
Some(UnionBranchKey::Named(full))
}
Schema::Complex(ComplexType::Enum(e)) => {
let (full, _) = make_full_name(e.name, e.namespace, enclosing_ns);
Some(UnionBranchKey::Named(full))
}
Schema::Complex(ComplexType::Fixed(f)) => {
let (full, _) = make_full_name(f.name, f.namespace, enclosing_ns);
Some(UnionBranchKey::Named(full))
}
@scovich (Contributor):

qq: Would it be cleaner -- or not -- to rearrange this match as follows?

let (name, namespace) = match s {
    Schema::TypeName(TypeName::Primitive(p))
    | Schema::Type(Type {
        r#type: TypeName::Primitive(p),
        ..
    }) => return Some(UnionBranchKey::Primitive(*p)),
    Schema::TypeName(TypeName::Ref(name))
    | Schema::Type(Type {
        r#type: TypeName::Ref(name),
        ..
    }) => (name, None),
    Schema::Complex(ComplexType::Array(_)) => return Some(UnionBranchKey::Array),
    Schema::Complex(ComplexType::Map(_)) => return Some(UnionBranchKey::Map),
    Schema::Complex(ComplexType::Record(r)) => (r.name, r.namespace),
    Schema::Complex(ComplexType::Enum(e)) => (e.name, e.namespace),
    Schema::Complex(ComplexType::Fixed(f)) => (f.name, f.namespace),
    Schema::Union(_) => return None,
};

let (full, _) = make_full_name(name, namespace, enclosing_ns);
Some(UnionBranchKey::Named(full))

jecsand838 (Contributor Author):

That's much cleaner. Ty for that suggestion.

branches: &'a [Schema<'a>],
enclosing_ns: Option<&'a str>,
) -> Option<String> {
let mut seen: HashSet<UnionBranchKey> = HashSet::with_capacity(branches.len());
@scovich (Contributor):

nit: type annotation shouldn't be necessary?

@jecsand838 (Contributor Author), Sep 18, 2025:

100%, I'm removing that.

Comment on lines +1349 to +1362
match (
nullable_union_variants(writer_variants.as_slice()),
nullable_union_variants(reader_variants.as_slice()),
) {
(Some((w_nb, w_nonnull)), Some((_r_nb, r_nonnull))) => {
let mut dt = self.make_data_type(w_nonnull, Some(r_nonnull), namespace)?;
dt.nullability = Some(w_nb);
Ok(dt)
}
_ => self.resolve_unions(
writer_variants.as_slice(),
reader_variants.as_slice(),
namespace,
),
@scovich (Contributor):

Suggested change:
let writer_variants = writer_variants.as_slice();
let reader_variants = reader_variants.as_slice();
match (
nullable_union_variants(writer_variants),
nullable_union_variants(reader_variants),
) {
(Some((w_nb, w_nonnull)), Some((_r_nb, r_nonnull))) => {
let mut dt = self.make_data_type(w_nonnull, Some(r_nonnull), namespace)?;
dt.nullability = Some(w_nb);
Ok(dt)
}
_ => self.resolve_unions(writer_variants, reader_variants, namespace),

Comment on lines +1366 to +1376
let mut writer_to_reader: Vec<Option<(usize, Promotion)>> =
Vec::with_capacity(writer_variants.len());
for writer in writer_variants {
match self.resolve_type(writer, reader_non_union, namespace) {
Ok(tmp) => writer_to_reader.push(Some((0usize, Self::coercion_from(&tmp)))),
Err(_) => writer_to_reader.push(None),
}
}
let mut dt = self.parse_type(reader_non_union, namespace)?;
dt.resolution = Some(ResolutionInfo::Union(ResolvedUnion {
writer_to_reader: Arc::from(writer_to_reader),
@scovich (Contributor):

nit

Suggested change:
let writer_to_reader = writer_variants.iter().filter_map(|writer| {
let tmp = self.resolve_type(writer, reader_non_union, namespace).ok()?;
Some((0usize, Self::coercion_from(&tmp))))
}));
let mut dt = self.parse_type(reader_non_union, namespace)?;
dt.resolution = Some(ResolutionInfo::Union(ResolvedUnion {
writer_to_reader: Arc::from(writer_to_reader.collect()),

jecsand838 (Contributor Author):

I'm getting type annotation errors from this:

type annotations needed [E0283]
cannot infer type of the type parameter `B` declared on the method `collect`

Have a slight variation that I'll push up in the PR. This is definitely cleaner though.
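
For reference, a variation along these lines (illustrative only; the code actually pushed to the PR may differ) keeps the per-branch Option so unmatched writer branches stay as None, and gives the collected vector an explicit type so the collect target can be inferred:

let writer_to_reader: Vec<Option<(usize, Promotion)>> = writer_variants
    .iter()
    .map(|writer| {
        self.resolve_type(writer, reader_non_union, namespace)
            .ok()
            .map(|tmp| (0usize, Self::coercion_from(&tmp)))
    })
    .collect();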

Comment on lines +1388 to +1393
if how == Promotion::Direct {
direct = Some((reader_index, how));
break; // first exact match wins
} else if promo.is_none() {
promo = Some((reader_index, how));
}
@scovich (Contributor):

Double checking intent -- Use the first-found promo, unless a direct match is found?

If so, I think we can use just the promo option for both:

if how == Promotion::Direct {
    promo = Some((reader_index, how));
    break; // first exact match wins
}
if promo.is_none() {
    // first promo wins, unless an exact match is found later
    promo = Some((reader_index, how));
}

and then

let Some((reader_index, promotion)) = promo else {
    return Err(ArrowError::SchemaError(...));
};

(again below)

Schema::Union(branches)
}

fn mk_record_named(name: &'static str) -> Schema<'static> {
@scovich (Contributor):

static lifetimes will pretty strongly constrain real-world usage... is there a reason it needs to be fixed? Why not just

fn mk_record_named<'a>(name: &'a str) -> Schema<'a>

Comment on lines +986 to +997
match null_order {
Nullability::NullFirst => {
let mut out = Vec::with_capacity(union.len() + 1);
out.push(null);
out.extend(union);
Value::Array(out)
}
Nullability::NullSecond => {
union.push(null);
Value::Array(union)
}
}
@scovich (Contributor):

Suggested change:
match null_order {
Nullability::NullFirst => union.insert(0, null),
Nullability::NullSecond => union.push(null),
}
Value::Array(union)

(I guess it should really be called Nullability::NullLast?)

jecsand838 (Contributor Author):

This is much cleaner.

(I guess it should really be called Nullability::NullLast?)

100% That change is coming. Just wanted to get a dedicated follow-up PR for it.

})?;
match t {
"record" | "enum" | "fixed" => {
let name = map.get("name").and_then(|v| v.as_str()).unwrap_or_default();
@scovich (Contributor):

What is the default &str, out of curiosity?

jecsand838 (Contributor Author):

You know what, I should be throwing an error here. This is out of spec, since "record" | "enum" | "fixed" are named types. Ty for pointing this out. I'll include it in the follow-up.
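
A rough sketch of what that follow-up check could look like (illustrative only; the actual fix landed separately and may differ): a missing or empty name on a named type becomes a schema error instead of silently defaulting to "".

let name = map
    .get("name")
    .and_then(|v| v.as_str())
    .filter(|s| !s.is_empty())
    .ok_or_else(|| {
        ArrowError::SchemaError(format!("missing name for named type `{t}`"))
    })?;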

@alamb commented Sep 18, 2025:

Oops... my review took too long...

Sorry -- hopefully @jecsand838 can address any needed comments as a follow on PR

@jecsand838 (Contributor Author):

@scovich

Oops... my review took too long...

I'll make another PR with these changes.

@jecsand838 deleted the avro-reader-union-support-codec branch on September 18, 2025 14:50
alamb pushed a commit that referenced this pull request Sep 19, 2025
# Which issue does this PR close?

This work continues arrow-avro schema resolution support and aligns
behavior with the Avro spec.

- **Related to**: #4886 (“Add Avro Support”): ongoing work to round out
the reader/decoder, including schema resolution and type promotion.
- **Follow-ups/Context**: #8348 (Add arrow-avro Reader support for Dense
Union and Union resolution (Part 1))

# Rationale for this change

@scovich left a really solid
[review](#8348 (review))
on #8348 that wasn't completed until after the PR was merged. This PR
addresses those suggestions and improves the code.

# What changes are included in this PR?

* Code quality improvements to `codec.rs`
* Improvements to `schema.rs`, including spec-compliant named-type errors.

# Are these changes tested?

1. No functionality was added / modified in `codec.rs` and all existing
tests are passing without changes.
2. Two new unit tests were added to `schema.rs` to cover the spec
compliant named type changes.

# Are there any user-facing changes?

N/A

---------

Co-authored-by: Ryan Johnson <[email protected]>