Commit aed2f3b

Add arrow-avro Reader support for Dense Union and Union resolution (Part 1) (#8348)
# Which issue does this PR close?

This work continues arrow-avro schema resolution support and aligns behavior with the Avro spec.

- **Related to**: #4886 (“Add Avro Support”): ongoing work to round out the reader/decoder, including schema resolution and type promotion.

# Rationale for this change

`arrow-avro` lacked end-to-end support for Avro unions and Arrow `Union` schemas. Many Avro datasets rely on unions (e.g., `["null","string"]`, or tagged unions of different records), and without schema-level resolution and JSON encoding the crate could not interoperate cleanly. This PR brings union schema resolution to parity with the Avro spec (duplicate-branch and nested-union checks), adds Arrow to Avro union schema conversion (with mode/type-id metadata), and lays groundwork for data decoding in a follow-up.

# What changes are included in this PR?

**Schema resolution & codecs**

- Add `Codec::Union(Arc<[AvroDataType]>, UnionFields, UnionMode)` and map it to Arrow `DataType::Union`.
- Introduce `ResolvedUnion` and extend `ResolutionInfo` with a `Union(...)` variant to capture the writer-to-reader branch mapping (prefers direct matches over promotions).
- Support union defaults: permit `null` defaults for unions whose **first** branch is `null`; reject empty unions for defaults.
- Enforce Avro spec constraints during parsing/resolution:
  - Disallow nested unions.
  - Disallow duplicate branch *kinds* (except distinct named `record`/`enum`/`fixed`).
- Keep **writer** null ordering when resolving nullable 2-branch unions (e.g., `["null", "int"]` vs `["int", "null"]`).
- Provide stable union field names derived from branch kind (e.g., `int`, `string`, `map`, ...) and construct dense `UnionFields` consistently.

**Arrow and Avro schema conversion**

- Implement Arrow `DataType::Union` to Avro union JSON:
  - Persist Arrow union layout via metadata keys:
    - `"arrowUnionMode"`: `"dense"` or `"sparse"`.
    - `"arrowUnionTypeIds"`: ordered list of Arrow type IDs.
  - Attach union-level metadata to the **first non-null** branch object (Avro JSON can’t carry attributes on the union array).
- Persist additional Arrow metadata in Avro JSON:
  - `"arrowBinaryView"` for `BinaryView`.
  - `"arrowListView"` / `"arrowLargeList"` for list view types.
- Reject invalid output shapes (e.g., a union branch that is itself an Avro union).

**Reader/decoder stub**

- Return a clear error for union **value** decoding in `RecordDecoder` (schema support first; decoding to follow).

**Refactors & utilities**

- Expose `make_full_name` within the crate for union branch keying; derive `Hash` for `PrimitiveType`; add helpers for branch de-duplication.

# Are these changes tested?

Yes. New unit tests cover:

- Resolution across writer/reader unions and non-unions (direct vs promoted matches, partial coverage).
- Nullable-union semantics (writer null ordering preserved).
- Arrow `Union` to Avro union JSON including mode/type-id metadata and branch shapes.
- Validation errors for duplicates and nested unions.

# Are there any user-facing changes?

N/A
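
For readers unfamiliar with the union rules mentioned under "Schema resolution & codecs", the following is a minimal, standalone sketch of the two ideas: rejecting nested unions and duplicate branch kinds, and mapping a writer branch to a reader branch while preferring a direct match over a promotion. It does not use the crate's types; `Schema`, `branch_key`, `validate_union`, and `resolve_branch` are hypothetical names invented for illustration, and the only promotion modelled is `int` to `long`.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified schema model (NOT the arrow-avro types).
#[derive(Debug, Clone, PartialEq)]
enum Schema {
    Null,
    Int,
    Long,
    Str,
    /// A named type (record/enum/fixed), identified by its full name.
    Named(String),
    Union(Vec<Schema>),
}

/// Key used for duplicate detection: unnamed types share one key per kind,
/// named types get one key per distinct full name.
fn branch_key(schema: &Schema) -> String {
    match schema {
        Schema::Null => "null".to_string(),
        Schema::Int => "int".to_string(),
        Schema::Long => "long".to_string(),
        Schema::Str => "string".to_string(),
        Schema::Named(full_name) => format!("named:{}", full_name),
        Schema::Union(_) => "union".to_string(),
    }
}

/// Reject nested unions and duplicate branch kinds.
fn validate_union(branches: &[Schema]) -> Result<(), String> {
    let mut seen = HashSet::new();
    for branch in branches {
        if matches!(branch, Schema::Union(_)) {
            return Err("nested unions are not allowed".to_string());
        }
        let key = branch_key(branch);
        if !seen.insert(key.clone()) {
            return Err(format!("duplicate union branch: {}", key));
        }
    }
    Ok(())
}

/// Map one writer branch to a reader branch index.
/// Returns `(index, promoted)`; a direct match wins over a promotion.
/// Assumption: only the int -> long promotion is modelled here.
fn resolve_branch(writer: &Schema, reader: &[Schema]) -> Option<(usize, bool)> {
    if let Some(i) = reader.iter().position(|r| r == writer) {
        return Some((i, false));
    }
    if *writer == Schema::Int {
        if let Some(i) = reader.iter().position(|r| *r == Schema::Long) {
            return Some((i, true));
        }
    }
    None
}
```

In this sketch, a writer `int` resolved against a reader union `["long", "int"]` selects index 1 as a direct match, even though the `long` branch appears first and would accept the value via promotion.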
1 parent 1f77ac5 commit aed2f3b

3 files changed: +635 -71 lines changed
