Commit aed2f3b
Add arrow-avro Reader support for Dense Union and Union resolution (Part 1) (#8348)
# Which issue does this PR close?
This work continues arrow-avro schema resolution support and aligns
behavior with the Avro spec.
- **Related to**: #4886 (“Add Avro Support”): ongoing work to round out
the reader/decoder, including schema resolution and type promotion.
# Rationale for this change
`arrow-avro` lacked end‑to‑end support for Avro unions and Arrow `Union`
schemas. Many Avro datasets rely on unions (e.g., `["null","string"]` or
tagged unions of different records), and without schema‐level resolution
and JSON encoding the crate could not interoperate cleanly. This PR
brings union schema resolution to parity with the Avro spec
(duplicate-branch and nested‑union checks), adds Arrow to Avro union
schema conversion (with mode/type‑id metadata), and lays groundwork for
data decoding in a follow‑up.
# What changes are included in this PR?
**Schema resolution & codecs**
- Add `Codec::Union(Arc<[AvroDataType]>, UnionFields, UnionMode)` and
map it to Arrow `DataType::Union`.
- Introduce `ResolvedUnion` and extend `ResolutionInfo` with a
`Union(...)` variant to capture writer to reader branch mapping (prefers
direct matches over promotions).
- Support union defaults: permit `null` defaults for unions whose
**first** branch is `null`; reject empty unions for defaults.
- Enforce Avro spec constraints during parsing/resolution:
  - Disallow nested unions.
  - Disallow duplicate branch *kinds* (except distinct named
    `record`/`enum`/`fixed`).
- Keep **writer** null ordering when resolving nullable 2‑branch unions
(e.g., `["null", "int"]` vs `["int", "null"]`).
- Provide stable union field names derived from branch kind (e.g.,
`int`, `string`, `map`, ...) and construct dense `UnionFields`
consistently.
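
The resolution rules above can be illustrated with a minimal, standalone sketch. It does not use the crate's actual types: `Branch`, `validate_union`, `null_default_allowed`, `promotes_to`, and `resolve` are hypothetical names introduced only for illustration, and the real `ResolvedUnion` / `ResolutionInfo` structures may look quite different. The sketch assumes the promotion table from the Avro spec and shows direct matches being preferred over promotions.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified stand-in for an Avro union branch.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
enum Branch {
    Null,
    Int,
    Long,
    Float,
    Double,
    String,
    Bytes,
    /// Named types are keyed by full name, so distinct names may coexist.
    Record(String),
    /// A union nested inside a union, which the Avro spec forbids.
    Union(Vec<Branch>),
}

/// Reject nested unions and duplicate branches.
fn validate_union(branches: &[Branch]) -> Result<(), String> {
    let mut seen = HashSet::new();
    for b in branches {
        if matches!(b, Branch::Union(_)) {
            return Err("nested unions are not allowed".to_string());
        }
        if !seen.insert(b) {
            return Err(format!("duplicate union branch: {b:?}"));
        }
    }
    Ok(())
}

/// A `null` default is only acceptable when the first branch is `null`;
/// an empty union cannot carry a default at all.
fn null_default_allowed(branches: &[Branch]) -> Result<bool, String> {
    match branches.first() {
        None => Err("cannot apply a default to an empty union".to_string()),
        Some(first) => Ok(*first == Branch::Null),
    }
}

/// Writer-to-reader promotions permitted by the Avro spec.
fn promotes_to(writer: &Branch, reader: &Branch) -> bool {
    matches!(
        (writer, reader),
        (Branch::Int, Branch::Long | Branch::Float | Branch::Double)
            | (Branch::Long, Branch::Float | Branch::Double)
            | (Branch::Float, Branch::Double)
            | (Branch::String, Branch::Bytes)
            | (Branch::Bytes, Branch::String)
    )
}

/// For each writer branch, return the matching reader branch index and
/// whether a promotion is required. A direct match always wins, even if
/// a promotable branch appears earlier in the reader union.
fn resolve(writer: &[Branch], reader: &[Branch]) -> Vec<Option<(usize, bool)>> {
    writer
        .iter()
        .map(|w| {
            reader
                .iter()
                .position(|r| r == w)
                .map(|i| (i, false))
                .or_else(|| {
                    reader
                        .iter()
                        .position(|r| promotes_to(w, r))
                        .map(|i| (i, true))
                })
        })
        .collect()
}
```

With a writer union of `[Int, String]` and a reader union of `[Long, Int, Null]`, `resolve` maps `Int` to reader index 1 (a direct match, even though `Long` at index 0 would accept a promotion) and leaves `String` unmapped, which corresponds to the "partial coverage" case mentioned in the tests.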
**Arrow and Avro schema conversion**
- Implement Arrow `DataType::Union` to Avro union JSON:
  - Persist Arrow union layout via metadata keys:
    - `"arrowUnionMode"`: `"dense"` or `"sparse"`.
    - `"arrowUnionTypeIds"`: ordered list of Arrow type IDs.
  - Attach union‑level metadata to the **first non‑null** branch object
    (Avro JSON can’t carry attributes on the union array).
- Persist additional Arrow metadata in Avro JSON:
  - `"arrowBinaryView"` for `BinaryView`.
  - `"arrowListView"` / `"arrowLargeList"` for list view types.
- Reject invalid output shapes (e.g., a union branch that is itself an
Avro union).
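
To make the metadata placement concrete, here is a rough sketch of how union-level attributes might end up on the first non-null branch. The helper `union_to_avro_json` is hypothetical and builds JSON text by hand for primitive branch names only; the crate's real conversion code is not shown in this PR description, and the exact encoding of the `"arrowUnionTypeIds"` value (a JSON array of integers here) is an assumption.

```rust
/// Hypothetical helper: render an Avro union of primitive branches as
/// JSON text, attaching the Arrow union metadata to the first non-null
/// branch, since a JSON array cannot carry attributes itself.
fn union_to_avro_json(branches: &[&str], mode: &str, type_ids: &[i8]) -> String {
    // Assumed encoding: type IDs as a JSON array of integers.
    let ids = type_ids
        .iter()
        .map(|id| id.to_string())
        .collect::<Vec<_>>()
        .join(",");

    let mut attached = false;
    let parts: Vec<String> = branches
        .iter()
        .map(|name| {
            if *name != "null" && !attached {
                attached = true;
                // Expand the branch to object form so it can hold attributes.
                format!(
                    "{{\"type\":\"{name}\",\"arrowUnionMode\":\"{mode}\",\"arrowUnionTypeIds\":[{ids}]}}"
                )
            } else {
                // Other branches stay in their short string form.
                format!("\"{name}\"")
            }
        })
        .collect();

    format!("[{}]", parts.join(","))
}
```

For branches `["null", "int", "string"]` in dense mode with type IDs `[0, 1, 2]`, this sketch leaves `"null"` and `"string"` as plain strings and expands only `int` into an object carrying both metadata keys.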
**Reader/decoder stub**
- Return a clear error for union **value** decoding in `RecordDecoder`
(schema support first; decoding to follow).
**Refactors & utilities**
- Expose `make_full_name` within the crate for union branch keying;
derive `Hash` for `PrimitiveType`; add helpers for branch
de‑duplication.
# Are these changes tested?
Yes. New unit tests cover:
- Resolution across writer/reader unions and non‑unions (direct vs
promoted matches, partial coverage).
- Nullable‑union semantics (writer null ordering preserved).
- Arrow `Union` to Avro union JSON including mode/type‑id metadata and
branch shapes.
- Validation errors for duplicates and nested unions.
# Are there any user-facing changes?
N/A

Parent commit: 1f77ac5