Implement arrow-avro SchemaStore and Fingerprinting To Enable Schema Resolution (#8006)
# Which issue does this PR close?
- Part of #4886
- Follow up to #7834
# Rationale for this change
Apache Avro’s [single object
encoding](https://avro.apache.org/docs/1.11.1/specification/#single-object-encoding)
prefixes every record with the marker `0xC3 0x01` followed by a `Rabin`
[schema fingerprint
](https://avro.apache.org/docs/1.11.1/specification/#schema-fingerprints)
so that readers can identify the correct writer schema without carrying
the full definition in each message.
While the current `arrow‑avro` implementation can read container files,
it cannot ingest these framed messages or handle streams where the
writer schema changes over time.
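
The framing described above can be sketched in a few lines. This is a minimal illustration based on the single-object encoding section of the Avro specification, not the code added by this PR; the function name `split_single_object` is hypothetical.

```rust
/// Two-byte marker that opens every Avro single-object-encoded message.
const SINGLE_OBJECT_MAGIC: [u8; 2] = [0xC3, 0x01];

/// Splits a framed message into `(fingerprint, payload)`.
/// Returns `None` when the buffer is too short or does not start with the marker.
/// Hypothetical helper for illustration only.
fn split_single_object(buf: &[u8]) -> Option<(u64, &[u8])> {
    // 2 marker bytes + 8 fingerprint bytes precede the Avro binary body.
    if buf.len() < 10 || !buf.starts_with(&SINGLE_OBJECT_MAGIC) {
        return None;
    }
    let mut fp = [0u8; 8];
    fp.copy_from_slice(&buf[2..10]);
    // The specification stores the fingerprint in little-endian byte order.
    Some((u64::from_le_bytes(fp), &buf[10..]))
}
```

The returned payload is a borrowed slice of the input, so identifying the writer schema does not require copying the record body.
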
The Avro specification recommends computing a 64‑bit CRC‑64‑AVRO (Rabin)
hashed fingerprint of the [parsed canonical form of a
schema](https://avro.apache.org/docs/1.11.1/specification/#parsing-canonical-form-for-schemas)
to look up the `Schema` from a local schema store or registry.
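
For reference, the CRC‑64‑AVRO algorithm from the specification can be transcribed to Rust roughly as follows. This is a sketch of the published algorithm, not the implementation in this PR; the names `fp_table` and `fingerprint64` are chosen here for illustration, and the input is assumed to already be the UTF‑8 bytes of a schema's parsing canonical form.

```rust
/// Fingerprint of the empty byte string; also the constant used to build the table.
const EMPTY: u64 = 0xc15d_213a_a4d7_a795;

/// Builds the 256-entry lookup table described in the Avro specification.
fn fp_table() -> [u64; 256] {
    let mut table = [0u64; 256];
    for (i, slot) in table.iter_mut().enumerate() {
        let mut fp = i as u64;
        for _ in 0..8 {
            // `(fp & 1).wrapping_neg()` is all ones when the low bit is set, else zero.
            fp = (fp >> 1) ^ (EMPTY & (fp & 1).wrapping_neg());
        }
        *slot = fp;
    }
    table
}

/// CRC-64-AVRO (Rabin) fingerprint of `bytes`.
/// The table is rebuilt on every call to keep the sketch short;
/// a real implementation would compute it once.
fn fingerprint64(bytes: &[u8]) -> u64 {
    let table = fp_table();
    bytes.iter().fold(EMPTY, |fp, &b| {
        (fp >> 8) ^ table[((fp ^ b as u64) & 0xff) as usize]
    })
}
```

Because the fingerprint is computed over the canonical form, two schema definitions that differ only in whitespace or non-essential attributes should map to the same 64‑bit value.
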
This PR introduces **`SchemaStore`** and **fingerprinting** to enable:
* **Zero‑copy schema identification** for decoding streaming Avro
messages published in single‑object format (e.g., via Kafka or Pulsar)
into Arrow.
* **Dynamic schema evolution** by laying the foundation to resolve
writer/reader schema differences on the fly.
**NOTE:** Schema Resolution support in `Codec` and `RecordDecoder` is
coming in the next PR.
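
To make the idea of a fingerprint-keyed store concrete, here is a toy sketch. It is not the `SchemaStore` type from this PR: `ToySchemaStore`, its methods, and the choice to return an error when two different schemas share a fingerprint are all assumptions made for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a schema store; not the arrow-avro type.
#[derive(Default)]
struct ToySchemaStore {
    /// Maps a 64-bit fingerprint to the schema's canonical-form JSON.
    schemas: HashMap<u64, String>,
}

impl ToySchemaStore {
    /// Registers a canonical-form schema under a caller-supplied fingerprint.
    /// Re-registering the same pair is a no-op; registering a different
    /// schema under an existing fingerprint is reported as an error
    /// (an assumption of this sketch).
    fn register(&mut self, fingerprint: u64, canonical: &str) -> Result<u64, String> {
        if let Some(existing) = self.schemas.get(&fingerprint) {
            return if existing.as_str() == canonical {
                Ok(fingerprint)
            } else {
                Err(format!("fingerprint collision for {:#018x}", fingerprint))
            };
        }
        self.schemas.insert(fingerprint, canonical.to_string());
        Ok(fingerprint)
    }

    /// Looks up the writer schema for a fingerprint read from a message prefix.
    fn lookup(&self, fingerprint: u64) -> Option<&str> {
        self.schemas.get(&fingerprint).map(String::as_str)
    }
}
```

A reader would register every writer schema it expects up front, then call `lookup` with the fingerprint extracted from each framed message.
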
# What changes are included in this PR?
| Area | Highlights |
| --- | --- |
| **`reader/mod.rs`** | Decoder now detects the `C3 01` prefix, extracts the fingerprint, looks up the writer schema in a `SchemaStore`, and switches to an LRU cached `RecordDecoder` without interrupting streaming; supports `static_store_mode` to skip the 2 byte peek for high‑throughput fixed‑schema pipelines. |
| **`ReaderBuilder`** | New builder configuration methods: `.with_writer_schema_store`, `.with_active_fingerprint`, `.with_static_store_mode`, `.with_reader_schema`, `.with_max_decoder_cache_size`, with rigorous validation to prevent misconfiguration. |
| **Unit tests** | New tests covering fingerprint generation, store registration/lookup, schema switching, unknown‑fingerprint errors, and interaction with UTF8‑view decoding. |
| **Docs & Examples** | Extensive inline docs with examples on all new public methods / structs. |
---
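
The per-message control flow described for `reader/mod.rs` can be approximated with the toy router below. It is a sketch under stated assumptions, not the decoder from this PR: `ToyRouter` and `route` are hypothetical names, the LRU cache of `RecordDecoder`s is not modeled, and passing unframed input through to the active schema is an assumption of this sketch.

```rust
use std::collections::HashSet;

const MAGIC: [u8; 2] = [0xC3, 0x01];

/// Hypothetical router; the real decoder's types and behavior may differ.
struct ToyRouter {
    known: HashSet<u64>, // fingerprints registered in the store
    active: u64,         // fingerprint of the schema currently in use
    static_mode: bool,   // when true, never inspect the prefix
}

impl ToyRouter {
    /// Returns the fingerprint to decode with and the bytes to hand to that decoder.
    fn route<'a>(&mut self, msg: &'a [u8]) -> Result<(u64, &'a [u8]), String> {
        // Static mode skips the 2-byte peek, so any prefix is treated as payload.
        // Assumption: unframed input also falls through to the active schema.
        if self.static_mode || !msg.starts_with(&MAGIC) {
            return Ok((self.active, msg));
        }
        let raw = msg
            .get(2..10)
            .ok_or_else(|| "truncated fingerprint".to_string())?;
        let mut fp_bytes = [0u8; 8];
        fp_bytes.copy_from_slice(raw);
        let fp = u64::from_le_bytes(fp_bytes);
        if !self.known.contains(&fp) {
            return Err(format!("unknown fingerprint {:#018x}", fp));
        }
        // Switch schemas for this and subsequent messages.
        self.active = fp;
        Ok((fp, &msg[10..]))
    }
}
```

In this sketch an unknown fingerprint surfaces as an error rather than a silent fallback, which matches the unknown‑fingerprint error case listed in the table above.
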
# Are these changes tested?
Yes. New tests cover:
1. **Fingerprinting** against the canonical examples from the Avro spec.
2. **`SchemaStore` behavior**: deduplication, duplicate registration, and
lookup.
3. **Decoder fast‑path** with `static_store_mode=true`, ensuring the
prefix is treated as payload, the 2‑byte peek is skipped, and no schema
switch is attempted.
# Are there any user-facing changes?
N/A
# Follow-Up PRs
1. Implement Schema Resolution Functionality in Codec and RecordDecoder
2. Add ID `Fingerprint` variant on `SchemaStore` for Confluent Schema
Registry compatibility
3. Improve arrow-avro errors + add more benchmarks & examples to prepare
for public release
---------
Co-authored-by: Ryan Johnson <[email protected]>