Skip to content

Conversation

klion26
Copy link
Member

@klion26 klion26 commented Sep 30, 2025

Which issue does this PR close?

What changes are included in this PR?

  • Add a macro_rule define_variant_to_primitive_builder used to construct the variant to x arrow row builder
  • implement VariantToBooleanArrowRowBuilder/VariantToPrimitiveArrowRowBuilder/VariantToTimestampArrowRowBuilder using the macro define_variant_to_primitive_builder
  • Add type access for Variant::Timestamp/Time(timestamp will automatic widen micros to nanos)
  • Add tests to cover Variant::{int8/float32/float64} for existing code

Are these changes tested?

Added tests

Are there any user-facing changes?

If there are user-facing changes then we may require documentation to be updated before approving the PR.

If there are any breaking changes to public APIs, please call them out.

@github-actions github-actions bot added the parquet-variant parquet-variant* crates label Sep 30, 2025
@klion26
Copy link
Member Author

klion26 commented Sep 30, 2025

@alamb @scovich Please help review this when you're free, thanks.

Copy link
Contributor

@scovich scovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for tackling this issue -- it's turning out to be more involved than I would have guessed. See in particular my comments about using a macro to define these builders, which should bring flexibility without boilerplate.

Note: I'm happy to help with macro-related things, as guided by your comfort level with macro magic. Just let me know how you'd prefer to approach this.

Comment on lines 115 to 117
Variant::TimestampNanos(dt) => dt.timestamp_nanos_opt(),
Variant::TimestampNtzNanos(ndt) => ndt.and_utc().timestamp_nanos_opt(),
_ => None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could also widen micros to nanos here, if we wanted?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we can widen micros to nanos if wanted. will add this.

Comment on lines 124 to 127
// The number of days from 0001-01-01 to 1970-01-01.
const DAYS_FROM_CE_TO_UNIX_EPOCH: i32 = 719163;
self.as_naive_date()
.map(|d| d.num_days_from_ce() - DAYS_FROM_CE_TO_UNIX_EPOCH)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we just use Date32Type::to_naive_date?

It has unwrap calls internally, but I'm fairly certain it cannot actually panic because i32 range isn't big enough to overflow anything.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will change to use it.

Did not notice this function. Choosing the current solution seems to have a better performance(vs a handle write logic like Date32Type::from_naive_date). Using the existing function is better for maintenance, and we can improve the function if needed.

Float16(b) => b.append_value(value),
Float32(b) => b.append_value(value),
Float64(b) => b.append_value(value),
Boolean(b) => b.append_value(value),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Why not ordered like the others?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, good catch, will change it

Comment on lines 215 to 221
let target_type = DataType::Timestamp(TimeUnit::Microsecond, tz.clone());

TimestampMicro(VariantToPrimitiveArrowRowBuilder::new_with_target_type(
cast_options,
capacity,
Some(target_type),
))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think it's actually smaller code (fewer lines and shorter lines) to not factor out target_type?

Suggested change
let target_type = DataType::Timestamp(TimeUnit::Microsecond, tz.clone());
TimestampMicro(VariantToPrimitiveArrowRowBuilder::new_with_target_type(
cast_options,
capacity,
Some(target_type),
))
TimestampMicro(VariantToPrimitiveArrowRowBuilder::new_with_target_type(
cast_options,
capacity,
Some(DataType::Timestamp(TimeUnit::Microsecond, tz.clone())),
))

(again below)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... but actually, isn't this just:

Suggested change
let target_type = DataType::Timestamp(TimeUnit::Microsecond, tz.clone());
TimestampMicro(VariantToPrimitiveArrowRowBuilder::new_with_target_type(
cast_options,
capacity,
Some(target_type),
))
TimestampMicro(VariantToPrimitiveArrowRowBuilder::new_with_target_type(
cast_options,
capacity,
Some(data_type.clone())
))

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed it

Comment on lines 447 to 452
if let Some(target_type) = self.target_data_type {
let data = array.into_data();
let new_data = data.into_builder().data_type(target_type).build()?;
let array_with_new_type = PrimitiveArray::<T>::from(new_data);
return Ok(Arc::new(array_with_new_type));
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I understand why we need this juggling of data types, when all the temporal types impl ArrowPrimitiveType?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh! It's reinstating timezone info that would otherwise be lost

Comment on lines 445 to 452
let array: PrimitiveArray<T> = self.builder.finish();

if let Some(target_type) = self.target_data_type {
let data = array.into_data();
let new_data = data.into_builder().data_type(target_type).build()?;
let array_with_new_type = PrimitiveArray::<T>::from(new_data);
return Ok(Arc::new(array_with_new_type));
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
let array: PrimitiveArray<T> = self.builder.finish();
if let Some(target_type) = self.target_data_type {
let data = array.into_data();
let new_data = data.into_builder().data_type(target_type).build()?;
let array_with_new_type = PrimitiveArray::<T>::from(new_data);
return Ok(Arc::new(array_with_new_type));
}
let mut array = self.builder.finish();
if let Some(target_type) = self.target_data_type {
let data = array.into_data();
let new_data = data.into_builder().data_type(target_type).build()?;
array = new_data.into();
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix it

Comment on lines 402 to 406
fn new_with_target_type(
cast_options: &'a CastOptions<'a>,
capacity: usize,
target_data_type: Option<DataType>,
) -> Self {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be safer/cleaner to limit this to timestamp types only?

impl<'a, T: ArrowTimestampType> VariantToPrimitiveArrowRowBuilder<'a, T> {
    fn with_target_type(mut self, data_type: DataType) -> Self {
        self.target_data_type = Some(data_type)
        self
    }
}

and then e.g.

        DataType::Timestamp(TimeUnit::Microsecond, None) => TimestampMicro(
            VariantToPrimitiveArrowRowBuilder::new(cast_options, capacity)
        ),
        DataType::Timestamp(TimeUnit::Microsecond, _) => TimestampMicro(
            VariantToPrimitiveArrowRowBuilder::new(cast_options, capacity)
                .with_target_type(data_type.clone()) // preserve timezone info
        ),

(If the array -> data -> builder -> data -> array conversion is super cheap, then we don't need the first match arm above)

I don't know a good way to factor out or isolate the other c ode paths, but at least this would prevent them from activating for any unexpected types?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's better than the current solution, will improve it.

Seems ArrayDataBuiler::build may consume some time, will add a separate match arm for this

ArrayDataBuilder::build(self) -> Result<ArrayData, ArrowError> {
  ...
        if align_buffers {
            data.align_buffers();
        }

        // SAFETY: `skip_validation` is only set to true using `unsafe` APIs
        if !skip_validation.get() || cfg!(feature = "force_validate") {
            data.validate_data()?;
        }
...
}

Comment on lines 340 to 345
"arrow_array::types::TimestampMicrosecondType" => "Timestamp(Microsecond)",
"arrow_array::types::TimestampNanosecondType" => "Timestamp(Nanosecond)",
_ => "Unknown",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing the boolean case?

But actually, I suspect we can get just rid of get_type_name, relying on T::DATA_TYPE and the shiny new impl Display for DataType instead?

(that cleanup would probably be a separate PR tho)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, the function get_type_name is only used in VariantToPrimitiveArrowRowBuilder, and Boolean is another separate builder, so I didn't add boolean here.

Will follow the cleanup logic.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created an issue(#8538) to track the cleanup logic

Comment on lines 348 to 356
pub(crate) struct VariantToBooleanArrowRowBuilder<'a> {
builder: arrow::array::BooleanBuilder,
cast_options: &'a CastOptions<'a>,
}

impl<'a> VariantToBooleanArrowRowBuilder<'a> {
fn new(cast_options: &'a CastOptions<'a>, capacity: usize) -> Self {
Self {
builder: arrow::array::BooleanBuilder::with_capacity(capacity),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is virtually identical to the primitive row builder... AFAICT the only syntactic differences are:

  • The generics (or lack there of)
  • The array builder's type name
  • The method invoked to convert a variant value to the target type

I almost wonder if it would be worthwhile to define a macro that can cover both, similar to define_row_builder! macro in arrow_to_variant.rs?

  • Arrow decimal data types are not primitive and will eventually need similar treatment?
  • Could more easily define temporal builders with the necessary logic without "polluting" normal primitive builders (tho the custom finisher logic would be add extra complexity to an already complex macro, so maybe not worth it)

Invoking such a macro might look like:

define_row_builder!(
    struct VariantToBooleanArrowRowBuilder<'a> {
        builder: BooleanBuilder,
    },
    |v| v.as_boolean()
)

and

define_row_builder!(
    struct VariantToPrimitiveArrowRowBuilder<'a, T: ArrowPrimitiveType>
    where
        for<'m, 'v> Variant<'m, 'v>: VariantAsPrimitive<T>,
    {
        builder: PrimitiveBuilder<T>,
    },
    |v| v.as_primitive()
)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, this little exploration exposed a bad (IMO) trait design, I made a quick PR to fix it:

After that fix merges, the macro should be vastly easier to define -- basically copy+paste the definition from the other module and tweak it as needed.

And then you can use the macro to define not only timestamp and boolean builders, but also builders for the other temporal and decimal types.

We would still need to figure out the best way to handle that time zone issue, tho.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We would still need to figure out the best way to handle that time zone issue, tho.

Oh... PrimitiveBuilder::with_timezone_opt looks rather ideal!

In terms of the other macro, the |array| "lambda" arg would change to something like this for normal primitives:

|capacity| -> PrimitiveArrayBuilder<T>  { PrimitiveArrayBuilder::with_capacity(capacity) }

and this for the timestamp types:

|capacity, tz: Option<Arc<str>>| -> PrimitiveArrayBuilder<T>  { 
    PrimitiveArrayBuilder::with_capacity(capacity).with_timezone_opt(tz)
}

... which "merely" requires teaching the "lambda" to accept additional optional args.

Thoughts?

Copy link
Contributor

@scovich scovich Sep 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Putting it all together, boolean would be:

define_row_builder!(
    struct VariantToBooleanArrowRowBuilder<'a>,
    |capacity| -> BooleanArrayBuilder { BooleanArrayBuilder::with_capacity(capacity) },
    |value| value.as_boolean()
)

while "simple" primitives would be:

define_row_builder!(
    struct VariantToPrimitiveArrowRowBuilder<'a, T: PrimitiveFromVariant>,
    |capacity| -> PrimitiveArrayBuilder<T> { PrimitiveArrayBuilder::with_capacity(capacity) },
    |value| T::from_variant(value)
)

... and timestamps would be:

define_row_builder!(
    struct VariantToTimestampArrowRowBuilder<'a, T: ArrowTimestampType>,
    |capacity, tz: Option<Arc<str>>| -> PrimitiveArrayBuilder<T>  { 
        PrimitiveArrayBuilder::with_capacity(capacity).with_timezone_opt(tz)
    },
    |value| value.as_int64()
)

NOTE: timestamp without timezone technically has the choice whether to use VariantToPrimitiveArrowRowBuilder or VariantToTimestampArrowRowBuilder -- both are defined and should produce the same result.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks very much for the detailed description; it is definitely much better than the current solution. will try this way.

I've tried to make VariantToPrimitiveArrowRowBuilder generic by extracting the value.as_primitive in append_value to some converter so that boolean just needs to implement a converter(klion26@7533c7e), but stick with the current version as it did not seem much better.

alamb pushed a commit that referenced this pull request Sep 30, 2025
…8519)

# Which issue does this PR close?

- Relates to #8515
- Relates to #8516

# Rationale for this change

The existing `VariantAsPrimitive` trait is kind of "backward" and
requires very complex type bounds to use, e.g.:
```rust
impl<'a, T> VariantToPrimitiveArrowRowBuilder<'a, T>
where
    T: ArrowPrimitiveType,
    for<'m, 'v> Variant<'m, 'v>: VariantAsPrimitive<T>,
```

This is a code smell.

# What changes are included in this PR?

"Reverse" the trait -- instead of extending `Variant`, the trait extends
`T: ArrowPrimitiveType`. The resulting type bounds are much more
intuitive:
```rust
impl<'a, T: PrimitiveFromVariant> VariantToPrimitiveArrowRowBuilder<'a, T>
```

# Are these changes tested?

Existing unit tests cover this refactor.

# Are there any user-facing changes?

No.
Copy link
Member Author

@klion26 klion26 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@scovich Thank you very much for the detailed review. I'll try implementing the macro (reference to define_row_builder in arrow_to_variant.rs) first and will ask for your help if I can't do it. Thank you again!

Comment on lines 115 to 117
Variant::TimestampNanos(dt) => dt.timestamp_nanos_opt(),
Variant::TimestampNtzNanos(ndt) => ndt.and_utc().timestamp_nanos_opt(),
_ => None,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we can widen micros to nanos if wanted. will add this.

Comment on lines 124 to 127
// The number of days from 0001-01-01 to 1970-01-01.
const DAYS_FROM_CE_TO_UNIX_EPOCH: i32 = 719163;
self.as_naive_date()
.map(|d| d.num_days_from_ce() - DAYS_FROM_CE_TO_UNIX_EPOCH)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will change to use it.

Did not notice this function. Choosing the current solution seems to have a better performance(vs a handle write logic like Date32Type::from_naive_date). Using the existing function is better for maintenance, and we can improve the function if needed.

Float16(b) => b.append_value(value),
Float32(b) => b.append_value(value),
Float64(b) => b.append_value(value),
Boolean(b) => b.append_value(value),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, good catch, will change it

Comment on lines 340 to 345
"arrow_array::types::TimestampMicrosecondType" => "Timestamp(Microsecond)",
"arrow_array::types::TimestampNanosecondType" => "Timestamp(Nanosecond)",
_ => "Unknown",
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, the function get_type_name is only used in VariantToPrimitiveArrowRowBuilder, and Boolean is another separate builder, so I didn't add boolean here.

Will follow the cleanup logic.

Comment on lines 348 to 356
pub(crate) struct VariantToBooleanArrowRowBuilder<'a> {
builder: arrow::array::BooleanBuilder,
cast_options: &'a CastOptions<'a>,
}

impl<'a> VariantToBooleanArrowRowBuilder<'a> {
fn new(cast_options: &'a CastOptions<'a>, capacity: usize) -> Self {
Self {
builder: arrow::array::BooleanBuilder::with_capacity(capacity),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks very much for the detailed description; it is definitely much better than the current solution. will try this way.

I've tried to make VariantToPrimitiveArrowRowBuilder generic by extracting the value.as_primitive in append_value to some converter so that boolean just needs to implement a converter(klion26@7533c7e), but stick with the current version as it did not seem much better.

Comment on lines 402 to 406
fn new_with_target_type(
cast_options: &'a CastOptions<'a>,
capacity: usize,
target_data_type: Option<DataType>,
) -> Self {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's better than the current solution, will improve it.

Seems ArrayDataBuiler::build may consume some time, will add a separate match arm for this

ArrayDataBuilder::build(self) -> Result<ArrayData, ArrowError> {
  ...
        if align_buffers {
            data.align_buffers();
        }

        // SAFETY: `skip_validation` is only set to true using `unsafe` APIs
        if !skip_validation.get() || cfg!(feature = "force_validate") {
            data.validate_data()?;
        }
...
}

Comment on lines 445 to 452
let array: PrimitiveArray<T> = self.builder.finish();

if let Some(target_type) = self.target_data_type {
let data = array.into_data();
let new_data = data.into_builder().data_type(target_type).build()?;
let array_with_new_type = PrimitiveArray::<T>::from(new_data);
return Ok(Arc::new(array_with_new_type));
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix it

Comment on lines 215 to 221
let target_type = DataType::Timestamp(TimeUnit::Microsecond, tz.clone());

TimestampMicro(VariantToPrimitiveArrowRowBuilder::new_with_target_type(
cast_options,
capacity,
Some(target_type),
))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed it

@klion26 klion26 force-pushed the typed_primitive branch 2 times, most recently from 7e7e843 to a003ed0 Compare October 2, 2025 13:42
Copy link
Member Author

@klion26 klion26 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@scovich I've updated the code, please take another look, thanks.

type_name: get_type_name::<T>()
);

define_variant_to_primitive_builder!(
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The type of T needs to be ArrowTimestampType because the with_timezone_opt function only exists for ArrowTimestampType

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure we don't actually need a trait for that, and instead can do:

struct VariantToTimestampArrowRowBuilder<'a, T:PrimitiveFromVariant + ArrowTimestampType>

PrimitiveFromVariant provides from_variant, and ArrowTimestampType provides with_timezone_opt.

However, the rust macro system is pretty bad at trait bounds involving multiple types, so we'd either need to add where support to the macro (similar to the other macro):

struct VariantToTimestampArrowRowBuilder<'a, T:PrimitiveFromVariant>
where T: ArrowTimestampType,

or we'd keep TimestampFromVariant as a simple marker trait:

trait TimestampFromVariant: PrimitiveFromVariant + ArrowTimestampType {}

(where support in the macro is cleaner IMO)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually it looks like we do need the TimestampFromVariant trait, in order to handle UTC vs. NTZ timestamps. See other comment.

Comment on lines 340 to 345
"arrow_array::types::TimestampMicrosecondType" => "Timestamp(Microsecond)",
"arrow_array::types::TimestampNanosecondType" => "Timestamp(Nanosecond)",
_ => "Unknown",
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created an issue(#8538) to track the cleanup logic

Copy link
Contributor

@scovich scovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking really nice!
I think it can be simplified even further?

fn from_variant(variant: &Variant<'_, '_>) -> Option<Self::Native> {
match variant {
$(
$variant_pattern => $conversion,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting... I thought it wasn't possible for a macro to create match arms.
But I guess it works here because the macro creates the whole match?

Copy link
Member Author

@klion26 klion26 Oct 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't notice this before, but I did add _ => None at the end

Comment on lines 95 to 105
impl_timestamp_from_variant!(TimestampMicrosecondType, {
(Variant::TimestampMicros(t), Some(t.timestamp_micros())),
(Variant::TimestampNtzMicros(t), Some(t.and_utc().timestamp_micros())),
});

impl_timestamp_from_variant!(TimestampNanosecondType, {
(Variant::TimestampMicros(t), Some(t.timestamp_micros()).map(|t| t * 1000)),
(Variant::TimestampNtzMicros(t), Some(t.and_utc().timestamp_micros()).map(|t| t * 1000)),
(Variant::TimestampNanos(t), t.timestamp_nanos_opt()),
(Variant::TimestampNtzNanos(t), t.and_utc().timestamp_nanos_opt()),
});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't these be Variant::as_timestamp[ntz][micros|nanos] methods?
Why defined here instead?

(defining it here means we can't just use impl_primitive_from_variant! macro)

Copy link
Member Author

@klion26 klion26 Oct 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added as_timestamp_micors/nanos in Variant.rs in a separate commit, and simply the impl_timestamp_from_variant for timestamp, so we can keep it or drop it easily.

I didn't add as_timestamp_micors/nonos in there because all of the as_xx functions in Variant.rs are used to extract the direct value(NaiveDateTime/DateTime for Variant::Timestamp) in the Variant currently.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah this is tricky, I see what you mean.

Stepping back a bit:

  • NaiveDateTime methods seem to be deprecated in favor of DateTime<Utc>
  • The NTZ types only pass NaiveDateTime as a marker (they all have to call and_utc() to produce a DateTime<Utc> before doing anything else)
  • We can safely widen micros to nanos
  • We can "safely" convert a TZ type to an NTZ type (discards the timezone info) but not the other way around (wouldn't know which TZ to infer). However, it is a narrowing conversion which the Variant::as_xxx methods usually try to avoid. Maybe we shouldn't allow TZ <-> NTZ conversions at all?
  • But arrow doesn't distinguish physically between TZ and NTZ (that information is embedded in the DataType enum variants, not e.g. TimestampMicrosecondType), so it needs a from_variant implementation that works for both.
  • ArrowTimestampType::make_value looks quite useful, and it takes a NaiveDateTime as input. So maybe the correct approach will be to add Variant::as_timestamp[_ntz]_[micros|nanos] methods, which respectively return DateTime<Utc> and NaiveDateTime, and then make TimestampFromVariant generic over const NTZ: bool so we can actually have all four implementations we need.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can "safely" convert a TZ type to an NTZ type

No, maybe we can't do this, this will lead to the wrong result. The timestamp(the long value) for tz was calculated between the time with 1970-01-01 00:00:00 at +00:00, and NTZ was calculated between the time with 1970-01-01 00:00:00 in the local timezone.

But arrow doesn't distinguish physically between TZ and NTZ

IIUC, we don't need to distinguish these two when physically storing the value; they both are the timestamp between now and some time point (1970-01-01 00:00:00 at +00:00 for TZ, and 1970-01-01 00:00:00 in the local timezone` for NTZ )

So maybe the correct approach will be to add Variant::as_timestamp[ntz][micros|nanos] methods,

Separate the tz and ntz version Variant::as_timestamp[ntz][micro|nanos] that returns DateTime<Utc> and NaiveDateTime seems a better idea here.

Comment on lines 79 to 81
($timestamp_type:ty, {
$(($variant_pattern:pat, $conversion:expr)),+ $(,)?
}) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure this works?

Suggested change
($timestamp_type:ty, {
$(($variant_pattern:pat, $conversion:expr)),+ $(,)?
}) => {
(
$timestamp_type:ty,
$( $variant_pattern:pat => $conversion:expr ),+ $(,)?
) => {

Use sites would look like this:

impl_timestamp_from_variant!(
    TimestampMicrosecondType,
    Variant::TimestampMicros(t) => Some(t.timestamp_micros()),
    Variant::TimestampNtzMicros(t) => Some(t.and_utc().timestamp_micros()),
);

Comment on lines 348 to 350
$array_param: usize
// add this so that $init_expr can use it
$(, $field: $field_type)?) -> Self {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know nobody actually sees the emitted code, but it seems nicer to do normal commas here?

Suggested change
$array_param: usize
// add this so that $init_expr can use it
$(, $field: $field_type)?) -> Self {
$array_param: usize,
// add this so that $init_expr can use it
$( $field: $field_type, )?
) -> Self {

@klion26 klion26 requested a review from scovich October 3, 2025 06:28
@klion26
Copy link
Member Author

klion26 commented Oct 3, 2025

@scovich thanks for the review, I've updated the code, please take another look, thanks.

Copy link
Contributor

@scovich scovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, timezones are so tricky and annoying sometimes...

type_name: get_type_name::<T>()
);

define_variant_to_primitive_builder!(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure we don't actually need a trait for that, and instead can do:

struct VariantToTimestampArrowRowBuilder<'a, T:PrimitiveFromVariant + ArrowTimestampType>

PrimitiveFromVariant provides from_variant, and ArrowTimestampType provides with_timezone_opt.

However, the rust macro system is pretty bad at trait bounds involving multiple types, so we'd either need to add where support to the macro (similar to the other macro):

struct VariantToTimestampArrowRowBuilder<'a, T:PrimitiveFromVariant>
where T: ArrowTimestampType,

or we'd keep TimestampFromVariant as a simple marker trait:

trait TimestampFromVariant: PrimitiveFromVariant + ArrowTimestampType {}

(where support in the macro is cleaner IMO)

Comment on lines 58 to 67
};
($arrow_type:ty, $( $variant_type:pat => $variant_method:ident, $cast_fn:expr ),+ $(,)?) => {
impl TimestampFromVariant for $arrow_type {
fn from_variant(variant: &Variant<'_, '_>) -> Option<Self::Native> {
match variant {
$(
$variant_type => variant.$variant_method().map($cast_fn),
)+
_ => None
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
};
($arrow_type:ty, $( $variant_type:pat => $variant_method:ident, $cast_fn:expr ),+ $(,)?) => {
impl TimestampFromVariant for $arrow_type {
fn from_variant(variant: &Variant<'_, '_>) -> Option<Self::Native> {
match variant {
$(
$variant_type => variant.$variant_method().map($cast_fn),
)+
_ => None
}
};
}
macro_rules! impl_timestamp_from_variant {
($timestamp_type:ty, $variant_method:ident, ntz=$ntz) => {
impl TimestampFromVariant<$ntz> for $timestamp_type {
fn from_variant(variant: &Variant<'_, '_>) -> Option<Self::Native> {
variant.$variant_method().map(Self::make_value)

where

pub(crate) trait TimestampFromVariant<const NTZ: bool>: ArrowTimestampType {

and

/// Extension trait that lets `ArrowTimestampType` handle `DateTime<Utc>` like `NaiveDateTime`
trait MakeValueTz: ArrowTimestampType {
    fn make_value(timestamp: DateTime<Utc>) -> Option<i64> {
        Self::make_value(*timestamp_type.naive_utc())
    }
}
impl<T: ArrowTimestampType> MakeValueTz for T {}

See #8516 (comment) for context -- this would take a generic const bool param, and would no longer need the pattern matching because each impl could call the appropriate Variant::as_xxx method, e.g.:

impl_timestamp_from_variant!(
    datatypes::TimestampMicrosecondType, 
    as_timestamp_ntz_micros, 
    ntz=true
);
impl_timestamp_from_variant!(
    datatypes::TimestampMicrosecondType, 
    as_timestamp_micros, 
    ntz=false
);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The above approach has one big annoyance, tho -- it would require defining two timestamp row builder types, one timezone-aware and the other not (= four enum variants). Where the current code only needs one (= two enum variants).

define_variant_to_primitive_builder
    struct VariantToTimestampArrowRowBuilder<'a, T: TimestampFromVariant<false>>
    |capacity, tz: Arc<str> | -> PrimitiveBuilder<T> {
        PrimitiveBuilder::<T>::with_capacity(capacity).with_timezone(tz)
    },
    |value| T::from_variant(value),
    type_name: get_type_name::<T>()
);

define_variant_to_primitive_builder
    struct VariantToTimestampNtzArrowRowBuilder<'a, T: TimestampFromVariant<true>>
    |capacity| -> PrimitiveBuilder<T> { PrimitiveBuilder::<T>::with_capacity(capacity) },
    |value| T::from_variant(value),
    type_name: get_type_name::<T>()
);

But the current code also allows invalid conversions, such as interpreting an NTZ timestamp as UTC, because the current as_timestamp_xxx methods are too narrow of a waist and lose information.

@alamb what's your take? Am I overthinking this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I'll try with this way.

But the current code also allows invalid conversions, such as interpreting an NTZ timestamp as UTC, because the current as_timestamp_xxx methods are too narrow of a waist and lose information.

Does this mean the as_timestamp_xx itself or the end-to-end of the variant to arrow here? If it's the former, yes, it may be wrong (or maybe we can treat the return value as the physically stored value), if it's the latter, we'll attach the timezone info when initializing the builder

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, we can't treat the return value as the physically stored value, the Enum Variant contains TimestampMicros/TimestampNtzMicros/TimestampNanos/TimestampNtzNanos, they contains NaiveDatetime/DateTime<Utc>, if we return the underlying pyysically stored value, it will confuse the caller of these functions.

Comment on lines 95 to 105
impl_timestamp_from_variant!(TimestampMicrosecondType, {
(Variant::TimestampMicros(t), Some(t.timestamp_micros())),
(Variant::TimestampNtzMicros(t), Some(t.and_utc().timestamp_micros())),
});

impl_timestamp_from_variant!(TimestampNanosecondType, {
(Variant::TimestampMicros(t), Some(t.timestamp_micros()).map(|t| t * 1000)),
(Variant::TimestampNtzMicros(t), Some(t.and_utc().timestamp_micros()).map(|t| t * 1000)),
(Variant::TimestampNanos(t), t.timestamp_nanos_opt()),
(Variant::TimestampNtzNanos(t), t.and_utc().timestamp_nanos_opt()),
});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah this is tricky, I see what you mean.

Stepping back a bit:

  • NaiveDateTime methods seem to be deprecated in favor of DateTime<Utc>
  • The NTZ types only pass NaiveDateTime as a marker (they all have to call and_utc() to produce a DateTime<Utc> before doing anything else)
  • We can safely widen micros to nanos
  • We can "safely" convert a TZ type to an NTZ type (discards the timezone info) but not the other way around (wouldn't know which TZ to infer). However, it is a narrowing conversion which the Variant::as_xxx methods usually try to avoid. Maybe we shouldn't allow TZ <-> NTZ conversions at all?
  • But arrow doesn't distinguish physically between TZ and NTZ (that information is embedded in the DataType enum variants, not e.g. TimestampMicrosecondType), so it needs a from_variant implementation that works for both.
  • ArrowTimestampType::make_value looks quite useful, and it takes a NaiveDateTime as input. So maybe the correct approach will be to add Variant::as_timestamp[_ntz]_[micros|nanos] methods, which respectively return DateTime<Utc> and NaiveDateTime, and then make TimestampFromVariant generic over const NTZ: bool so we can actually have all four implementations we need.

type_name: get_type_name::<T>()
);

define_variant_to_primitive_builder!(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually it looks like we do need the TimestampFromVariant trait, in order to handle UTC vs. NTZ timestamps. See other comment.

Comment on lines 589 to 592
pub fn as_timestamp_micros(&self) -> Option<i64> {
match *self {
Variant::TimestampMicros(d) => Some(d.timestamp_micros()),
Variant::TimestampNtzMicros(d) => Some(d.and_utc().timestamp_micros()),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a lossy conversion... we probably need to keep separate normal vs. ntz versions of these methods, where e.g. as_timestamp_micros is not willing to work with Variant::TimestampNtzMicros. TBD whether e.g. as_timestamp_ntz_micros is willing to return a Variant::TimestampMicros (well-defined but technically lossy)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if I fully understand this right. If the lossy here means that we lost the timezone info, yes, it is. The timestamp here means the physically stored value(with type long) for the NaiveDateTime and DateTime<Utc>. If we return Option<NaiveDateTime>/Option<DateTime<Utc>> separate the ntz and tz versions is a better idea, but when the return value is Option<i64>(the underlying timestamp long value) then separate or not is the same?

Copy link
Member Author

@klion26 klion26 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@scovich, thanks for the review. will update the code according to the suggestion.

Comment on lines 589 to 592
pub fn as_timestamp_micros(&self) -> Option<i64> {
match *self {
Variant::TimestampMicros(d) => Some(d.timestamp_micros()),
Variant::TimestampNtzMicros(d) => Some(d.and_utc().timestamp_micros()),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if I fully understand this right. If the lossy here means that we lost the timezone info, yes, it is. The timestamp here means the physically stored value(with type long) for the NaiveDateTime and DateTime<Utc>. If we return Option<NaiveDateTime>/Option<DateTime<Utc>> separate the ntz and tz versions is a better idea, but when the return value is Option<i64>(the underlying timestamp long value) then separate or not is the same?

Comment on lines 58 to 67
};
($arrow_type:ty, $( $variant_type:pat => $variant_method:ident, $cast_fn:expr ),+ $(,)?) => {
impl TimestampFromVariant for $arrow_type {
fn from_variant(variant: &Variant<'_, '_>) -> Option<Self::Native> {
match variant {
$(
$variant_type => variant.$variant_method().map($cast_fn),
)+
_ => None
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I'll try with this way.

But the current code also allows invalid conversions, such as interpreting an NTZ timestamp as UTC, because the current as_timestamp_xxx methods are too narrow of a waist and lose information.

Does this mean the as_timestamp_xx itself or the end-to-end of the variant to arrow here? If it's the former, yes, it may be wrong (or maybe we can treat the return value as the physically stored value), if it's the latter, we'll attach the timezone info when initializing the builder

Comment on lines 95 to 105
impl_timestamp_from_variant!(TimestampMicrosecondType, {
(Variant::TimestampMicros(t), Some(t.timestamp_micros())),
(Variant::TimestampNtzMicros(t), Some(t.and_utc().timestamp_micros())),
});

impl_timestamp_from_variant!(TimestampNanosecondType, {
(Variant::TimestampMicros(t), Some(t.timestamp_micros()).map(|t| t * 1000)),
(Variant::TimestampNtzMicros(t), Some(t.and_utc().timestamp_micros()).map(|t| t * 1000)),
(Variant::TimestampNanos(t), t.timestamp_nanos_opt()),
(Variant::TimestampNtzNanos(t), t.and_utc().timestamp_nanos_opt()),
});
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can "safely" convert a TZ type to an NTZ type

No, maybe we can't do this, this will lead to the wrong result. The timestamp(the long value) for tz was calculated between the time with 1970-01-01 00:00:00 at +00:00, and NTZ was calculated between the time with 1970-01-01 00:00:00 in the local timezone.

But arrow doesn't distinguish physically between TZ and NTZ

IIUC, we don't need to distinguish these two when physically storing the value; they both are the timestamp between now and some time point (1970-01-01 00:00:00 at +00:00 for TZ, and 1970-01-01 00:00:00 in the local timezone` for NTZ )

So maybe the correct approach will be to add Variant::as_timestamp[ntz][micros|nanos] methods,

Separate the tz and ntz version Variant::as_timestamp[ntz][micro|nanos] that returns DateTime<Utc> and NaiveDateTime seems a better idea here.

Copy link
Member Author

@klion26 klion26 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@scovich I've updated the code, please take another look, thanks.

datatypes::TimestampMicrosecondType,
as_timestamp_ntz_micros,
ntz = true,
Self::make_value,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here use Self::make_value instread of make_value because I think Self::make_value is the whole function name, can change if needed

}

macro_rules! impl_timestamp_from_variant {
($timestamp_type:ty, $variant_method:ident, ntz=$ntz:ident, $cast_fn:expr $(,)?) => {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After this change, I removed the micro -> nano, because I'm not sure how to define the macro_rules here,

the macro rule in my head now is something like below

 ($timestamp_type:ty, $variant_method_a:ident, $(opt=$variant_method_b:ident,)? ntz=$ntz:ident, $cast_fn:expr $(,)?) => {
        impl TimestampFromVariant<{ $ntz }> for $timestamp_type {
            fn from_variant(variant: &Variant<'_, '_>) -> Option<Self::Native> {
                let value = variant.$variant_method_a();
                if value.is_some() {
                    return value.and_then($cast_fn);
                } else {
                    #[allow(unused_mut)]
                    let mut value = None;
                    $(
                        value = variant.$variant_method_b();
                    )?
                    return value.and_then($cast_fn);
                }
            }
        }
    };

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The micro -> nano conversion actually happens inside as_timestamp_[ntz_]nanos, I left a comment there about it.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks reasonable to me -- thank you @klion26 and @scovich

I didn't follow along with the details, but this PR seems like a step in the right direction to me, and we can always adjust in the future if we find we want to change something

@alamb
Copy link
Contributor

alamb commented Oct 6, 2025

I also found the code and comments and tests in this PR easy to read and understand. Thank you for the great work

Copy link
Contributor

@scovich scovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Few nits to consider.

Comment on lines 43 to 45
/// We can't use [`PrimitiveFromVariant`] directly because we might need to use methods that
/// are only available on [`ArrowTimestampType`] (such as with_timezone_opt)
pub(crate) trait TimestampFromVariant<const NTZ: bool>: ArrowTimestampType {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: AFAIK, with_timezone_opt isn't the reason we need this trait -- we could always use the trait bound PrimitiveFromVariant + ArrowTimestampType to get that. The problem is, we need two implementations for each type -- captured by the NTZ param here -- which the other trait cannot express

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

/// use chrono::NaiveDate;
///
/// // you can extract a DateTime<Utc> from a UTC-adjusted variant
/// let datetime = NaiveDate::from_ymd_opt(2025, 4, 16).unwrap().and_hms_nano_opt(12, 34, 56, 789123456).unwrap().and_utc();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: really long line...

Suggested change
/// let datetime = NaiveDate::from_ymd_opt(2025, 4, 16).unwrap().and_hms_nano_opt(12, 34, 56, 789123456).unwrap().and_utc();
/// let datetime = NaiveDate::from_ymd_opt(2025, 4, 16)
/// .unwrap()
/// .and_hms_nano_opt(12, 34, 56, 789123456)
/// .unwrap()
/// .and_utc();

(several others in the doc tests)

/// ```
pub fn as_timestamp_nanos(&self) -> Option<DateTime<Utc>> {
match *self {
Variant::TimestampNanos(d) => Some(d),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Variant::TimestampNanos(d) => Some(d),
Variant::TimestampNanos(d) | Variant::TimestampMicros(d) => Some(d),

(they both store DateTime<Utc>)

(again below for ntz nanos)

}

macro_rules! impl_timestamp_from_variant {
($timestamp_type:ty, $variant_method:ident, ntz=$ntz:ident, $cast_fn:expr $(,)?) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The micro -> nano conversion actually happens inside as_timestamp_[ntz_]nanos, I left a comment there about it.


/// Extension trait that `ArrowTimestampType` handle `DateTime<Utc>` like `NaiveDateTime`
trait MakeValueTz: ArrowTimestampType {
fn make_value_tz(timestamp: DateTime<Utc>) -> Option<i64> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note: this name is fine, but because it's a different trait we should also be able to "overload" make_value if you want.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for didn't add a comment for this, use make_value_tz here because it can't compile if I try to "overload" make_value, the compiler needs me to convert type to a specific type in MakeValueTz::make_value and macro impl_timestamp_from_variant(something like <Self as ArrowTimestampType>::make_value(...) and <Self as MakeValueTz>::make_value(...). and googled that seems Rust didn't support "overload" with same func name and different parameters. not sure if I missed anything here

Copy link
Member Author

@klion26 klion26 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@scovich I've addressed most of your comments, and reply for the MakeValueTz::make_value_tz name, please take another look, thanks.


/// Extension trait that `ArrowTimestampType` handle `DateTime<Utc>` like `NaiveDateTime`
trait MakeValueTz: ArrowTimestampType {
fn make_value_tz(timestamp: DateTime<Utc>) -> Option<i64> {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for didn't add a comment for this, use make_value_tz here because it can't compile if I try to "overload" make_value, the compiler needs me to convert type to a specific type in MakeValueTz::make_value and macro impl_timestamp_from_variant(something like <Self as ArrowTimestampType>::make_value(...) and <Self as MakeValueTz>::make_value(...). and googled that seems Rust didn't support "overload" with same func name and different parameters. not sure if I missed anything here

Comment on lines 43 to 45
/// We can't use [`PrimitiveFromVariant`] directly because we might need to use methods that
/// are only available on [`ArrowTimestampType`] (such as with_timezone_opt)
pub(crate) trait TimestampFromVariant<const NTZ: bool>: ArrowTimestampType {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet-variant parquet-variant* crates
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Variant] Add variant to arrow primitive support for boolean/timestamp/time
3 participants