-
Notifications
You must be signed in to change notification settings - Fork 1k
[Variant] Add variant to arrow primitive support for boolean/timestamp/time #8516
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
2fc17be
to
d6ee3ba
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for tackling this issue -- it's turning out to be more involved than I would have guessed. See in particular my comments about using a macro to define these builders, which should bring flexibility without boilerplate.
Note: I'm happy to help with macro-related things, as guided by your comfort level with macro magic. Just let me know how you'd prefer to approach this.
Variant::TimestampNanos(dt) => dt.timestamp_nanos_opt(), | ||
Variant::TimestampNtzNanos(ndt) => ndt.and_utc().timestamp_nanos_opt(), | ||
_ => None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could also widen micros to nanos here, if we wanted?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we can widen micros to nanos if wanted. will add this.
// The number of days from 0001-01-01 to 1970-01-01. | ||
const DAYS_FROM_CE_TO_UNIX_EPOCH: i32 = 719163; | ||
self.as_naive_date() | ||
.map(|d| d.num_days_from_ce() - DAYS_FROM_CE_TO_UNIX_EPOCH) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we just use Date32Type::to_naive_date
?
It has unwrap
calls internally, but I'm fairly certain it cannot actually panic because i32 range isn't big enough to overflow anything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will change to use it.
Did not notice this function. Choosing the current solution seems to have a better performance(vs a handle write logic like Date32Type::from_naive_date
). Using the existing function is better for maintenance, and we can improve the function if needed.
Float16(b) => b.append_value(value), | ||
Float32(b) => b.append_value(value), | ||
Float64(b) => b.append_value(value), | ||
Boolean(b) => b.append_value(value), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Why not ordered like the others?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, good catch, will change it
let target_type = DataType::Timestamp(TimeUnit::Microsecond, tz.clone()); | ||
|
||
TimestampMicro(VariantToPrimitiveArrowRowBuilder::new_with_target_type( | ||
cast_options, | ||
capacity, | ||
Some(target_type), | ||
)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I think it's actually smaller code (fewer lines and shorter lines) to not factor out target_type
?
let target_type = DataType::Timestamp(TimeUnit::Microsecond, tz.clone()); | |
TimestampMicro(VariantToPrimitiveArrowRowBuilder::new_with_target_type( | |
cast_options, | |
capacity, | |
Some(target_type), | |
)) | |
TimestampMicro(VariantToPrimitiveArrowRowBuilder::new_with_target_type( | |
cast_options, | |
capacity, | |
Some(DataType::Timestamp(TimeUnit::Microsecond, tz.clone())), | |
)) |
(again below)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... but actually, isn't this just:
let target_type = DataType::Timestamp(TimeUnit::Microsecond, tz.clone()); | |
TimestampMicro(VariantToPrimitiveArrowRowBuilder::new_with_target_type( | |
cast_options, | |
capacity, | |
Some(target_type), | |
)) | |
TimestampMicro(VariantToPrimitiveArrowRowBuilder::new_with_target_type( | |
cast_options, | |
capacity, | |
Some(data_type.clone()) | |
)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed it
if let Some(target_type) = self.target_data_type { | ||
let data = array.into_data(); | ||
let new_data = data.into_builder().data_type(target_type).build()?; | ||
let array_with_new_type = PrimitiveArray::<T>::from(new_data); | ||
return Ok(Arc::new(array_with_new_type)); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think I understand why we need this juggling of data types, when all the temporal types impl ArrowPrimitiveType?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh! It's reinstating timezone info that would otherwise be lost
let array: PrimitiveArray<T> = self.builder.finish(); | ||
|
||
if let Some(target_type) = self.target_data_type { | ||
let data = array.into_data(); | ||
let new_data = data.into_builder().data_type(target_type).build()?; | ||
let array_with_new_type = PrimitiveArray::<T>::from(new_data); | ||
return Ok(Arc::new(array_with_new_type)); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let array: PrimitiveArray<T> = self.builder.finish(); | |
if let Some(target_type) = self.target_data_type { | |
let data = array.into_data(); | |
let new_data = data.into_builder().data_type(target_type).build()?; | |
let array_with_new_type = PrimitiveArray::<T>::from(new_data); | |
return Ok(Arc::new(array_with_new_type)); | |
} | |
let mut array = self.builder.finish(); | |
if let Some(target_type) = self.target_data_type { | |
let data = array.into_data(); | |
let new_data = data.into_builder().data_type(target_type).build()?; | |
array = new_data.into(); | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix it
fn new_with_target_type( | ||
cast_options: &'a CastOptions<'a>, | ||
capacity: usize, | ||
target_data_type: Option<DataType>, | ||
) -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be safer/cleaner to limit this to timestamp types only?
impl<'a, T: ArrowTimestampType> VariantToPrimitiveArrowRowBuilder<'a, T> {
fn with_target_type(mut self, data_type: DataType) -> Self {
self.target_data_type = Some(data_type)
self
}
}
and then e.g.
DataType::Timestamp(TimeUnit::Microsecond, None) => TimestampMicro(
VariantToPrimitiveArrowRowBuilder::new(cast_options, capacity)
),
DataType::Timestamp(TimeUnit::Microsecond, _) => TimestampMicro(
VariantToPrimitiveArrowRowBuilder::new(cast_options, capacity)
.with_target_type(data_type.clone()) // preserve timezone info
),
(If the array -> data -> builder -> data -> array conversion is super cheap, then we don't need the first match arm above)
I don't know a good way to factor out or isolate the other c ode paths, but at least this would prevent them from activating for any unexpected types?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it's better than the current solution, will improve it.
Seems ArrayDataBuiler::build
may consume some time, will add a separate match arm for this
ArrayDataBuilder::build(self) -> Result<ArrayData, ArrowError> {
...
if align_buffers {
data.align_buffers();
}
// SAFETY: `skip_validation` is only set to true using `unsafe` APIs
if !skip_validation.get() || cfg!(feature = "force_validate") {
data.validate_data()?;
}
...
}
"arrow_array::types::TimestampMicrosecondType" => "Timestamp(Microsecond)", | ||
"arrow_array::types::TimestampNanosecondType" => "Timestamp(Nanosecond)", | ||
_ => "Unknown", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing the boolean case?
But actually, I suspect we can get just rid of get_type_name
, relying on T::DATA_TYPE
and the shiny new impl Display for DataType
instead?
(that cleanup would probably be a separate PR tho)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, the function get_type_name
is only used in VariantToPrimitiveArrowRowBuilder
, and Boolean
is another separate builder, so I didn't add boolean
here.
Will follow the cleanup logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Created an issue(#8538) to track the cleanup logic
pub(crate) struct VariantToBooleanArrowRowBuilder<'a> { | ||
builder: arrow::array::BooleanBuilder, | ||
cast_options: &'a CastOptions<'a>, | ||
} | ||
|
||
impl<'a> VariantToBooleanArrowRowBuilder<'a> { | ||
fn new(cast_options: &'a CastOptions<'a>, capacity: usize) -> Self { | ||
Self { | ||
builder: arrow::array::BooleanBuilder::with_capacity(capacity), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is virtually identical to the primitive row builder... AFAICT the only syntactic differences are:
- The generics (or lack there of)
- The array builder's type name
- The method invoked to convert a variant value to the target type
I almost wonder if it would be worthwhile to define a macro that can cover both, similar to define_row_builder!
macro in arrow_to_variant.rs?
- Arrow decimal data types are not primitive and will eventually need similar treatment?
- Could more easily define temporal builders with the necessary logic without "polluting" normal primitive builders (tho the custom finisher logic would be add extra complexity to an already complex macro, so maybe not worth it)
Invoking such a macro might look like:
define_row_builder!(
struct VariantToBooleanArrowRowBuilder<'a> {
builder: BooleanBuilder,
},
|v| v.as_boolean()
)
and
define_row_builder!(
struct VariantToPrimitiveArrowRowBuilder<'a, T: ArrowPrimitiveType>
where
for<'m, 'v> Variant<'m, 'v>: VariantAsPrimitive<T>,
{
builder: PrimitiveBuilder<T>,
},
|v| v.as_primitive()
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, this little exploration exposed a bad (IMO) trait design, I made a quick PR to fix it:
After that fix merges, the macro should be vastly easier to define -- basically copy+paste the definition from the other module and tweak it as needed.
And then you can use the macro to define not only timestamp and boolean builders, but also builders for the other temporal and decimal types.
We would still need to figure out the best way to handle that time zone issue, tho.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We would still need to figure out the best way to handle that time zone issue, tho.
Oh... PrimitiveBuilder::with_timezone_opt looks rather ideal!
In terms of the other macro, the |array|
"lambda" arg would change to something like this for normal primitives:
|capacity| -> PrimitiveArrayBuilder<T> { PrimitiveArrayBuilder::with_capacity(capacity) }
and this for the timestamp types:
|capacity, tz: Option<Arc<str>>| -> PrimitiveArrayBuilder<T> {
PrimitiveArrayBuilder::with_capacity(capacity).with_timezone_opt(tz)
}
... which "merely" requires teaching the "lambda" to accept additional optional args.
Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Putting it all together, boolean would be:
define_row_builder!(
struct VariantToBooleanArrowRowBuilder<'a>,
|capacity| -> BooleanArrayBuilder { BooleanArrayBuilder::with_capacity(capacity) },
|value| value.as_boolean()
)
while "simple" primitives would be:
define_row_builder!(
struct VariantToPrimitiveArrowRowBuilder<'a, T: PrimitiveFromVariant>,
|capacity| -> PrimitiveArrayBuilder<T> { PrimitiveArrayBuilder::with_capacity(capacity) },
|value| T::from_variant(value)
)
... and timestamps would be:
define_row_builder!(
struct VariantToTimestampArrowRowBuilder<'a, T: ArrowTimestampType>,
|capacity, tz: Option<Arc<str>>| -> PrimitiveArrayBuilder<T> {
PrimitiveArrayBuilder::with_capacity(capacity).with_timezone_opt(tz)
},
|value| value.as_int64()
)
NOTE: timestamp without timezone technically has the choice whether to use VariantToPrimitiveArrowRowBuilder
or VariantToTimestampArrowRowBuilder
-- both are defined and should produce the same result.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks very much for the detailed description; it is definitely much better than the current solution. will try this way.
I've tried to make VariantToPrimitiveArrowRowBuilder
generic by extracting the value.as_primitive
in append_value
to some converter
so that boolean just needs to implement a converter
(klion26@7533c7e), but stick with the current version as it did not seem much better.
…8519) # Which issue does this PR close? - Relates to #8515 - Relates to #8516 # Rationale for this change The existing `VariantAsPrimitive` trait is kind of "backward" and requires very complex type bounds to use, e.g.: ```rust impl<'a, T> VariantToPrimitiveArrowRowBuilder<'a, T> where T: ArrowPrimitiveType, for<'m, 'v> Variant<'m, 'v>: VariantAsPrimitive<T>, ``` This is a code smell. # What changes are included in this PR? "Reverse" the trait -- instead of extending `Variant`, the trait extends `T: ArrowPrimitiveType`. The resulting type bounds are much more intuitive: ```rust impl<'a, T: PrimitiveFromVariant> VariantToPrimitiveArrowRowBuilder<'a, T> ``` # Are these changes tested? Existing unit tests cover this refactor. # Are there any user-facing changes? No.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@scovich Thank you very much for the detailed review. I'll try implementing the macro (reference to define_row_builder
in arrow_to_variant.rs
) first and will ask for your help if I can't do it. Thank you again!
Variant::TimestampNanos(dt) => dt.timestamp_nanos_opt(), | ||
Variant::TimestampNtzNanos(ndt) => ndt.and_utc().timestamp_nanos_opt(), | ||
_ => None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we can widen micros to nanos if wanted. will add this.
// The number of days from 0001-01-01 to 1970-01-01. | ||
const DAYS_FROM_CE_TO_UNIX_EPOCH: i32 = 719163; | ||
self.as_naive_date() | ||
.map(|d| d.num_days_from_ce() - DAYS_FROM_CE_TO_UNIX_EPOCH) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will change to use it.
Did not notice this function. Choosing the current solution seems to have a better performance(vs a handle write logic like Date32Type::from_naive_date
). Using the existing function is better for maintenance, and we can improve the function if needed.
Float16(b) => b.append_value(value), | ||
Float32(b) => b.append_value(value), | ||
Float64(b) => b.append_value(value), | ||
Boolean(b) => b.append_value(value), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, good catch, will change it
"arrow_array::types::TimestampMicrosecondType" => "Timestamp(Microsecond)", | ||
"arrow_array::types::TimestampNanosecondType" => "Timestamp(Nanosecond)", | ||
_ => "Unknown", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, the function get_type_name
is only used in VariantToPrimitiveArrowRowBuilder
, and Boolean
is another separate builder, so I didn't add boolean
here.
Will follow the cleanup logic.
pub(crate) struct VariantToBooleanArrowRowBuilder<'a> { | ||
builder: arrow::array::BooleanBuilder, | ||
cast_options: &'a CastOptions<'a>, | ||
} | ||
|
||
impl<'a> VariantToBooleanArrowRowBuilder<'a> { | ||
fn new(cast_options: &'a CastOptions<'a>, capacity: usize) -> Self { | ||
Self { | ||
builder: arrow::array::BooleanBuilder::with_capacity(capacity), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks very much for the detailed description; it is definitely much better than the current solution. will try this way.
I've tried to make VariantToPrimitiveArrowRowBuilder
generic by extracting the value.as_primitive
in append_value
to some converter
so that boolean just needs to implement a converter
(klion26@7533c7e), but stick with the current version as it did not seem much better.
fn new_with_target_type( | ||
cast_options: &'a CastOptions<'a>, | ||
capacity: usize, | ||
target_data_type: Option<DataType>, | ||
) -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it's better than the current solution, will improve it.
Seems ArrayDataBuiler::build
may consume some time, will add a separate match arm for this
ArrayDataBuilder::build(self) -> Result<ArrayData, ArrowError> {
...
if align_buffers {
data.align_buffers();
}
// SAFETY: `skip_validation` is only set to true using `unsafe` APIs
if !skip_validation.get() || cfg!(feature = "force_validate") {
data.validate_data()?;
}
...
}
let array: PrimitiveArray<T> = self.builder.finish(); | ||
|
||
if let Some(target_type) = self.target_data_type { | ||
let data = array.into_data(); | ||
let new_data = data.into_builder().data_type(target_type).build()?; | ||
let array_with_new_type = PrimitiveArray::<T>::from(new_data); | ||
return Ok(Arc::new(array_with_new_type)); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix it
let target_type = DataType::Timestamp(TimeUnit::Microsecond, tz.clone()); | ||
|
||
TimestampMicro(VariantToPrimitiveArrowRowBuilder::new_with_target_type( | ||
cast_options, | ||
capacity, | ||
Some(target_type), | ||
)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed it
7e7e843
to
a003ed0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@scovich I've updated the code, please take another look, thanks.
type_name: get_type_name::<T>() | ||
); | ||
|
||
define_variant_to_primitive_builder!( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The type of T
needs to be ArrowTimestampType
because the with_timezone_opt
function only exists for ArrowTimestampType
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm pretty sure we don't actually need a trait for that, and instead can do:
struct VariantToTimestampArrowRowBuilder<'a, T:PrimitiveFromVariant + ArrowTimestampType>
PrimitiveFromVariant
provides from_variant
, and ArrowTimestampType
provides with_timezone_opt
.
However, the rust macro system is pretty bad at trait bounds involving multiple types, so we'd either need to add where
support to the macro (similar to the other macro):
struct VariantToTimestampArrowRowBuilder<'a, T:PrimitiveFromVariant>
where T: ArrowTimestampType,
or we'd keep TimestampFromVariant
as a simple marker trait:
trait TimestampFromVariant: PrimitiveFromVariant + ArrowTimestampType {}
(where
support in the macro is cleaner IMO)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually it looks like we do need the TimestampFromVariant
trait, in order to handle UTC vs. NTZ timestamps. See other comment.
"arrow_array::types::TimestampMicrosecondType" => "Timestamp(Microsecond)", | ||
"arrow_array::types::TimestampNanosecondType" => "Timestamp(Nanosecond)", | ||
_ => "Unknown", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Created an issue(#8538) to track the cleanup logic
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking really nice!
I think it can be simplified even further?
fn from_variant(variant: &Variant<'_, '_>) -> Option<Self::Native> { | ||
match variant { | ||
$( | ||
$variant_pattern => $conversion, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting... I thought it wasn't possible for a macro to create match arms.
But I guess it works here because the macro creates the whole match?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't notice this before, but I did add _ => None
at the end
impl_timestamp_from_variant!(TimestampMicrosecondType, { | ||
(Variant::TimestampMicros(t), Some(t.timestamp_micros())), | ||
(Variant::TimestampNtzMicros(t), Some(t.and_utc().timestamp_micros())), | ||
}); | ||
|
||
impl_timestamp_from_variant!(TimestampNanosecondType, { | ||
(Variant::TimestampMicros(t), Some(t.timestamp_micros()).map(|t| t * 1000)), | ||
(Variant::TimestampNtzMicros(t), Some(t.and_utc().timestamp_micros()).map(|t| t * 1000)), | ||
(Variant::TimestampNanos(t), t.timestamp_nanos_opt()), | ||
(Variant::TimestampNtzNanos(t), t.and_utc().timestamp_nanos_opt()), | ||
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't these be Variant::as_timestamp[ntz][micros|nanos] methods?
Why defined here instead?
(defining it here means we can't just use impl_primitive_from_variant!
macro)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added as_timestamp_micors/nanos
in Variant.rs
in a separate commit, and simply the impl_timestamp_from_variant
for timestamp, so we can keep it or drop it easily.
I didn't add as_timestamp_micors/nonos
in there because all of the as_xx
functions in Variant.rs
are used to extract the direct value(NaiveDateTime/DateTime for Variant::Timestamp) in the Variant currently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah this is tricky, I see what you mean.
Stepping back a bit:
NaiveDateTime
methods seem to be deprecated in favor ofDateTime<Utc>
- The NTZ types only pass
NaiveDateTime
as a marker (they all have to calland_utc()
to produce aDateTime<Utc>
before doing anything else) - We can safely widen micros to nanos
- We can "safely" convert a TZ type to an NTZ type (discards the timezone info) but not the other way around (wouldn't know which TZ to infer). However, it is a narrowing conversion which the
Variant::as_xxx
methods usually try to avoid. Maybe we shouldn't allow TZ <-> NTZ conversions at all? - But arrow doesn't distinguish physically between TZ and NTZ (that information is embedded in the
DataType
enum variants, not e.g.TimestampMicrosecondType
), so it needs afrom_variant
implementation that works for both. - ArrowTimestampType::make_value looks quite useful, and it takes a
NaiveDateTime
as input. So maybe the correct approach will be to addVariant::as_timestamp[_ntz]_[micros|nanos]
methods, which respectively returnDateTime<Utc>
andNaiveDateTime
, and then makeTimestampFromVariant
generic overconst NTZ: bool
so we can actually have all four implementations we need.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can "safely" convert a TZ type to an NTZ type
No, maybe we can't do this, this will lead to the wrong result. The timestamp(the long value) for tz was calculated between the time with 1970-01-01 00:00:00 at +00:00
, and NTZ was calculated between the time with 1970-01-01 00:00:00 in the local timezone
.
But arrow doesn't distinguish physically between TZ and NTZ
IIUC, we don't need to distinguish these two when physically storing the value; they both are the timestamp between now and some time point (1970-01-01 00:00:00 at +00:00 for TZ, and
1970-01-01 00:00:00 in the local timezone` for NTZ )
So maybe the correct approach will be to add Variant::as_timestamp[ntz][micros|nanos] methods,
Separate the tz and ntz version Variant::as_timestamp[ntz][micro|nanos] that returns DateTime<Utc>
and NaiveDateTime
seems a better idea here.
($timestamp_type:ty, { | ||
$(($variant_pattern:pat, $conversion:expr)),+ $(,)? | ||
}) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm pretty sure this works?
($timestamp_type:ty, { | |
$(($variant_pattern:pat, $conversion:expr)),+ $(,)? | |
}) => { | |
( | |
$timestamp_type:ty, | |
$( $variant_pattern:pat => $conversion:expr ),+ $(,)? | |
) => { |
Use sites would look like this:
impl_timestamp_from_variant!(
TimestampMicrosecondType,
Variant::TimestampMicros(t) => Some(t.timestamp_micros()),
Variant::TimestampNtzMicros(t) => Some(t.and_utc().timestamp_micros()),
);
$array_param: usize | ||
// add this so that $init_expr can use it | ||
$(, $field: $field_type)?) -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know nobody actually sees the emitted code, but it seems nicer to do normal commas here?
$array_param: usize | |
// add this so that $init_expr can use it | |
$(, $field: $field_type)?) -> Self { | |
$array_param: usize, | |
// add this so that $init_expr can use it | |
$( $field: $field_type, )? | |
) -> Self { |
6040b40
to
807ac5c
Compare
@scovich thanks for the review, I've updated the code, please take another look, thanks. |
6ea256c
to
454810b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow, timezones are so tricky and annoying sometimes...
type_name: get_type_name::<T>() | ||
); | ||
|
||
define_variant_to_primitive_builder!( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm pretty sure we don't actually need a trait for that, and instead can do:
struct VariantToTimestampArrowRowBuilder<'a, T:PrimitiveFromVariant + ArrowTimestampType>
PrimitiveFromVariant
provides from_variant
, and ArrowTimestampType
provides with_timezone_opt
.
However, the rust macro system is pretty bad at trait bounds involving multiple types, so we'd either need to add where
support to the macro (similar to the other macro):
struct VariantToTimestampArrowRowBuilder<'a, T:PrimitiveFromVariant>
where T: ArrowTimestampType,
or we'd keep TimestampFromVariant
as a simple marker trait:
trait TimestampFromVariant: PrimitiveFromVariant + ArrowTimestampType {}
(where
support in the macro is cleaner IMO)
}; | ||
($arrow_type:ty, $( $variant_type:pat => $variant_method:ident, $cast_fn:expr ),+ $(,)?) => { | ||
impl TimestampFromVariant for $arrow_type { | ||
fn from_variant(variant: &Variant<'_, '_>) -> Option<Self::Native> { | ||
match variant { | ||
$( | ||
$variant_type => variant.$variant_method().map($cast_fn), | ||
)+ | ||
_ => None | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
}; | |
($arrow_type:ty, $( $variant_type:pat => $variant_method:ident, $cast_fn:expr ),+ $(,)?) => { | |
impl TimestampFromVariant for $arrow_type { | |
fn from_variant(variant: &Variant<'_, '_>) -> Option<Self::Native> { | |
match variant { | |
$( | |
$variant_type => variant.$variant_method().map($cast_fn), | |
)+ | |
_ => None | |
} | |
}; | |
} | |
macro_rules! impl_timestamp_from_variant { | |
($timestamp_type:ty, $variant_method:ident, ntz=$ntz) => { | |
impl TimestampFromVariant<$ntz> for $timestamp_type { | |
fn from_variant(variant: &Variant<'_, '_>) -> Option<Self::Native> { | |
variant.$variant_method().map(Self::make_value) |
where
pub(crate) trait TimestampFromVariant<const NTZ: bool>: ArrowTimestampType {
and
/// Extension trait that lets `ArrowTimestampType` handle `DateTime<Utc>` like `NaiveDateTime`
trait MakeValueTz: ArrowTimestampType {
fn make_value(timestamp: DateTime<Utc>) -> Option<i64> {
Self::make_value(*timestamp_type.naive_utc())
}
}
impl<T: ArrowTimestampType> MakeValueTz for T {}
See #8516 (comment) for context -- this would take a generic const bool param, and would no longer need the pattern matching because each impl could call the appropriate Variant::as_xxx
method, e.g.:
impl_timestamp_from_variant!(
datatypes::TimestampMicrosecondType,
as_timestamp_ntz_micros,
ntz=true
);
impl_timestamp_from_variant!(
datatypes::TimestampMicrosecondType,
as_timestamp_micros,
ntz=false
);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The above approach has one big annoyance, tho -- it would require defining two timestamp row builder types, one timezone-aware and the other not (= four enum variants). Where the current code only needs one (= two enum variants).
define_variant_to_primitive_builder
struct VariantToTimestampArrowRowBuilder<'a, T: TimestampFromVariant<false>>
|capacity, tz: Arc<str> | -> PrimitiveBuilder<T> {
PrimitiveBuilder::<T>::with_capacity(capacity).with_timezone(tz)
},
|value| T::from_variant(value),
type_name: get_type_name::<T>()
);
define_variant_to_primitive_builder
struct VariantToTimestampNtzArrowRowBuilder<'a, T: TimestampFromVariant<true>>
|capacity| -> PrimitiveBuilder<T> { PrimitiveBuilder::<T>::with_capacity(capacity) },
|value| T::from_variant(value),
type_name: get_type_name::<T>()
);
But the current code also allows invalid conversions, such as interpreting an NTZ timestamp as UTC, because the current as_timestamp_xxx
methods are too narrow of a waist and lose information.
@alamb what's your take? Am I overthinking this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I'll try with this way.
But the current code also allows invalid conversions, such as interpreting an NTZ timestamp as UTC, because the current as_timestamp_xxx methods are too narrow of a waist and lose information.
Does this mean the as_timestamp_xx
itself or the end-to-end of the variant to arrow here? If it's the former, yes, it may be wrong (or maybe we can treat the return value as the physically stored value), if it's the latter, we'll attach the timezone info when initializing the builder
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, we can't treat the return value as the physically stored value, the Enum Variant contains TimestampMicros/TimestampNtzMicros/TimestampNanos/TimestampNtzNanos
, they contains NaiveDatetime/DateTime<Utc>
, if we return the underlying pyysically stored value, it will confuse the caller of these functions.
impl_timestamp_from_variant!(TimestampMicrosecondType, { | ||
(Variant::TimestampMicros(t), Some(t.timestamp_micros())), | ||
(Variant::TimestampNtzMicros(t), Some(t.and_utc().timestamp_micros())), | ||
}); | ||
|
||
impl_timestamp_from_variant!(TimestampNanosecondType, { | ||
(Variant::TimestampMicros(t), Some(t.timestamp_micros()).map(|t| t * 1000)), | ||
(Variant::TimestampNtzMicros(t), Some(t.and_utc().timestamp_micros()).map(|t| t * 1000)), | ||
(Variant::TimestampNanos(t), t.timestamp_nanos_opt()), | ||
(Variant::TimestampNtzNanos(t), t.and_utc().timestamp_nanos_opt()), | ||
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah this is tricky, I see what you mean.
Stepping back a bit:
NaiveDateTime
methods seem to be deprecated in favor ofDateTime<Utc>
- The NTZ types only pass
NaiveDateTime
as a marker (they all have to calland_utc()
to produce aDateTime<Utc>
before doing anything else) - We can safely widen micros to nanos
- We can "safely" convert a TZ type to an NTZ type (discards the timezone info) but not the other way around (wouldn't know which TZ to infer). However, it is a narrowing conversion which the
Variant::as_xxx
methods usually try to avoid. Maybe we shouldn't allow TZ <-> NTZ conversions at all? - But arrow doesn't distinguish physically between TZ and NTZ (that information is embedded in the
DataType
enum variants, not e.g.TimestampMicrosecondType
), so it needs afrom_variant
implementation that works for both. - ArrowTimestampType::make_value looks quite useful, and it takes a
NaiveDateTime
as input. So maybe the correct approach will be to addVariant::as_timestamp[_ntz]_[micros|nanos]
methods, which respectively returnDateTime<Utc>
andNaiveDateTime
, and then makeTimestampFromVariant
generic overconst NTZ: bool
so we can actually have all four implementations we need.
type_name: get_type_name::<T>() | ||
); | ||
|
||
define_variant_to_primitive_builder!( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually it looks like we do need the TimestampFromVariant
trait, in order to handle UTC vs. NTZ timestamps. See other comment.
parquet-variant/src/variant.rs
Outdated
pub fn as_timestamp_micros(&self) -> Option<i64> { | ||
match *self { | ||
Variant::TimestampMicros(d) => Some(d.timestamp_micros()), | ||
Variant::TimestampNtzMicros(d) => Some(d.and_utc().timestamp_micros()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a lossy conversion... we probably need to keep separate normal vs. ntz versions of these methods, where e.g. as_timestamp_micros
is not willing to work with Variant::TimestampNtzMicros
. TBD whether e.g. as_timestamp_ntz_micros
is willing to return a Variant::TimestampMicros
(well-defined but technically lossy)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if I fully understand this right. If the lossy
here means that we lost the timezone info, yes, it is. The timestamp
here means the physically stored value(with type long) for the NaiveDateTime
and DateTime<Utc>
. If we return Option<NaiveDateTime>/Option<DateTime<Utc>>
separate the ntz and tz versions is a better idea, but when the return value is Option<i64>
(the underlying timestamp long value) then separate or not is the same?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@scovich, thanks for the review. will update the code according to the suggestion.
parquet-variant/src/variant.rs
Outdated
pub fn as_timestamp_micros(&self) -> Option<i64> { | ||
match *self { | ||
Variant::TimestampMicros(d) => Some(d.timestamp_micros()), | ||
Variant::TimestampNtzMicros(d) => Some(d.and_utc().timestamp_micros()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if I fully understand this right. If the lossy
here means that we lost the timezone info, yes, it is. The timestamp
here means the physically stored value(with type long) for the NaiveDateTime
and DateTime<Utc>
. If we return Option<NaiveDateTime>/Option<DateTime<Utc>>
separate the ntz and tz versions is a better idea, but when the return value is Option<i64>
(the underlying timestamp long value) then separate or not is the same?
}; | ||
($arrow_type:ty, $( $variant_type:pat => $variant_method:ident, $cast_fn:expr ),+ $(,)?) => { | ||
impl TimestampFromVariant for $arrow_type { | ||
fn from_variant(variant: &Variant<'_, '_>) -> Option<Self::Native> { | ||
match variant { | ||
$( | ||
$variant_type => variant.$variant_method().map($cast_fn), | ||
)+ | ||
_ => None | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I'll try with this way.
But the current code also allows invalid conversions, such as interpreting an NTZ timestamp as UTC, because the current as_timestamp_xxx methods are too narrow of a waist and lose information.
Does this mean the as_timestamp_xx
itself or the end-to-end of the variant to arrow here? If it's the former, yes, it may be wrong (or maybe we can treat the return value as the physically stored value), if it's the latter, we'll attach the timezone info when initializing the builder
impl_timestamp_from_variant!(TimestampMicrosecondType, { | ||
(Variant::TimestampMicros(t), Some(t.timestamp_micros())), | ||
(Variant::TimestampNtzMicros(t), Some(t.and_utc().timestamp_micros())), | ||
}); | ||
|
||
impl_timestamp_from_variant!(TimestampNanosecondType, { | ||
(Variant::TimestampMicros(t), Some(t.timestamp_micros()).map(|t| t * 1000)), | ||
(Variant::TimestampNtzMicros(t), Some(t.and_utc().timestamp_micros()).map(|t| t * 1000)), | ||
(Variant::TimestampNanos(t), t.timestamp_nanos_opt()), | ||
(Variant::TimestampNtzNanos(t), t.and_utc().timestamp_nanos_opt()), | ||
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can "safely" convert a TZ type to an NTZ type
No, maybe we can't do this, this will lead to the wrong result. The timestamp(the long value) for tz was calculated between the time with 1970-01-01 00:00:00 at +00:00
, and NTZ was calculated between the time with 1970-01-01 00:00:00 in the local timezone
.
But arrow doesn't distinguish physically between TZ and NTZ
IIUC, we don't need to distinguish these two when physically storing the value; they both are the timestamp between now and some time point (1970-01-01 00:00:00 at +00:00 for TZ, and
1970-01-01 00:00:00 in the local timezone` for NTZ )
So maybe the correct approach will be to add Variant::as_timestamp[ntz][micros|nanos] methods,
Separate the tz and ntz version Variant::as_timestamp[ntz][micro|nanos] that returns DateTime<Utc>
and NaiveDateTime
seems a better idea here.
1 extract builder constructing logic to macro_rules 2 add micro -> nano test
…e impl of for timestamp types
84e50c2
to
d1a7d06
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@scovich I've updated the code, please take another look, thanks.
datatypes::TimestampMicrosecondType, | ||
as_timestamp_ntz_micros, | ||
ntz = true, | ||
Self::make_value, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here use Self::make_value
instread of make_value
because I think Self::make_value
is the whole function name, can change if needed
} | ||
|
||
macro_rules! impl_timestamp_from_variant { | ||
($timestamp_type:ty, $variant_method:ident, ntz=$ntz:ident, $cast_fn:expr $(,)?) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After this change, I removed the micro
-> nano
, because I'm not sure how to define the macro_rules
here,
the macro rule in my head now is something like below
($timestamp_type:ty, $variant_method_a:ident, $(opt=$variant_method_b:ident,)? ntz=$ntz:ident, $cast_fn:expr $(,)?) => {
impl TimestampFromVariant<{ $ntz }> for $timestamp_type {
fn from_variant(variant: &Variant<'_, '_>) -> Option<Self::Native> {
let value = variant.$variant_method_a();
if value.is_some() {
return value.and_then($cast_fn);
} else {
#[allow(unused_mut)]
let mut value = None;
$(
value = variant.$variant_method_b();
)?
return value.and_then($cast_fn);
}
}
}
};
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The micro -> nano conversion actually happens inside as_timestamp_[ntz_]nanos
, I left a comment there about it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also found the code and comments and tests in this PR easy to read and understand. Thank you for the great work |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! Few nits to consider.
/// We can't use [`PrimitiveFromVariant`] directly because we might need to use methods that | ||
/// are only available on [`ArrowTimestampType`] (such as with_timezone_opt) | ||
pub(crate) trait TimestampFromVariant<const NTZ: bool>: ArrowTimestampType { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: AFAIK, with_timezone_opt
isn't the reason we need this trait -- we could always use the trait bound PrimitiveFromVariant + ArrowTimestampType
to get that. The problem is, we need two implementations for each type -- captured by the NTZ
param here -- which the other trait cannot express
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
parquet-variant/src/variant.rs
Outdated
/// use chrono::NaiveDate; | ||
/// | ||
/// // you can extract a DateTime<Utc> from a UTC-adjusted variant | ||
/// let datetime = NaiveDate::from_ymd_opt(2025, 4, 16).unwrap().and_hms_nano_opt(12, 34, 56, 789123456).unwrap().and_utc(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: really long line...
/// let datetime = NaiveDate::from_ymd_opt(2025, 4, 16).unwrap().and_hms_nano_opt(12, 34, 56, 789123456).unwrap().and_utc(); | |
/// let datetime = NaiveDate::from_ymd_opt(2025, 4, 16) | |
/// .unwrap() | |
/// .and_hms_nano_opt(12, 34, 56, 789123456) | |
/// .unwrap() | |
/// .and_utc(); |
(several others in the doc tests)
parquet-variant/src/variant.rs
Outdated
/// ``` | ||
pub fn as_timestamp_nanos(&self) -> Option<DateTime<Utc>> { | ||
match *self { | ||
Variant::TimestampNanos(d) => Some(d), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Variant::TimestampNanos(d) => Some(d), | |
Variant::TimestampNanos(d) | Variant::TimestampMicros(d) => Some(d), |
(they both store DateTime<Utc>
)
(again below for ntz nanos)
} | ||
|
||
macro_rules! impl_timestamp_from_variant { | ||
($timestamp_type:ty, $variant_method:ident, ntz=$ntz:ident, $cast_fn:expr $(,)?) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The micro -> nano conversion actually happens inside as_timestamp_[ntz_]nanos
, I left a comment there about it.
|
||
/// Extension trait that `ArrowTimestampType` handle `DateTime<Utc>` like `NaiveDateTime` | ||
trait MakeValueTz: ArrowTimestampType { | ||
fn make_value_tz(timestamp: DateTime<Utc>) -> Option<i64> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note: this name is fine, but because it's a different trait we should also be able to "overload" make_value
if you want.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for didn't add a comment for this, use make_value_tz
here because it can't compile if I try to "overload" make_value
, the compiler needs me to convert type to a specific type in MakeValueTz::make_value
and macro impl_timestamp_from_variant
(something like <Self as ArrowTimestampType>::make_value(...)
and <Self as MakeValueTz>::make_value(...)
. and googled that seems Rust didn't support "overload" with same func name and different parameters. not sure if I missed anything here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@scovich I've addressed most of your comments, and reply for the MakeValueTz::make_value_tz
name, please take another look, thanks.
|
||
/// Extension trait that `ArrowTimestampType` handle `DateTime<Utc>` like `NaiveDateTime` | ||
trait MakeValueTz: ArrowTimestampType { | ||
fn make_value_tz(timestamp: DateTime<Utc>) -> Option<i64> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for didn't add a comment for this, use make_value_tz
here because it can't compile if I try to "overload" make_value
, the compiler needs me to convert type to a specific type in MakeValueTz::make_value
and macro impl_timestamp_from_variant
(something like <Self as ArrowTimestampType>::make_value(...)
and <Self as MakeValueTz>::make_value(...)
. and googled that seems Rust didn't support "overload" with same func name and different parameters. not sure if I missed anything here
/// We can't use [`PrimitiveFromVariant`] directly because we might need to use methods that | ||
/// are only available on [`ArrowTimestampType`] (such as with_timezone_opt) | ||
pub(crate) trait TimestampFromVariant<const NTZ: bool>: ArrowTimestampType { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
Which issue does this PR close?
What changes are included in this PR?
define_variant_to_primitive_builder
used to construct the variant to x arrow row builderVariantToBooleanArrowRowBuilder
/VariantToPrimitiveArrowRowBuilder
/VariantToTimestampArrowRowBuilder
using the macrodefine_variant_to_primitive_builder
Variant::Timestamp/Time
(timestamp will automatic widen micros to nanos)Variant::{int8/float32/float64}
for existing codeAre these changes tested?
Added tests
Are there any user-facing changes?
If there are user-facing changes then we may require documentation to be updated before approving the PR.
If there are any breaking changes to public APIs, please call them out.