-
Notifications
You must be signed in to change notification settings - Fork 1k
Add Field
methods to clear extension types, remove experimental
warning from extension
module
#8350
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. Thanks @mbrobbel
THANK YOU! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
#[inline] | ||
pub fn set_data_type(&mut self, data_type: DataType) { | ||
self.data_type = data_type; | ||
self.clear_extension_type(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was initially worried that this would obliterate metadata that previously had been propagated, but I couldn't find any existing usage of this function (I don't even think it's tested in arrow-rs) and so I think it's probably not a problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If nobody actually uses set_data_type
, can we deprecate and remove it?
It's kind of a weird API surface, especially with the new extension type semantics
/// ``` | ||
pub fn with_data_type(mut self, data_type: DataType) -> Self { | ||
self.set_data_type(data_type); | ||
self.without_extension_type() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I missed this.
This will probably obliterate metadata in some places. There's a bunch of code in datafusion that looks like this:
let new_fields = fields
.into_iter()
.zip(new_data_types)
.map(|(f, d)| f.as_ref().clone().with_data_type(d))
.map(Arc::new)
.collect::<Vec<FieldRef>>();
...where d
is often unchanged but the outer field may have contained metadata. Sometimes the data type change is compatible with an extension type (e.g., reconciling String and StringViews), sometimes it is not (e.g., if there is any code that reconciles FixedSizeBinary and Binary inputs, which there may not be).
Using a Field
as a data type is a constant whack-a-mole of these types of issues but I think this might be safer propagating the extension metadata until downstream projects have a chance to move away from that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change seems really dangerous... it will silently strip away the extension type information, which can propagate into e.g. the parquet file the column eventually gets written to, which could be very hard to fix later on when downstream readers start noticing the extension type info is unexpectedly missing.
Seems better to make this fallible (with_data_type
becomes try_with_data_type
), and blow up if the new data type is incompatible with the currently-installed extension type?
We could potentially still keep the original (infallible) version and panic if the new type is invalid. Doing that means people upgrading to arrow-57 won't necessarily realize there's a potential problem until their code starts crashing, which I don't love, but I suspect some people would rather deal with a potential panic than have to rewrite all their with_data_type
calls?
Alternatively -- maybe we leave the panicky infallible version in place, but deprecated with an appropriate warning? But people with automated builds won't necessarily see the warning, especially if they already have lots of warnings in their code.
If somebody actually wanted to clear the extension type, it seems much better to require them to clearly state their intent by e.g. field.without_extension_type().with_data_type(...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that this is not a good solution. It would be nice to prevent invalid state via the API of Field
, however, with set_metadata
and metadata_mut
it's already possible to set datatype incompatible extension type information, so either we accept it (and replace the clear calls added in this PR with a warning in the docs) or we reconsider the mutating methods of the Field
API.
Seems better to make this fallible (with_data_type becomes try_with_data_type), and blow up if the new data type is incompatible with the currently-installed extension type?
The problem is you can't check data type compatibility without the actual extension type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think clear documentation warnings are best because (as @mbrobbel says) you don't know what combination is valid without the actual extension type.
Perhaps we can also add some examples of "how to change datatype and still validate extension types" or something to make it easier for downstream users to verify the expected extension types are still valid after modification
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, that is indeed awkward that we don't have any easy way to map from a random extension name string to the actual type :(
I don't think we need both? The current implementation of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if there's a better approach than silently and blindly clearing the extension type info? See comment.
/// Set [`DataType`] of the [`Field`] and returns self. | ||
/// | ||
/// This clears extension type information from the returned field by | ||
/// calling [`Field::without_extension_type`]. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// Set [`DataType`] of the [`Field`] and returns self. | |
/// | |
/// This clears extension type information from the returned field by | |
/// calling [`Field::without_extension_type`]. | |
/// Returns a new field with the requested [`DataType`]. | |
/// | |
/// Extension type information (if any) is not propagated. |
/// Set [`DataType`] of the [`Field`]. | ||
/// | ||
/// This clears extension type information from this field by calling | ||
/// [`Field::clear_extension_type`]. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// Set [`DataType`] of the [`Field`]. | |
/// | |
/// This clears extension type information from this field by calling | |
/// [`Field::clear_extension_type`]. | |
/// Set [`DataType`] of the [`Field`], clearing any extension type information in the process. |
#[inline] | ||
pub fn set_data_type(&mut self, data_type: DataType) { | ||
self.data_type = data_type; | ||
self.clear_extension_type(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If nobody actually uses set_data_type
, can we deprecate and remove it?
It's kind of a weird API surface, especially with the new extension type semantics
/// ``` | ||
pub fn with_data_type(mut self, data_type: DataType) -> Self { | ||
self.set_data_type(data_type); | ||
self.without_extension_type() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change seems really dangerous... it will silently strip away the extension type information, which can propagate into e.g. the parquet file the column eventually gets written to, which could be very hard to fix later on when downstream readers start noticing the extension type info is unexpectedly missing.
Seems better to make this fallible (with_data_type
becomes try_with_data_type
), and blow up if the new data type is incompatible with the currently-installed extension type?
We could potentially still keep the original (infallible) version and panic if the new type is invalid. Doing that means people upgrading to arrow-57 won't necessarily realize there's a potential problem until their code starts crashing, which I don't love, but I suspect some people would rather deal with a potential panic than have to rewrite all their with_data_type
calls?
Alternatively -- maybe we leave the panicky infallible version in place, but deprecated with an appropriate warning? But people with automated builds won't necessarily see the warning, especially if they already have lots of warnings in their code.
If somebody actually wanted to clear the extension type, it seems much better to require them to clearly state their intent by e.g. field.without_extension_type().with_data_type(...)
|
||
//! Extension types. | ||
//! | ||
//! <div class="warning">This module is experimental. There might be breaking changes between minor releases.</div> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🎉
Which issue does this PR close?
extension
types fully supported (not experimental) #8030.Rationale for this change
As discussed in the linked issue we want to remove the
experimental
warning from theextension
module.When adding extension type information via the
Field
methods,ExtensionType::supports_datatype
is used to validate theField
datatype is compatible with the given extension type. When users change theField
datatype, it's not clear if this is still true, so this PR proposes to remove the extension type information when changing datatype information.What changes are included in this PR?
Adds two
Field
methods:Field::clear_extension_type
Field::without_extension_type
These methods are used when changing data types via
Field::set_data_type
andField::with_data_type
.Are these changes tested?
Yes, CI.
Are there any user-facing changes?
Yes,
Field::set_data_type
andField::with_data_type
now clear extension type information from theField
metadata.