-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Remove SchemaAdapter #19345
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove SchemaAdapter #19345
Conversation
e21ca2e to
7d8572a
Compare
|
|
||
| See the [default column values example](https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/custom_data_source/default_column_values.rs) for how to implement a custom `PhysicalExprAdapterFactory`. | ||
|
|
||
| ### `SchemaAdapter` and `SchemaAdapterFactory` completely removed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This new paragraph overlaps with the previous one - ### SchemaAdapterFactory Fully Removed from Parquet
Maybe they should be merged into one ?!
| println!("4. Default values from metadata are cast to proper types at planning time"); | ||
| println!("5. The DefaultPhysicalExprAdapter handles other schema adaptations"); | ||
| println!("\nNote: PhysicalExprAdapter is specifically for filter predicates."); | ||
| println!("For projection columns, different mechanisms handle missing columns."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this line be removed/edited ?
https://github.com/apache/datafusion/pull/19345/changes#diff-dd8ef704e14ac0794362f6dd9b356468e45d3ea33be15708ec2f339b5a0fdb72R67 says that projection expressions (note: expressions vs. columns) is also covered by PhysicalExprAdapter
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @adriangb
I think this PR is good enough to go. I do think it would be worth porting some of the tests as well to .slt files and @martin-g 's suggestion to consolidate the upgrade guide
But I think it is important to close out the expression rewriting changes in DataFusion 52 and get this done
| } | ||
| } | ||
|
|
||
| /// Test reading and filtering a Parquet file where the table schema is flipped (c, b, a) vs. the physical file schema (a, b, c) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are these scenarios covered elsewhere? I feel like (now) we could write these all as .slt tests
Looks like some of it is covered here:
| - `SchemaMapping` struct | ||
| - `DefaultSchemaAdapterFactory` struct | ||
|
|
||
| These types were previously used to adapt record batch schemas during file reading. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is likely to cause non trivial pain for anyone who uses the SchemaAdapter during upgrade
However, I am not sure if leaving the code in but disconnected would be any better.
Thus I think we should go with this PR and we can help with some more writeups when we start testing the upgrade with downstream crates (like delta.rs)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about I leave them in, mark them as deprecated and have them raise a runtime error with a link to the upgrading guide? At least then it doesn't fail silently at runtime. You'd have to ignore the compile time warnings as well.
kosiew
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| fn with_schema_adapter_factory( | ||
| &self, | ||
| _factory: Arc<dyn SchemaAdapterFactory>, | ||
| ) -> Result<Arc<dyn FileSource>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The FileSource trait no longer has with_schema_adapter_factory() and schema_adapter_factory() methods.
For users, this means there's now no way for a custom FileSource to influence schema adaptation behavior at the file-source level. The only knob is at FileScanConfig / ListingTableConfig, which is downstream.
This is a capability reduction that should be called out in migration docs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you give an example of a use case for these trait methods? Who is using a custom FileSource without using FileScanConfig? Why can't they attach the adapter to the concrete type, i.e. why does it have to be part of the trait? As far as I could see these methods were only being used by our own tests.
| /// # fn with_schema_adapter_factory(&self, factory: Arc<dyn SchemaAdapterFactory>) -> Result<Arc<dyn FileSource>> { Ok(Arc::new(Self {table_schema: self.table_schema.clone(), schema_adapter_factory: Some(factory)} )) } | ||
| /// # fn schema_adapter_factory(&self) -> Option<Arc<dyn SchemaAdapterFactory>> { self.schema_adapter_factory.clone() } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a breaking API change that is correct but needs clearer deprecation messaging. If users were relying on with_schema_adapter_factory(), they now have a compile error.
Adding a deprecated attribute that points to the upgrade guide would help users migrate.
| async fn test_parquet_flipped_projection() -> Result<()> { | ||
| // Create test data with columns (a, b, c) - the file schema |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have a replacement end-to-end integration test for column reordering in Parquet scan?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe we have several but I added an slt version in 2f5ff2e
| async fn test_multi_source_schema_adapter_reuse() -> Result<()> { | ||
| // This test verifies that the same schema adapter factory can be reused |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is important for ListingTable.
A test for ListingTable would add assurance that the functionality is retained.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added in 2f5ff2e
ca5b180 to
a39814a
Compare
7f617f9 to
3c62043
Compare
|
I think I've addressed all of the feedback except maybe #19345 (comment). Since I've already had to resolve conflicts a couple times and it seems this would hold up a release I'd like to merge this approved PR. @kosiew is that concern a blocker or can we discuss more and address as a followup? |
fd0d2be to
5b667dc
Compare
kosiew
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM

Closes #16800
We could leave some of these methods around as deprecated and make them no-ops but I'd be afraid that would create a false sense of security (compiles but behaves wrong at runtime).