Auto detect hive column partitioning with ListingTableFactory / CREATE EXTERNAL TABLE (#17232)
* Fix: ListingTableFactory Hive column detection
- Fixes an issue in the ListingTableFactory where Hive partition columns
are not detected and incorporated into the table schema when the user
has not set an explicit schema
- Fixes an issue where subdirectories that do not follow Hive
formatting (e.g. key=value) could be erroneously interpreted as
contributing to the table schema
* Adds configuration, tests, and docs
- Adds a configuration option to enable or disable Hive partition
schema inference
- Adds configuration option documentation and unit tests
- Adds additional sqllogic tests specifically targeting partitioned
listing tables
- Adds user guide docs for migration and external table behavior for
both the CLI and DDL guides
* Fix merge problem
* Update slt test
* Make upgrade guide more concise
* Fixes spelling and doc table reference issues
---------
Co-authored-by: Andrew Lamb <[email protected]>
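
To illustrate what "Hive formatting (e.g. key=value)" means for the directory check described in the commit message, here is a minimal sketch using only the Rust standard library. It is not the code from this PR: the function name `hive_partitions` is hypothetical, and treating an empty key or value as non-compliant is an assumption made for the example.

```rust
/// Hypothetical helper (not DataFusion's implementation): split a path that is
/// relative to the table root into Hive-style `key=value` partition pairs.
/// Returns `None` if the path is empty or if any directory segment is not of
/// the form `key=value`, so such directories never contribute columns.
fn hive_partitions(relative_path: &str) -> Option<Vec<(String, String)>> {
    // Break the path into non-empty segments.
    let segments: Vec<&str> = relative_path
        .split('/')
        .filter(|s| !s.is_empty())
        .collect();

    // The last segment is the data file; everything before it is a directory.
    let (_file, dirs) = segments.split_last()?;

    let mut pairs = Vec::with_capacity(dirs.len());
    for dir in dirs {
        // A compliant directory has exactly the shape `key=value`.
        let (key, value) = dir.split_once('=')?;
        // Assumption for this sketch: empty keys or values are rejected.
        if key.is_empty() || value.is_empty() {
            return None;
        }
        pairs.push((key.to_string(), value.to_string()));
    }
    Some(pairs)
}

fn main() {
    // Every directory is `key=value`, so both pairs are reported.
    println!("{:?}", hive_partitions("year=2021/month=01/data.parquet"));
    // `archive` and `2021` are plain directories, so nothing is inferred.
    println!("{:?}", hive_partitions("archive/2021/data.parquet"));
}
```

Under these assumptions, a path such as `year=2021/month=01/data.parquet` would yield the columns `year` and `month`, while a layout with plain directory names yields no partition columns at all.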
@@ -334,6 +335,7 @@ datafusion.execution.collect_statistics true Should DataFusion collect statistic
datafusion.execution.enable_recursive_ctes true Should DataFusion support recursive CTEs
datafusion.execution.enforce_batch_size_in_joins false Should DataFusion enforce batch size in joins or not. By default, DataFusion will not enforce batch size in joins. Enforcing batch size in joins can reduce memory usage when joining large tables with a highly-selective join filter, but is also slightly slower.
datafusion.execution.keep_partition_by_columns false Should DataFusion keep the columns used for partition_by in the output RecordBatches
+datafusion.execution.listing_table_factory_infer_partitions true Should a `ListingTable` created through the `ListingTableFactory` infer table partitions from Hive compliant directories. Defaults to true (partition columns are inferred and will be represented in the table schema).
datafusion.execution.listing_table_ignore_subdirectory true Should sub directories be ignored when scanning directories for data files. Defaults to true (ignores subdirectories), consistent with Hive. Note that this setting does not affect reading partitioned tables (e.g. `/table/year=2021/month=01/data.parquet`).
datafusion.execution.max_buffered_batches_per_output_file 2 This is the maximum number of RecordBatches buffered for each output file being worked. Higher values can potentially give faster write performance at the cost of higher peak memory consumption
datafusion.execution.meta_fetch_concurrency 32 Number of files to read in parallel when inferring schema and statistics
docs/source/library-user-guide/upgrading.md (11 additions & 0 deletions)
@@ -24,6 +24,17 @@
**Note:** DataFusion `50.0.0` has not been released yet. The information provided in this section pertains to features and changes that have already been merged to the main branch and are awaiting release in this version.
You can see the current [status of the `50.0.0` release here](https://github.com/apache/datafusion/issues/16799)