feat: Make join selection configurable through `enable_*` and `join_method_priority` options #17467
Pull Request Overview
This PR introduces configurable join selection in DataFusion through new configuration options that separate the concepts of "enabling" and "preferring" specific join algorithms. It changes the join planning behavior from a simple boolean toggle to a more flexible system where users can control which join types are available and which are preferred.
Key changes:
- Added new configuration options for enabling and preferring each join type (HashJoin, SortMergeJoin, NestedLoopJoin)
- Moved join selection logic to a dedicated `join_planner.rs` module
- Changed the default and semantics of `prefer_hash_join` from `true` to `false` (breaking change)
Reviewed Changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.
Show a summary per file
File | Description |
---|---|
docs/source/user-guide/configs.md | Updated documentation for new join configuration options |
datafusion/core/src/physical_planner/join_planner.rs | New module containing configurable join selection logic |
datafusion/core/src/physical_planner.rs | Refactored to use new join planner module |
datafusion/common/src/config.rs | Added new configuration options for join enablement and preferences |
Multiple test files | Updated tests to use new configuration semantics |
datafusion/common/src/config.rs
/// HashJoin can work more efficiently than SortMergeJoin but consumes more memory
- pub prefer_hash_join: bool, default = true
+ pub prefer_hash_join: bool, default = false
Changing the default value from `true` to `false` is a breaking change that should be clearly documented in the migration guide. Consider whether this default change aligns with the project's backward compatibility policy.
Suggested change:
- pub prefer_hash_join: bool, default = false
+ pub prefer_hash_join: bool, default = true
It's not a breaking change. By default, no join type is preferred, and the planner prioritizes HJ according to its heuristics; setting it to `true` just overrides the planner's default behavior.
Thank you @2010YOUY01
I think the names of the configuration settings are somewhat confusing. Specifically, what does "prefer_hash_join=true" AND "prefer_sort_merge_join=true" mean? When reading the code I found that the priority is HJ > SMJ > NLJ.
It also seems to overlap somewhat with the "join selection" optimizer: https://github.com/apache/datafusion/blob/main/datafusion/physical-optimizer/src/join_selection.rs
Given the increasing interest in improving joins in DataFusion, I wonder if now is the time to create some space / a structure for more sophisticated join planners instead of making the existing one more complicated. In particular, I think the join algorithm is just one part of a more sophisticated strategy for joins (one that may also reorder joins, for example).
Maybe we could make `JoinPlanner` a trait that can be registered with the SessionContext or the Optimizer the same way as `ExtensionPlanner`s?
Then we can provide a default JoinPlanner (what currently exists) that has its own config namespace, etc.
trait JoinPlanner {
    // plan the initial join when converting from Logical --> Physical join
    fn plan_initial_join(
        session_state: &SessionState,
        physical_left: Arc<dyn ExecutionPlan>,
        physical_right: Arc<dyn ExecutionPlan>,
        join_on: join_utils::JoinOn,
        join_filter: Option<join_utils::JoinFilter>,
        join_type: &JoinType,
        null_equality: &datafusion_common::NullEquality,
    ) -> Arc<dyn ExecutionPlan>;
}
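To make the registration idea concrete, here is a minimal, self-contained sketch of what a pluggable planner might look like. Everything in it is an assumption for illustration: `PhysicalPlan`, `Session`, `with_join_planner`, `DefaultJoinPlanner`, and `NestedLoopOnlyPlanner` are stand-ins rather than DataFusion APIs, the method signature is reduced to a few parameters, and a `&self` receiver is added so the trait can be stored as a trait object.

```rust
use std::sync::Arc;

// Stand-ins for DataFusion types so the sketch compiles on its own.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum JoinType {
    Inner,
    Left,
}

#[derive(Debug, PartialEq)]
enum PhysicalPlan {
    Scan(&'static str),
    HashJoin(Arc<PhysicalPlan>, Arc<PhysicalPlan>, JoinType),
    NestedLoopJoin(Arc<PhysicalPlan>, Arc<PhysicalPlan>, JoinType),
}

/// Simplified variant of the proposed trait. The `&self` receiver is an
/// addition so that implementations can be registered as `Arc<dyn JoinPlanner>`.
trait JoinPlanner {
    fn plan_initial_join(
        &self,
        left: Arc<PhysicalPlan>,
        right: Arc<PhysicalPlan>,
        has_equi_keys: bool,
        join_type: JoinType,
    ) -> Arc<PhysicalPlan>;
}

/// Hypothetical default: hash join when equi-join keys exist, else nested loop.
struct DefaultJoinPlanner;

impl JoinPlanner for DefaultJoinPlanner {
    fn plan_initial_join(
        &self,
        left: Arc<PhysicalPlan>,
        right: Arc<PhysicalPlan>,
        has_equi_keys: bool,
        join_type: JoinType,
    ) -> Arc<PhysicalPlan> {
        if has_equi_keys {
            Arc::new(PhysicalPlan::HashJoin(left, right, join_type))
        } else {
            Arc::new(PhysicalPlan::NestedLoopJoin(left, right, join_type))
        }
    }
}

/// Hypothetical user-supplied planner that always emits a nested loop join.
struct NestedLoopOnlyPlanner;

impl JoinPlanner for NestedLoopOnlyPlanner {
    fn plan_initial_join(
        &self,
        left: Arc<PhysicalPlan>,
        right: Arc<PhysicalPlan>,
        _has_equi_keys: bool,
        join_type: JoinType,
    ) -> Arc<PhysicalPlan> {
        Arc::new(PhysicalPlan::NestedLoopJoin(left, right, join_type))
    }
}

/// Hypothetical session object that owns the registered planner.
struct Session {
    join_planner: Arc<dyn JoinPlanner>,
}

impl Session {
    fn new() -> Self {
        Self { join_planner: Arc::new(DefaultJoinPlanner) }
    }

    /// Replace the default planner, mirroring how extension points are registered.
    fn with_join_planner(mut self, planner: Arc<dyn JoinPlanner>) -> Self {
        self.join_planner = planner;
        self
    }

    fn plan_join(
        &self,
        left: Arc<PhysicalPlan>,
        right: Arc<PhysicalPlan>,
        has_equi_keys: bool,
        join_type: JoinType,
    ) -> Arc<PhysicalPlan> {
        self.join_planner
            .plan_initial_join(left, right, has_equi_keys, join_type)
    }
}
```

With this shape, `Session::new()` keeps the existing behavior, while `Session::new().with_join_planner(Arc::new(NestedLoopOnlyPlanner))` swaps in a different strategy without touching the rest of the planner.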
    preferred.push(Algo::Nlj);
}

// Helper to pick by priority HJ > SMJ > NLJ
This is an important detail (the relative priorities of the different join algorithms) that we should probably communicate to the user somehow.
added in the config manual
/// by checking if they're enabled by options like `datafusion.optimizer.enable_hash_join`
/// - Step 3: Choose one according to the built-in heuristics and also the preference
///   in the configuration, e.g. `datafusion.optimizer.prefer_hash_join`
pub(super) fn plan_join_exec(
Maybe this could be called "plan_initial_join_exec"
done
@alamb Thanks for the suggestions:
> I think that would be better than three distinct settings.
I think this trait should include two major steps from the current implementation:
Now I think it's a bit hard to extract them into a pluggable module, because they seem tightly coupled with other optimizer rules. Perhaps we can give it a try when there are multiple radically different join planning/reordering strategies available; we'll have a better understanding of how this interface should look by then.
Pull Request Overview
Copilot reviewed 15 out of 15 changed files in this pull request and generated 3 comments.
Which issue does this PR close?
Rationale for this change
See the issue for details.
Update: it now uses `join_method_priority` instead of the individual `prefer_*` options proposed in the issue.
set datafusion.optimizer.join_method_priority = 'hj, nlj'
Users can specify a comma-separated list containing any number of the existing join methods. The planner tries them from first to last and picks the first one that is both enabled (through configs like `enable_hash_join`) and applicable to the given join logical plan node.
This PR provides a common framework for physical join type selection. If more join types are added in the future (e.g., `PiecewiseMergeJoin` from @jonathanc-n's great ongoing work), related configurations can be added more easily to control the planning.
What changes are included in this PR?
`physical_planner.rs` into `join_planner.rs`, and update it to follow flags added in step 1. No existing planning logic has changed.
`slt`
`datafusion.optimizer.prefer_hash_join`, and update tests to use the new API `join_method_priority`. For backwards compatibility, when `join_method_priority` is set to an empty string (which is also its default), the `prefer_hash_join` option has the same semantics as before; otherwise, if it's not empty, it overrides `prefer_hash_join`.
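The priority rule described for `join_method_priority` can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's implementation: `JoinMethod`, `EnableFlags`, `parse_priority`, and `pick_join` are hypothetical names, the `smj` abbreviation is assumed (only `hj` and `nlj` appear in the description), and the behavior when no listed method qualifies is left to the caller.

```rust
/// Stand-in for the physical join algorithms discussed in this PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum JoinMethod {
    Hash,
    SortMerge,
    NestedLoop,
}

/// Hypothetical mirror of the `enable_*` flags; not DataFusion's config struct.
struct EnableFlags {
    hash_join: bool,
    sort_merge_join: bool,
    nested_loop_join: bool,
}

impl EnableFlags {
    fn allows(&self, method: JoinMethod) -> bool {
        match method {
            JoinMethod::Hash => self.hash_join,
            JoinMethod::SortMerge => self.sort_merge_join,
            JoinMethod::NestedLoop => self.nested_loop_join,
        }
    }
}

/// Parse a list such as "hj, nlj". The "smj" abbreviation is an assumption.
/// An empty string yields an empty list.
fn parse_priority(s: &str) -> Result<Vec<JoinMethod>, String> {
    s.split(',')
        .map(str::trim)
        .filter(|token| !token.is_empty())
        .map(|token| match token.to_ascii_lowercase().as_str() {
            "hj" => Ok(JoinMethod::Hash),
            "smj" => Ok(JoinMethod::SortMerge),
            "nlj" => Ok(JoinMethod::NestedLoop),
            other => Err(format!("unknown join method: {other}")),
        })
        .collect()
}

/// Return the first method in `priority` that is both enabled and applicable
/// to the join being planned, or `None` if no listed method qualifies.
fn pick_join(
    priority: &[JoinMethod],
    flags: &EnableFlags,
    applicable: &[JoinMethod],
) -> Option<JoinMethod> {
    priority
        .iter()
        .copied()
        .find(|m| flags.allows(*m) && applicable.contains(m))
}
```

In this sketch an empty priority list returns `None`, which is where a caller would fall back to the legacy `prefer_hash_join` behavior described above.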
Are these changes tested?
Yes, see above.
Are there any user-facing changes?
When `join_method_priority` is set, `datafusion.optimizer.prefer_hash_join` will be ignored.