@2010YOUY01 2010YOUY01 commented Sep 8, 2025

Which issue does this PR close?

Rationale for this change

See the issue for details.

Update: this PR now uses join_method_priority instead of the individual prefer_* options described in the issue.

set datafusion.optimizer.join_method_priority = 'hj, nlj'
Users can specify a comma-separated list containing any number of the existing join methods. The planner tries them from first to last and picks the first one that is both enabled (through configs such as enable_hash_join) and applicable to the given join logical plan node.
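The selection rule described above can be sketched in a few lines. This is a minimal, self-contained illustration, not the code in this PR: `JoinMethod`, `parse_priority`, and `pick_join_method` are hypothetical names, and the `is_enabled` / `is_applicable` closures stand in for the real config lookups and plan checks.

```rust
/// Hypothetical stand-in for the physical join methods named in this PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum JoinMethod {
    HashJoin,
    SortMergeJoin,
    NestedLoopJoin,
}

/// Parse a list such as "hj, nlj" into join methods, preserving the given order.
/// Empty entries are skipped; the first unrecognized entry produces an error.
pub fn parse_priority(list: &str) -> Result<Vec<JoinMethod>, String> {
    list.split(',')
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .map(|s| match s.to_ascii_lowercase().as_str() {
            "hj" => Ok(JoinMethod::HashJoin),
            "smj" => Ok(JoinMethod::SortMergeJoin),
            "nlj" => Ok(JoinMethod::NestedLoopJoin),
            other => Err(format!("unknown join method: {other}")),
        })
        .collect()
}

/// Return the first method in `priority` that is both enabled and applicable.
/// Both predicates are assumptions standing in for config and plan inspection.
pub fn pick_join_method(
    priority: &[JoinMethod],
    is_enabled: impl Fn(JoinMethod) -> bool,
    is_applicable: impl Fn(JoinMethod) -> bool,
) -> Option<JoinMethod> {
    priority
        .iter()
        .copied()
        .find(|m| is_enabled(*m) && is_applicable(*m))
}
```

With the priority 'hj, nlj' and hash join disabled, this sketch falls through to the nested loop join; when nothing in the list qualifies it returns `None`, leaving the caller to decide on a fallback.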

This PR provides a common framework for physical join type selection. If more join types are added in the future (e.g., PiecewiseMergeJoin from @jonathanc-n's great ongoing work), related configurations can be added more easily to control the planning.

What changes are included in this PR?

  1. Add configuration options to make join type selection configurable.
  2. Move the join type selection logic from physical_planner.rs into join_planner.rs, and update it to follow flags added in step 1. No existing planning logic has changed.
  3. Test different flag combinations through an slt (sqllogictest) file.
  4. Deprecate datafusion.optimizer.prefer_hash_join, and update tests to use the new join_method_priority option. For backwards compatibility, when join_method_priority is an empty string (its default), prefer_hash_join keeps its previous semantics; when join_method_priority is non-empty, it overrides prefer_hash_join.
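The backwards-compatibility rule in item 4 can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: `JoinOptions`, `Preference`, and `effective_preference` are hypothetical names, and treating a whitespace-only value the same as an empty string is an assumption of this sketch.

```rust
/// Hypothetical mirror of the two options discussed in this PR.
pub struct JoinOptions {
    pub join_method_priority: String,
    pub prefer_hash_join: bool,
}

/// Which setting ends up driving join method selection.
#[derive(Debug, PartialEq, Eq)]
pub enum Preference {
    /// Explicit ordered list taken from `join_method_priority`.
    Priority(Vec<String>),
    /// Legacy behaviour driven by the deprecated `prefer_hash_join`.
    Legacy { prefer_hash_join: bool },
}

/// A non-empty priority list overrides `prefer_hash_join`;
/// an empty list leaves the legacy flag in charge.
pub fn effective_preference(opts: &JoinOptions) -> Preference {
    let entries: Vec<String> = opts
        .join_method_priority
        .split(',')
        .map(|s| s.trim().to_ascii_lowercase())
        .filter(|s| !s.is_empty())
        .collect();
    if entries.is_empty() {
        Preference::Legacy {
            prefer_hash_join: opts.prefer_hash_join,
        }
    } else {
        Preference::Priority(entries)
    }
}
```

Modelling the outcome as an enum keeps the two code paths explicit, so a later removal of the deprecated flag would only touch the `Legacy` variant.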

Are these changes tested?

Yes, see above.

Are there any user-facing changes?

When join_method_priority is set to a non-empty value, datafusion.optimizer.prefer_hash_join is ignored.

@github-actions github-actions bot added the documentation, core, sqllogictest, common, and proto labels Sep 8, 2025
@2010YOUY01 2010YOUY01 requested a review from Copilot September 8, 2025 06:04
@Copilot Copilot AI left a comment

Pull Request Overview

This PR introduces configurable join selection in DataFusion through new configuration options that separate the concepts of "enabling" and "preferring" specific join algorithms. It changes the join planning behavior from a simple boolean toggle to a more flexible system where users can control which join types are available and which are preferred.

Key changes:

  • Added new configuration options for enabling and preferring each join type (HashJoin, SortMergeJoin, NestedLoopJoin)
  • Moved join selection logic to a dedicated join_planner.rs module
  • Changed the default and semantics of prefer_hash_join from true to false (breaking change)

Reviewed Changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| docs/source/user-guide/configs.md | Updated documentation for new join configuration options |
| datafusion/core/src/physical_planner/join_planner.rs | New module containing configurable join selection logic |
| datafusion/core/src/physical_planner.rs | Refactored to use new join planner module |
| datafusion/common/src/config.rs | Added new configuration options for join enablement and preferences |
| Multiple test files | Updated tests to use new configuration semantics |

  /// HashJoin can work more efficiently than SortMergeJoin but consumes more memory
- pub prefer_hash_join: bool, default = true
+ pub prefer_hash_join: bool, default = false
Copilot AI Sep 8, 2025

Changing the default value from true to false is a breaking change that should be clearly documented in the migration guide. Consider if this default change aligns with the project's backward compatibility policy.

Suggested change
- pub prefer_hash_join: bool, default = false
+ pub prefer_hash_join: bool, default = true

Contributor Author

It's not a breaking change. By default, no join type is preferred, and the planner prioritizes HJ according to its heuristics. Setting prefer_hash_join to true just overrides the planner's default behavior.

@alamb alamb left a comment

Thank you @2010YOUY01

I think the names of the configuration settings are somewhat confusing. Specifically, what does it mean to set both "prefer_hash_join=true" AND "prefer_sort_merge_join=true"? When reading the code I found that the priority is HJ > SMJ > NLJ.

It also seems somewhat overlapping with the "join selection" optimizer: https://github.com/apache/datafusion/blob/main/datafusion/physical-optimizer/src/join_selection.rs

Given the increasing interest in improving Joins in DataFusion, I wonder if now is the time to create some space / a structure for more sophisticated join planners instead of making the existing one more complicated. In particular, I think the join algorithm is just one part of a more sophisticated strategy for joins (one that may also reorder joins, for example).

Maybe we could make JoinPlanner a trait that can be registered with the SessionContext or the Optimizer the same way as ExtensionPlanners?

Then we can provide a default JoinPlanner (what currently exists) that has its own config namespace, etc

trait JoinPlanner {
    // Plan the initial join when converting a logical join into a physical join.
    // Takes `&self` so that implementations can be registered as trait objects.
    fn plan_initial_join(
        &self,
        session_state: &SessionState,
        physical_left: Arc<dyn ExecutionPlan>,
        physical_right: Arc<dyn ExecutionPlan>,
        join_on: join_utils::JoinOn,
        join_filter: Option<join_utils::JoinFilter>,
        join_type: &JoinType,
        null_equality: &datafusion_common::NullEquality,
    ) -> Arc<dyn ExecutionPlan>;
}
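
To make the registration idea concrete, here is a minimal, self-contained sketch of a pluggable planner held by a session-like object. Everything in it is a toy stand-in: `ToyPlan`, `ToyJoinPlanner`, `ToySession`, and both planner structs are hypothetical and are not DataFusion types. The default rule (hash join when equi-join keys exist, otherwise nested loop join) is a deliberate simplification of the existing planning logic.

```rust
use std::sync::Arc;

/// Toy stand-in for a physical plan node; NOT DataFusion's `ExecutionPlan`.
#[derive(Debug, PartialEq, Eq)]
pub enum ToyPlan {
    Scan(&'static str),
    HashJoin(Box<ToyPlan>, Box<ToyPlan>),
    NestedLoopJoin(Box<ToyPlan>, Box<ToyPlan>),
}

/// Hypothetical pluggable planner, loosely modelled on the proposed trait.
pub trait ToyJoinPlanner: Send + Sync {
    fn plan_initial_join(&self, left: ToyPlan, right: ToyPlan, has_equi_keys: bool) -> ToyPlan;
}

/// Default planner: hash join when equi-join keys exist, else nested loop join.
pub struct DefaultToyJoinPlanner;

impl ToyJoinPlanner for DefaultToyJoinPlanner {
    fn plan_initial_join(&self, left: ToyPlan, right: ToyPlan, has_equi_keys: bool) -> ToyPlan {
        if has_equi_keys {
            ToyPlan::HashJoin(Box::new(left), Box::new(right))
        } else {
            ToyPlan::NestedLoopJoin(Box::new(left), Box::new(right))
        }
    }
}

/// Alternative planner that always picks the nested loop join.
pub struct AlwaysNestedLoop;

impl ToyJoinPlanner for AlwaysNestedLoop {
    fn plan_initial_join(&self, left: ToyPlan, right: ToyPlan, _has_equi_keys: bool) -> ToyPlan {
        ToyPlan::NestedLoopJoin(Box::new(left), Box::new(right))
    }
}

/// Toy session that owns a replaceable join planner.
pub struct ToySession {
    join_planner: Arc<dyn ToyJoinPlanner>,
}

impl ToySession {
    /// Start with the default planner registered.
    pub fn new() -> Self {
        Self { join_planner: Arc::new(DefaultToyJoinPlanner) }
    }

    /// Swap in a user-provided planner.
    pub fn with_join_planner(mut self, planner: Arc<dyn ToyJoinPlanner>) -> Self {
        self.join_planner = planner;
        self
    }

    /// Delegate join planning to whichever planner is registered.
    pub fn plan_join(&self, left: ToyPlan, right: ToyPlan, has_equi_keys: bool) -> ToyPlan {
        self.join_planner.plan_initial_join(left, right, has_equi_keys)
    }
}
```

Holding the planner behind `Arc<dyn ...>` is what requires the method to take `&self`; a real design would also need to return a `Result` and thread through session state.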


preferred.push(Algo::Nlj);
}

// Helper to pick by priority HJ > SMJ > NLJ
Contributor

This is an important detail (the relative priorities of the different join algorithms) that we should probably communicate to the user somehow.

Contributor Author

Added to the config manual.

/// by checking if they're enabled by options like `datafusion.optimizer.enable_hash_join`
/// - Step 3: Choose one according to the built-in heuristics and also the preference
/// in the configuration, e.g. `datafusion.optimizer.prefer_hash_join`
pub(super) fn plan_join_exec(
Contributor

Maybe this could be called "plan_initial_join_exec"

Contributor Author

done

@2010YOUY01
Contributor Author

@alamb Thanks for the suggestions:

  • Regarding the ambiguous prefer_* settings, what do you think about using a single config instead: join_method_priority = 'hj, smj, nlj'?
  • I'll think more about this JoinPlanner idea later. I think making join ordering configurable is important too. What I'm planning to do next is to add a configuration option to disable the built-in join reordering, as the simplest join order option. Let me think about how to incorporate that into JoinPlanner.

@alamb
Contributor

alamb commented Sep 10, 2025

> • Regarding the ambiguous prefer_* setting, WDYT using a single config instead: join_method_priority = 'hj, smj, nlj' instead?

I think that would be better than three distinct settings.

@2010YOUY01 2010YOUY01 marked this pull request as draft September 11, 2025 10:12
@2010YOUY01 2010YOUY01 changed the title from "feat: Make join selection configurable through enable_* and prefer_* options" to "feat: Make join selection configurable through enable_* and join_method_priority* options" Sep 13, 2025
@2010YOUY01 2010YOUY01 changed the title from "feat: Make join selection configurable through enable_* and join_method_priority* options" to "feat: Make join selection configurable through enable_* and join_method_priority options" Sep 13, 2025
@2010YOUY01 2010YOUY01 marked this pull request as ready for review September 13, 2025 15:24
@2010YOUY01
Contributor Author

> Given the increasing interest in improving Joins in DataFusion, I wonder if now is the time to create some space / a structure for more sophisticated join planners instead of making the existing one more complicated. In particular, I think the join algorithm is just one part of a more sophisticated strategy for joins (that also may reorders joins, for example)
>
> Maybe we could make JoinPlanner a trait that can be registered with the SessionContext or the Optimizer the same way as ExtensionPLanners?
>
> Then we can provide a default JoinPlanner (what currently exists) that has its own config namespace, etc
>
> trait JoinPlanner {
>   // plan the initial join when converting from Logical --> Physical join
>   fn plan_initial_join(
>     session_state: &SessionState,
>     physical_left: Arc<dyn ExecutionPlan>,
>     physical_right: Arc<dyn ExecutionPlan>,
>     join_on: join_utils::JoinOn,
>     join_filter: Option<join_utils::JoinFilter>,
>     join_type: &JoinType,
>     null_equality: &datafusion_common::NullEquality,) -> Arc<dyn ExecutionPlan>;
> }

I think this trait should cover two major steps from the current implementation:

  1. The plan_initial_join method: it converts the logical join plan into the initial physical plan, deciding which join method (e.g. hash join or NLJ) to use for each join node.
  2. The JoinSelection optimizer rule: it further refines the initial physical plan (e.g. changes the repartition strategy inside the hash join's initial physical plan) and also swaps join inputs according to statistics.

For now I think it's a bit hard to extract them into a pluggable module, because they seem tightly coupled with other optimizer rules. Perhaps we can give it a try once multiple radically different join planning/reordering strategies are available; we'll have a better understanding of how this interface should look by then.
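
The two steps listed above can be illustrated with a toy model. This is a sketch under stated assumptions, not DataFusion code: `ToyInput`, `ToyMethod`, `ToyJoin`, and both functions are hypothetical, the row estimates stand in for real statistics, and the refinement ignores join types (a real swap would also have to adjust the join type for non-inner joins). Putting the smaller input on the left is an assumption about which side is cheaper to build from.

```rust
/// Toy join input with an optional row-count estimate (stand-in for statistics).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ToyInput {
    pub name: &'static str,
    pub estimated_rows: Option<usize>,
}

/// Toy join method; only the two methods mentioned in the step list.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ToyMethod {
    HashJoin,
    NestedLoopJoin,
}

/// Toy physical join node.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ToyJoin {
    pub method: ToyMethod,
    pub left: ToyInput,
    pub right: ToyInput,
}

/// Step 1: pick the join method only, keeping the inputs in their given order.
pub fn plan_initial_join(left: ToyInput, right: ToyInput, has_equi_keys: bool) -> ToyJoin {
    let method = if has_equi_keys {
        ToyMethod::HashJoin
    } else {
        ToyMethod::NestedLoopJoin
    };
    ToyJoin { method, left, right }
}

/// Step 2: refine the initial plan by moving the smaller input to the left,
/// but only when both row estimates are known.
pub fn refine_join(join: ToyJoin) -> ToyJoin {
    match (join.left.estimated_rows, join.right.estimated_rows) {
        (Some(l), Some(r)) if r < l => ToyJoin {
            method: join.method,
            left: join.right,
            right: join.left,
        },
        _ => join,
    }
}
```

Even in this toy form the second step depends on the shape produced by the first, which hints at why the two are hard to separate behind one interface.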

@2010YOUY01 2010YOUY01 requested a review from Copilot September 13, 2025 15:34
@Copilot Copilot AI left a comment

Pull Request Overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated 3 comments.


Successfully merging this pull request may close these issues.

Add configuration to choose specific join implementation