feat: Make join selection configurable through `enable_*` and `join_method_priority` options #17467

2010YOUY01 · 2025-09-08T06:03:51Z

Which issue does this PR close?

Closes Add configuration to choose specific join implementation #17432

Rationale for this change

See the issue for details.

Update: now it's using join_method_priority instead of individual prefer_* options in the issue.

set datafusion.optimizer.join_method_priority = 'hj, nlj'
User can specify a comma separated list, with any number of existing join method, and the planner will try from the first one to the last, and finally pick the first one that is both enabled (through config like enable_hash_join), and also applicable for the given join logical plan node.

This PR provides a common framework for physical join type selection. If more join types are added in the future (e.g., PiecewiseMergeJoin from @jonathanc-n 's great ongoing work), related configurations can be added more easily to control the planning.

What changes are included in this PR?

Add configuration options to make join type selection configurable.
Move the join type selection logic from physical_planner.rs into join_planner.rs, and update it to follow flags added in step 1. No existing planning logic has changed.
Tested different flag combinations through a slt
Deprecate datafusion.optimizer.prefer_hash_join, and update tests to use the new API join_method_priority. For backwards compatibility, when join_method_priority is set to empty string (it defaults to empty also), prefer_hash_join option has the same semantics as before; otherwise if it's not empty, it overrides prefer_hash_join

Are these changes tested?

Yes, see above.

Are there any user-facing changes?

when join_method_priority is set, datafusion.optimizer.prefer_hash_join will be ignored.

…tions

Copilot

Pull Request Overview

This PR introduces configurable join selection in DataFusion through new configuration options that separate the concepts of "enabling" and "preferring" specific join algorithms. It changes the join planning behavior from a simple boolean toggle to a more flexible system where users can control which join types are available and which are preferred.

Key changes:

Added new configuration options for enabling and preferring each join type (HashJoin, SortMergeJoin, NestedLoopJoin)
Moved join selection logic to a dedicated join_planner.rs module
Changed the default and semantics of prefer_hash_join from true to false (breaking change)

Reviewed Changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
docs/source/user-guide/configs.md	Updated documentation for new join configuration options
datafusion/core/src/physical_planner/join_planner.rs	New module containing configurable join selection logic
datafusion/core/src/physical_planner.rs	Refactored to use new join planner module
datafusion/common/src/config.rs	Added new configuration options for join enablement and preferences
Multiple test files	Updated tests to use new configuration semantics

_{Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.}

datafusion/core/src/physical_planner/join_planner.rs

Copilot · 2025-09-08T06:04:45Z

datafusion/common/src/config.rs

        /// HashJoin can work more efficiently than SortMergeJoin but consumes more memory
-        pub prefer_hash_join: bool, default = true
+        pub prefer_hash_join: bool, default = false


Changing the default value from true to false is a breaking change that should be clearly documented in the migration guide. Consider if this default change aligns with the project's backward compatibility policy.

Suggested change

pub prefer_hash_join: bool, default = false

pub prefer_hash_join: bool, default = true

It's not a breaking change. By default there is no join types preferred, and the planner will prioritize HJ according to heuristics, setting it to true will just override the planner's default behavior.

Co-authored-by: Copilot <[email protected]>

alamb

Thank you @2010YOUY01

I think the names configuration settings are somewhat confusing. Specifically what does "prefer_hash_join=true" AND "prefer_sort_merge_join=true" mean? when reading the code I found that HJ > SMJ > NLJ

It also seems somewhat overlapping with the "join selection" optimizer: https://github.com/apache/datafusion/blob/main/datafusion/physical-optimizer/src/join_selection.rs

Given the increasing interest in improving Joins in DataFusion, I wonder if now is the time to create some space / a structure for more sophisticated join planners instead of making the existing one more complicated. In particular, I think the join algorithm is just one part of a more sophisticated strategy for joins (that also may reorders joins, for example)

Maybe we could make JoinPlanner a trait that can be registered with the SessionContext or the Optimizer the same way as ExtensionPLanners?

Then we can provide a default JoinPlanner (what currently exists) that has its own config namespace, etc

trait JoinPlanner {
  // plan the initial join when converting from Logical --> Physical join
  fn plan_initial_join(
    session_state: &SessionState,
    physical_left: Arc<dyn ExecutionPlan>,
    physical_right: Arc<dyn ExecutionPlan>,
    join_on: join_utils::JoinOn,
    join_filter: Option<join_utils::JoinFilter>,
    join_type: &JoinType,
    null_equality: &datafusion_common::NullEquality,) -> Arc<dyn ExecutionPlan>;
}

alamb · 2025-09-09T13:17:20Z

datafusion/core/src/physical_planner/join_planner.rs

+        preferred.push(Algo::Nlj);
+    }
+
+    // Helper to pick by priority HJ > SMJ > NLJ


this is an important detail (the relative priorities) of the different join algorithms that we should probably communicate to the user somehow

added in the config manual

alamb · 2025-09-09T13:24:12Z

datafusion/core/src/physical_planner/join_planner.rs

+///   by checking if they're enabled by options like `datafusion.optimizer.enable_hash_join`
+/// - Step 3: Choose one according to the built-in heuristics and also the preference
+///   in the configuration, e.g. `datafusion.optimizer.prefer_hash_join`
+pub(super) fn plan_join_exec(


Maybe this could be called "plan_initial_join_exec"

2010YOUY01 · 2025-09-09T16:48:02Z

@alamb Thanks for the suggestions:

Regarding the ambiguous prefer_* setting, WDYT using a single config instead: join_method_priority = 'hj, smj, nlj' instead?
I'll think more about this JoinPlanner idea later. I think making join ordering configurable is important too. What I’m planning to do next is to add a configuration option to disable the built-in join reordering, as the simplest join order option . Let me think about how to incorporate that into JoinPlanner.

alamb · 2025-09-10T20:09:31Z

Regarding the ambiguous prefer_* setting, WDYT using a single config instead: join_method_priority = 'hj, smj, nlj' instead?

I think that would be better than three distinct settings.

2010YOUY01 · 2025-09-13T15:32:47Z

Given the increasing interest in improving Joins in DataFusion, I wonder if now is the time to create some space / a structure for more sophisticated join planners instead of making the existing one more complicated. In particular, I think the join algorithm is just one part of a more sophisticated strategy for joins (that also may reorders joins, for example)

Maybe we could make JoinPlanner a trait that can be registered with the SessionContext or the Optimizer the same way as ExtensionPLanners?

Then we can provide a default JoinPlanner (what currently exists) that has its own config namespace, etc
trait JoinPlanner {
  // plan the initial join when converting from Logical --> Physical join
  fn plan_initial_join(
    session_state: &SessionState,
    physical_left: Arc<dyn ExecutionPlan>,
    physical_right: Arc<dyn ExecutionPlan>,
    join_on: join_utils::JoinOn,
    join_filter: Option<join_utils::JoinFilter>,
    join_type: &JoinType,
    null_equality: &datafusion_common::NullEquality,) -> Arc<dyn ExecutionPlan>;
}

I think this trait should include two major steps from the current implementation:

plan_initia_join method: it converts the join logical plan to initial physical plan, which decides which join method to use (hash join or NLJ) for join nodes
JoinSelection optimizer rule: it further refine the initial physical plan (e.g. change the repartition strategy inside Hash Join's initial physical plan), and also swaps join inputs according to stats

Now I think it's a bit hard to extract them into a pluggable module, because they seem tightly coupled with other optimizer rules. Perhaps we can give it a try when there are multiple radically different join planning/reordering strategies available — we'll have a better understanding of how this interface should look by then.

Copilot

Pull Request Overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated 3 comments.

_{Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.}

datafusion/common/src/config.rs

datafusion/core/src/physical_planner/join_planner.rs

2010YOUY01 added 2 commits September 8, 2025 13:54

Make join selection configurable through enable_* and prefer_* op…

9b7ac55

…tions

clean up

bde1a65

github-actions bot added documentation Improvements or additions to documentation core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) common Related to common crate proto Related to proto crate labels Sep 8, 2025

2010YOUY01 requested a review from Copilot September 8, 2025 06:04

Copilot AI reviewed Sep 8, 2025

View reviewed changes

2010YOUY01 and others added 4 commits September 8, 2025 14:05

Update datafusion/core/src/physical_planner/join_planner.rs

521e51b

Co-authored-by: Copilot <[email protected]>

fix clippy

abe8e2a

fix test

941be55

fix fmt

6043716

alamb reviewed Sep 9, 2025

View reviewed changes

2010YOUY01 marked this pull request as draft September 11, 2025 10:12

2010YOUY01 mentioned this pull request Sep 11, 2025

feat: ClassicJoin for PWMJ #17482

Open

2010YOUY01 added 6 commits September 13, 2025 19:55

use join_method_priority instead

3bf325b

fix typo

498077f

clean up

73a9cd5

fix test

71e0f90

fix test

3a1471f

review

00f6dbb

2010YOUY01 changed the title ~~feat: Make join selection configurable through enable_* and prefer_* options~~ feat: Make join selection configurable through enable_* and join_method_priority* options Sep 13, 2025

2010YOUY01 changed the title ~~feat: Make join selection configurable through enable_* and join_method_priority* options~~ feat: Make join selection configurable through enable_* and join_method_priority options Sep 13, 2025

2010YOUY01 marked this pull request as ready for review September 13, 2025 15:24

2010YOUY01 requested a review from Copilot September 13, 2025 15:34

Copilot AI reviewed Sep 13, 2025

View reviewed changes

datafusion/common/src/config.rs Show resolved Hide resolved

datafusion/core/src/physical_planner/join_planner.rs Show resolved Hide resolved

datafusion/core/src/physical_planner/join_planner.rs Show resolved Hide resolved

2010YOUY01 added 3 commits September 14, 2025 16:39

bot review

b659c61

revert suggestion from AI -- it's not true

2323ef7

fix

415dac5

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: Make join selection configurable through `enable_*` and `join_method_priority` options #17467

feat: Make join selection configurable through `enable_*` and `join_method_priority` options #17467

Uh oh!

2010YOUY01 commented Sep 8, 2025 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Copilot AI Sep 8, 2025

Uh oh!

2010YOUY01 Sep 8, 2025

Uh oh!

alamb left a comment

Uh oh!

alamb Sep 9, 2025

Uh oh!

2010YOUY01 Sep 13, 2025

Uh oh!

alamb Sep 9, 2025

Uh oh!

2010YOUY01 Sep 13, 2025

Uh oh!

2010YOUY01 commented Sep 9, 2025

Uh oh!

alamb commented Sep 10, 2025

Uh oh!

2010YOUY01 commented Sep 13, 2025

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

	pub prefer_hash_join: bool, default = false
	pub prefer_hash_join: bool, default = true

feat: Make join selection configurable through enable_* and join_method_priority options #17467

Are you sure you want to change the base?

feat: Make join selection configurable through enable_* and join_method_priority options #17467

Uh oh!

Conversation

2010YOUY01 commented Sep 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Uh oh!

Copilot AI Sep 8, 2025

Choose a reason for hiding this comment

Uh oh!

2010YOUY01 Sep 8, 2025

Choose a reason for hiding this comment

Uh oh!

alamb left a comment

Choose a reason for hiding this comment

Uh oh!

alamb Sep 9, 2025

Choose a reason for hiding this comment

Uh oh!

2010YOUY01 Sep 13, 2025

Choose a reason for hiding this comment

Uh oh!

alamb Sep 9, 2025

Choose a reason for hiding this comment

Uh oh!

2010YOUY01 Sep 13, 2025

Choose a reason for hiding this comment

Uh oh!

2010YOUY01 commented Sep 9, 2025

Uh oh!

alamb commented Sep 10, 2025

Uh oh!

2010YOUY01 commented Sep 13, 2025

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

feat: Make join selection configurable through `enable_*` and `join_method_priority` options #17467

feat: Make join selection configurable through `enable_*` and `join_method_priority` options #17467

2010YOUY01 commented Sep 8, 2025 •

edited

Loading