[SPARK-52873][SQL] Further restrict when SHJ semi/anti join can ignore duplicate keys on the build side #52067

bersprockets · 2025-08-18T19:09:29Z

What changes were proposed in this pull request?

After e861b0d, shuffle hash join for left semi/anti/existence will ignore duplicate keys if the join condition is empty or refers to the same parent attributes as the join keys. This PR proposes that duplicate keys should be ignored only when the join condition has these properties:

it contains a subtree that is a semantic match to a build-side join key, and/or
all attributes outside of any subtree that is a semantic match to a build-side join key come from the stream side.

Why are the changes needed?

e861b0d causes a correctness issue when a column is transformed in the build-side join keys and also transformed, but differently, in a join condition. As an example:

create or replace temp view data(a) as values
("xxxx1111"),
("yyyy2222");

create or replace temp view lookup(k) as values
("xxxx22"),
("xxxx33"),
("xxxx11");

-- this returns one row
select *
from data
left semi join lookup
on substring(a, 1, 4) = substring(k, 1, 4)
and substring(a, 1, 6) >= k;

-- this is the same query as above, but with a shuffle hash join hint, and returns no rows
select /*+ SHUFFLE_HASH(lookup) */ *
from data
left semi join lookup
on substring(a, 1, 4) = substring(k, 1, 4)
and substring(a, 1, 6) >= k;

When the join uses broadcast hash join, the hash relation of lookup has the following key -> values:

Key xxxx:
  xxxx11
  xxxx33
  xxxx22

The join condition matches on the build side row with the value xxxx11.

When the join uses shuffle hash join, on the other hand, the hash relation of lookup has the following key -> values:

Key xxxx:
  xxxx22

Because the keys must be unique, an arbitrary row is chosen to represent the key, and that row does not match the join condition.
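To make the difference concrete, here is a minimal plain-Scala sketch (no Spark involved) of the two hash relations described above. The object and method names are hypothetical, and keeping the first row seen per key is an assumption made for illustration; the row that actually survives is arbitrary.

```scala
// Plain-Scala model of the semi join in the string example above.
object DedupBuildSideDemo {
  val data: Seq[String] = Seq("xxxx1111", "yyyy2222")
  val lookup: Seq[String] = Seq("xxxx22", "xxxx33", "xxxx11")

  // Join key on both sides: substring(x, 1, 4).
  def joinKey(s: String): String = s.take(4)

  // Extra join condition: substring(a, 1, 6) >= k.
  def cond(a: String, k: String): Boolean = a.take(6).compareTo(k) >= 0

  // Relation that keeps every build-side row for a key.
  val allRows: Map[String, Seq[String]] = lookup.groupBy(joinKey)

  // Relation that keeps a single row per key. Keeping the first row seen is
  // an assumption for this sketch; the real choice is arbitrary.
  val oneRowPerKey: Map[String, String] =
    lookup.foldLeft(Map.empty[String, String]) { (m, k) =>
      if (m.contains(joinKey(k))) m else m + (joinKey(k) -> k)
    }

  // Semi join that checks the condition against every row for the key.
  def semiJoinAllRows: Seq[String] =
    data.filter(a => allRows.getOrElse(joinKey(a), Seq.empty).exists(k => cond(a, k)))

  // Semi join that only sees the single representative row for the key.
  def semiJoinOneRowPerKey: Seq[String] =
    data.filter(a => oneRowPerKey.get(joinKey(a)).exists(k => cond(a, k)))
}
```

With all rows kept, xxxx1111 finds the matching build row xxxx11. With one row per key, the only candidate is xxxx22, which fails the condition, so the row is dropped.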

After 1f35577, a similar issue happens with integer keys:

create or replace temp view data(a) as values
(10000),
(30000);

create or replace temp view lookup(k) as values
(1000),
(1001),
(1002),
(1003),
(1004);

-- this query returns one row
select * from data left semi join lookup on a/10000 = cast(k/1000 as int) and k >=  a/10 + 3;

-- this is the same query as above, but with a shuffle hash join hint, and returns no rows
select /*+ SHUFFLE_HASH(lookup) */ * from data left semi join lookup on a/10000 = cast(k/1000 as int) and k >=  a/10 + 3;

Does this PR introduce any user-facing change?

No, except for fixing the correctness issue.

How was this patch tested?

Modified an existing unit test.

Was this patch authored or co-authored using generative AI tooling?

No.

YuzhouSun · 2025-08-19T07:04:12Z

sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala

-      // 2. Join condition only references streamed attributes and build join keys.
-      val streamedOutputAndBuildKeys = AttributeSet(streamedOutput ++ buildKeys)
-      condition.forall(_.references.subsetOf(streamedOutputAndBuildKeys))
+    case LeftExistence(_) if condition.isEmpty => true


Thanks for fixing this issue!

Instead of disabling ignoreDuplicatedKey in this case, is it possible to relax the requirement so that the references in the condition must all be in streamedOutput ++ buildKeysThatAreAttributes? E.g., diff:

- val streamedOutputAndBuildKeys = AttributeSet(streamedOutput ++ buildKeys)
+ val attrBuildKeys = buildKeys.filter(_.isInstanceOf[Attribute])
+ val streamedOutputAndBuildKeys = AttributeSet(streamedOutput ++ attrBuildKeys)

The current master branch allows buildKeys’ references in the condition (AttributeSet extracts references), while this diff limits them to streamedOutput and the buildKeys that are Attributes.

peter-toth · 2025-08-19

Or maybe something like:

val streamedOutputSet = AttributeSet(streamedOutput)
val buildKeysSet = ExpressionSet(buildKeys)
condition.forall { c =>
  var valid = true
  c.transformDownWithPruning(c => valid && !buildKeysSet.contains(c.asInstanceOf[Expression])) {
    case a: Attribute if !streamedOutputSet.contains(a) =>
      valid = false
      a
  }
  valid
}

to allow full expressions from buildKeys to appear in condition.

bersprockets · 2025-08-20

@peter-toth It's hard for me to grok that traversal, but I think the gist is:

Any subtree of condition that is not a semantic match to a build-side join key must be checked for naughty attributes.

Set valid to false when we hit a naughty attribute in such a subtree.

We can short circuit the traversal once valid is set to false.

I wish there was another way to express that, but I don't know how else to skip selected subtrees of condition.

peter-toth · 2025-08-20

Yes, exactly. But we can write a recursive function if that's easier to grasp.

bersprockets · 2025-08-20

Yours is nice and compact, but I'll try an explicit traversal and see how sprawling it gets.
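For illustration, an explicit recursive traversal could look roughly like the sketch below. It is not the code in this PR: it uses a hypothetical toy expression tree (Expr, Attr, Lit, Call) instead of Catalyst classes, and plain structural equality stands in for the semantic matching that ExpressionSet provides.

```scala
object IgnoreDupKeySketch {
  // Toy expression tree; a hypothetical stand-in for Catalyst's Expression.
  sealed trait Expr
  final case class Attr(name: String) extends Expr
  final case class Lit(value: Int) extends Expr
  final case class Call(fn: String, args: Seq[Expr]) extends Expr

  // True when every attribute in cond that lies outside a subtree equal to a
  // build-side join key is a stream-side attribute.
  def validCond(cond: Expr, buildKeys: Set[Expr], streamedAttrs: Set[Attr]): Boolean = {
    // Nested helper, so buildKeys and streamedAttrs need not be passed along.
    def loop(e: Expr): Boolean =
      if (buildKeys.contains(e)) true // whole subtree is a build key: skip it
      else e match {
        case a: Attr       => streamedAttrs.contains(a) // build-side attribute => invalid
        case _: Lit        => true
        case Call(_, args) => args.forall(loop) // forall stops at the first failure
      }
    loop(cond)
  }
}
```

Applied to the string example, the condition substring(a, 1, 6) >= k is rejected because k appears outside any subtree that matches the build key substring(k, 1, 4), while substring(a, 1, 6) >= substring(k, 1, 4) would be accepted.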

peter-toth

This is a good catch!
Fix looks good to me, but maybe we can improve it a bit.

peter-toth · 2025-08-21T20:34:17Z

sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala

+    validCond0(cond, buildKeysSet, streamedOutputAttrs)
+  }
+
+  private def validCond0(cond: Expression,


This doesn't need to be a top-level function; it can be nested under validCondForIgnoreDupKey(), so you don't need to pass in buildKeysSet and streamedOutputAttrs. This is just a nit, though.

bersprockets · 2025-08-21

I will fix that.

bersprockets added 3 commits August 14, 2025 14:22

Add test

cdc5717

Update

b4ea794

Update test name

e6c2b39

github-actions bot added the SQL label Aug 18, 2025

bersprockets changed the title ~~[SPARK-52873][SQL] Don't ignore duplicate keys in SHJ when there is a bound condition~~ [SPARK-52873][SQL] SHJ shouldn't ignore duplicate keys when there is a bound condition Aug 18, 2025

YuzhouSun reviewed Aug 19, 2025

peter-toth approved these changes Aug 19, 2025

Update

cf7afcd

bersprockets changed the title ~~[SPARK-52873][SQL] SHJ shouldn't ignore duplicate keys when there is a bound condition~~ [SPARK-52873][SQL] Further restrict when SHJ semi/anti join can ignore duplicate keys on the build side Aug 21, 2025

peter-toth reviewed Aug 21, 2025

Update

5933402
