-
Notifications
You must be signed in to change notification settings - Fork 28.8k
[SPARK-52982][PYTHON] Disallow lateral join with Arrow Python UDTFs #52048
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-52982][PYTHON] Disallow lateral join with Arrow Python UDTFs #52048
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Otherwise, LGTM.
""" | ||
SELECT t.id, f.x, f.result | ||
FROM test_table t, LATERAL simple_arrow_udtf(t.id) f | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: one more indent?
""" | |
SELECT t.id, f.x, f.result | |
FROM test_table t, LATERAL simple_arrow_udtf(t.id) f | |
""" | |
""" | |
SELECT t.id, f.x, f.result | |
FROM test_table t, LATERAL simple_arrow_udtf(t.id) f | |
""" |
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: revert this change?
LateralSubquery(SubqueryAlias(alias, tvfWithTableColumnIndexes)), Inner, None) | ||
|
||
// Set the tag so that it can be used to differentiate lateral join added by | ||
// TABLE argument vs added by user. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why arrow UDTF is fine for LATERAL JOIN generated by TABLE arguement?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This lateral join is only used for analyzer to resolve the plan and it will be removed during the optimization stage and become:
EvalPythonUDTFArrow
+- SubqueryAlias
+- Table Subquery
@@ -4086,6 +4086,13 @@ | |||
], | |||
"sqlState" : "42K0L" | |||
}, | |||
"LATERAL_JOIN_WITH_ARROW_UDTF_UNSUPPORTED" : { | |||
"message" : [ | |||
"LATERAL JOIN with Arrow-optimized user-defined table functions (UDTFs) is not supported. Arrow UDTFs cannot be used on the right-hand side of a lateral join.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why arrow UDTF cannot support lateral join?
is it because the row id is missing? Just wondering can we provide a arrow column containing the row id, so that the JVM side can join input rows and output rows again
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes that's one option to support lateral join with arrow UDTFs, but I expect most of the use cases will be via TABLE arguments.
b245030
to
4d88390
Compare
Thanks for the review! Merge to master. |
What changes were proposed in this pull request?
Arrow-optimized User-Defined Table Functions (UDTFs) have arbitrary output cardinality, meaning they can produce different numbers of rows for the same batch input rows. This fundamental characteristic is incompatible with lateral join semantics which combines each row from LHS with all rows from RHS for that particular input row. This PR implements a restriction that blocks Arrow Python UDTFs on the right hand side of lateral join:
SELECT * FROM table, LATERAL arrow_udtf(...)
.It is always recommended to use table argument with arrow UDTFs:
SELECT * FROM arrow_udtf(table(...))
Note Regular UDTFs with lateral joins are allowed (unchanged behavior).
Why are the changes needed?
Prevents runtime errors and inconsistent results while maintaining full Arrow UDTF functionality for supported patterns.
Does this PR introduce any user-facing change?
Yes
How was this patch tested?
New unit tests.
Was this patch authored or co-authored using generative AI tooling?
No