Skip to content

Conversation

allisonwang-db
Copy link
Contributor

@allisonwang-db allisonwang-db commented Aug 15, 2025

What changes were proposed in this pull request?

Arrow-optimized User-Defined Table Functions (UDTFs) have arbitrary output cardinality, meaning they can produce different numbers of rows for the same batch input rows. This fundamental characteristic is incompatible with lateral join semantics which combines each row from LHS with all rows from RHS for that particular input row. This PR implements a restriction that blocks Arrow Python UDTFs on the right hand side of lateral join: SELECT * FROM table, LATERAL arrow_udtf(...).

It is always recommended to use table argument with arrow UDTFs: SELECT * FROM arrow_udtf(table(...))

Note Regular UDTFs with lateral joins are allowed (unchanged behavior).

Why are the changes needed?

Prevents runtime errors and inconsistent results while maintaining full Arrow UDTF functionality for supported patterns.

Does this PR introduce any user-facing change?

Yes

How was this patch tested?

New unit tests.

Was this patch authored or co-authored using generative AI tooling?

No

Copy link
Member

@ueshin ueshin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise, LGTM.

Comment on lines 484 to 487
"""
SELECT t.id, f.x, f.result
FROM test_table t, LATERAL simple_arrow_udtf(t.id) f
"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: one more indent?

Suggested change
"""
SELECT t.id, f.x, f.result
FROM test_table t, LATERAL simple_arrow_udtf(t.id) f
"""
"""
SELECT t.id, f.x, f.result
FROM test_table t, LATERAL simple_arrow_udtf(t.id) f
"""

Comment on lines 4255 to 4256


Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: revert this change?

LateralSubquery(SubqueryAlias(alias, tvfWithTableColumnIndexes)), Inner, None)

// Set the tag so that it can be used to differentiate lateral join added by
// TABLE argument vs added by user.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why arrow UDTF is fine for LATERAL JOIN generated by TABLE arguement?

Copy link
Contributor Author

@allisonwang-db allisonwang-db Aug 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This lateral join is only used for analyzer to resolve the plan and it will be removed during the optimization stage and become:

EvalPythonUDTFArrow
+- SubqueryAlias
  +- Table Subquery

@@ -4086,6 +4086,13 @@
],
"sqlState" : "42K0L"
},
"LATERAL_JOIN_WITH_ARROW_UDTF_UNSUPPORTED" : {
"message" : [
"LATERAL JOIN with Arrow-optimized user-defined table functions (UDTFs) is not supported. Arrow UDTFs cannot be used on the right-hand side of a lateral join.",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why arrow UDTF cannot support lateral join?
is it because the row id is missing? Just wondering can we provide a arrow column containing the row id, so that the JVM side can join input rows and output rows again

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that's one option to support lateral join with arrow UDTFs, but I expect most of the use cases will be via TABLE arguments.

@allisonwang-db allisonwang-db force-pushed the spark-52982-disallow-lateral-join branch from b245030 to 4d88390 Compare August 21, 2025 00:50
@allisonwang-db
Copy link
Contributor Author

Thanks for the review! Merge to master.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants