
[SPARK-53330][SQL][PYTHON] Fix Arrow UDF with DayTimeIntervalType (bounds != start/end) #52077


Open · benrobby wants to merge 5 commits into master

Conversation


@benrobby commented on Aug 19, 2025

What changes were proposed in this pull request?

  • Makes the ArrowEvalPythonExec type check more lenient when comparing DayTimeIntervalTypes: two interval types are considered equal when the source type carries at least as much information as the target type (see the sketch below). Arrow serialization always sends the full interval either way, so this condition always holds, and we can then rely on the engine to interpret the data according to the node's output type.
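
A minimal sketch of the compatibility rule described above, written in Python rather than the actual Scala check in ArrowEvalPythonExec; the helper name is_compatible_interval is hypothetical:

from pyspark.sql.types import DayTimeIntervalType

def is_compatible_interval(actual: DayTimeIntervalType, expected: DayTimeIntervalType) -> bool:
    # The type coming back from Arrow always spans the full DAY TO SECOND range,
    # so it is accepted whenever it covers at least the fields of the declared
    # return type; the engine then reinterprets the data using the output type.
    return actual.startField <= expected.startField and actual.endField >= expected.endField

# Arrow hands back DayTimeIntervalType(0, 3); the user declared DayTimeIntervalType(1, 3).
assert is_compatible_interval(DayTimeIntervalType(0, 3), DayTimeIntervalType(1, 3))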

Why are the changes needed?

When a PySpark UDF (useArrow=True) returns interval data, it currently fails with the error below whenever the result type (e.g., DayTimeIntervalType) has start/end fields that do not span the maximum range.

org.apache.spark.SparkException: [ARROW_TYPE_MISMATCH] Invalid schema from pandas_udf(): expected DayTimeIntervalType(1,3), got DayTimeIntervalType(0,3). SQLSTATE: 42K0G

Repro:

from pyspark.sql.types import DayTimeIntervalType
from pyspark.sql.functions import udf
 
# this works
@udf(useArrow=True, returnType=DayTimeIntervalType(0, 3))
def return_interval1(x):
  return x

# this fails, although it matches the input type HOUR TO SECOND
@udf(useArrow=True, returnType=DayTimeIntervalType(1, 3)) 
def return_interval2(x):
  return x

spark.sql("SELECT INTERVAL '10:30:45.123' HOUR TO SECOND as value").select(return_interval2("value")).collect()

The cause is that when the worker sends data back, it always just sends a full Arrow duration, which does not carry the start or end field. In the above example the start field should be HOUR (1), which causes the node to throw the ARROW_TYPE_MISMATCH error shown above.
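
As an illustration (a small sketch assuming pyarrow is installed): Spark maps DayTimeIntervalType to Arrow's duration type with microsecond precision, which has no place to record the interval's start/end fields.

import pyarrow as pa

# Arrow's duration type is parameterized only by a time unit; it cannot record
# DayTimeIntervalType's start/end fields, so DayTimeIntervalType(0, 3) and
# DayTimeIntervalType(1, 3) both come back from the worker as the same Arrow type.
arrow_type = pa.duration("us")
print(arrow_type)  # duration[us]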

YearMonthIntervalType is not supported in Arrow UDFs, so it is not a concern here.

Does this PR introduce any user-facing change?

Yes, a bug fix that enables behavior that previously threw an error.

How was this patch tested?

  • Added Python tests.

Was this patch authored or co-authored using generative AI tooling?

No

@benrobby benrobby changed the title [SPARK-53330] Fix Arrow UDF support for DayTimeIntervalType with bounds != start-end [SPARK-53330] Fix Arrow UDF with DayTimeIntervalType with bounds != start-end Aug 19, 2025
@benrobby benrobby changed the title [SPARK-53330] Fix Arrow UDF with DayTimeIntervalType with bounds != start-end [SPARK-53330] Fix Arrow UDF with DayTimeIntervalType (bounds != start/end) Aug 19, 2025
@benrobby (Author) commented:

Hi @HyukjinKwon, I see that you authored the DayTimeInterval support. Could you take a look?

@HyukjinKwon HyukjinKwon changed the title [SPARK-53330] Fix Arrow UDF with DayTimeIntervalType (bounds != start/end) [SPARK-53330][SQL][PYTHON] Fix Arrow UDF with DayTimeIntervalType (bounds != start/end) Aug 21, 2025