
Conversation


@Yicong-Huang (Contributor) commented Oct 23, 2025

What changes were proposed in this pull request?

This PR adds support for the Iterator[pandas.DataFrame] API in groupBy().applyInPandas(), enabling batch-by-batch processing of grouped data for improved memory efficiency and scalability.

Key Changes:

  1. New PythonEvalType: Added SQL_GROUPED_MAP_PANDAS_ITER_UDF to distinguish iterator-based UDFs from standard grouped map UDFs

  2. Type Inference: Implemented automatic detection of iterator signatures:

    • Iterator[pd.DataFrame] -> Iterator[pd.DataFrame]
    • Tuple[Any, ...], Iterator[pd.DataFrame] -> Iterator[pd.DataFrame]
  3. Streaming Serialization: Created GroupPandasIterUDFSerializer that streams results without materializing all DataFrames in memory

  4. Configuration Change: Updated FlatMapGroupsInPandasExec which was hardcoding pythonEvalType = 201 instead of extracting it from the UDF expression (mirrored fix from FlatMapGroupsInArrowExec)
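The iterator-signature detection described in item 2 can be sketched roughly as follows. This is a minimal standalone approximation, not the actual Spark implementation: the function name `is_iterator_signature` is invented here, and `object` stands in for `pd.DataFrame` so the sketch has no pandas dependency.

```python
import collections.abc
import inspect
from typing import Any, Iterator, Tuple, get_origin

def is_iterator_signature(func) -> bool:
    """Heuristic check: do the annotations look like
    Iterator[...] -> Iterator[...], optionally preceded by a
    Tuple[...] grouping-key parameter?"""
    anns = [p.annotation for p in inspect.signature(func).parameters.values()]

    def is_iter(ann):
        # get_origin(Iterator[X]) resolves to collections.abc.Iterator
        return get_origin(ann) is collections.abc.Iterator

    if len(anns) == 1:
        return is_iter(anns[0])
    if len(anns) == 2:
        # Leading grouping-key tuple, then the batch iterator
        return get_origin(anns[0]) is tuple and is_iter(anns[1])
    return False

def iter_udf(batches: Iterator[object]) -> Iterator[object]:
    yield from batches

def keyed_udf(key: Tuple[Any, ...], batches: Iterator[object]) -> Iterator[object]:
    yield from batches

def plain_udf(pdf: object) -> object:
    return pdf

detected = (
    is_iterator_signature(iter_udf),
    is_iterator_signature(keyed_udf),
    is_iterator_signature(plain_udf),
)
```

A UDF whose annotations match neither iterator shape falls back to the standard grouped-map path.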

Why are the changes needed?

The existing applyInPandas() API loads entire groups into memory as single DataFrames. For large groups, this can cause OOM errors. The iterator API allows:

  • Memory Efficiency: Process data batch-by-batch instead of materializing entire groups
  • Scalability: Handle arbitrarily large groups that don't fit in memory
  • Consistency: Mirrors the existing applyInArrow() iterator API design

Does this PR introduce any user-facing changes?

Yes, this PR adds a new API variant for applyInPandas():

Before (existing API, still supported):

import pandas as pd

def normalize(pdf: pd.DataFrame) -> pd.DataFrame:
    return pdf.assign(v=(pdf.v - pdf.v.mean()) / pdf.v.std())

df.groupBy("id").applyInPandas(normalize, schema="id long, v double")

After (new iterator API):

import pandas as pd
from typing import Iterator

def normalize(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    # Process data batch-by-batch
    for batch in batches:
        yield batch.assign(v=(batch.v - batch.v.mean()) / batch.v.std())

df.groupBy("id").applyInPandas(normalize, schema="id long, v double")

With Grouping Keys:

import pandas as pd
from typing import Iterator, Tuple, Any

def sum_by_key(key: Tuple[Any, ...], batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    total = 0
    for batch in batches:
        total += batch['v'].sum()
    yield pd.DataFrame({"id": [key[0]], "total": [total]})

df.groupBy("id").applyInPandas(sum_by_key, schema="id long, total double")
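For intuition, the batch-by-batch contract can be exercised without Spark at all: the UDF receives one group's rows as an iterator of pandas DataFrames and aggregates incrementally, never holding the whole group. The batch sizes below are made up for illustration.

```python
from typing import Iterator

import pandas as pd

def sum_batches(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    total = 0.0
    for batch in batches:  # one batch at a time, never the full group
        total += batch["v"].sum()
    yield pd.DataFrame({"total": [total]})

# Simulate one group arriving as three small batches:
group = (pd.DataFrame({"v": [float(i), float(i) + 0.5]}) for i in range(3))
out = pd.concat(sum_batches(group), ignore_index=True)
```

Because `group` is itself a generator, nothing beyond the current batch is ever materialized on the Python side.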

Backward Compatibility: The existing DataFrame-to-DataFrame API is fully preserved and continues to work without changes.

How was this patch tested?

  • Added test_apply_in_pandas_iterator_basic - Basic functionality test
  • Added test_apply_in_pandas_iterator_with_keys - Test with grouping keys
  • Added test_apply_in_pandas_iterator_batch_slicing - Pressure test with 10M rows, 20 columns
  • Added test_apply_in_pandas_iterator_with_keys_batch_slicing - Pressure test with keys

Was this patch authored or co-authored using generative AI tooling?

Yes, tests generated by Cursor.

@Yicong-Huang Yicong-Huang changed the title [WIP][SPARK-53614] Add applyInPandas [WIP][SPARK-53614] Add Iterator[pandas.DataFrame] support to applyInPandas Oct 27, 2025
@Yicong-Huang Yicong-Huang changed the title [WIP][SPARK-53614] Add Iterator[pandas.DataFrame] support to applyInPandas [SPARK-53614] Add Iterator[pandas.DataFrame] support to applyInPandas Oct 27, 2025
@zhengruifeng zhengruifeng changed the title [SPARK-53614] Add Iterator[pandas.DataFrame] support to applyInPandas [SPARK-53614][PYTHON] Add Iterator[pandas.DataFrame] support to applyInPandas Oct 29, 2025
@zhengruifeng left a comment:

LGTM, only a few minor comments

from pyspark.sql.connect.udf import UserDefinedFunction
from pyspark.sql.connect.dataframe import DataFrame
from pyspark.sql.pandas.typehints import infer_group_pandas_eval_type_from_func
import warnings
@zhengruifeng suggested change (remove the unused import):
- import warnings

@Yicong-Huang (author):
removed

self.assertEqual(expected, result)

def test_apply_in_pandas_iterator_with_keys_batch_slicing(self):
from typing import Iterator, Tuple, Any
@zhengruifeng:
Such imports should be moved to the head of the file.

@Yicong-Huang (author):
moved


def test_apply_in_pandas_iterator_process_multiple_input_batches(self):
from typing import Iterator
import builtins
@zhengruifeng:
Why do we need to import builtins?
I think there is no name conflict if we use sf.max/min/sum in this file.

@Yicong-Huang (author), Oct 30, 2025:

Somehow, when I use sum directly, it uses column.sum. Do you know the reason? I changed to builtins to avoid this conflict.

@Yicong-Huang (author):
moved typing import

@zhengruifeng:
We don't have column.sum; do you mean sf.sum?
In some test files, sum is imported, so the builtin sum is overridden.
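The shadowing being discussed can be reproduced without Spark. The `def sum` below is a stand-in for `from pyspark.sql.functions import sum`, which rebinds the module-level name `sum` away from the builtin:

```python
import builtins

# Stand-in for `from pyspark.sql.functions import sum`: a function that
# returns a Column-like object instead of performing arithmetic.
def sum(col):
    return f"Column<sum({col})>"

values = [1, 2, 3]
shadowed = sum("v")           # resolves to the shadowing import above
total = builtins.sum(values)  # explicitly reaches the arithmetic builtin
```

Qualifying the call as `builtins.sum` (or avoiding the bare-name import entirely) sidesteps the conflict.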

)

# Verify that all rows are present after concatenation
self.assertEqual(len(result), 6)
@zhengruifeng:
let's directly compare the rows

self.assertEqual(result, [Row(...), Row(...), ...])

@Yicong-Huang (author):
updated
