@CTTY CTTY commented Aug 19, 2025

Which issue does this PR close?

What changes are included in this PR?

  • Add sort_by_partition, which sorts the repartitioned data by Iceberg partition values so that a non-fanout writer can write all the data from a DataFusion partition.
    • The actual sorting is pushed down to DataFusion's SortExec, using a LexOrdering over the projected partition column value.
  • This PR does NOT integrate the sort node with IcebergTableProvider::insert_into. That should be done after "Make fanout/clustered mode configurable for datafusion write" (#1834) is addressed: we should sort when the user is using clustered writing mode and the data is projected/partitioned.
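
To illustrate why the sort matters, here is a minimal, std-only sketch. It does not use the iceberg-rust or DataFusion APIs; `ClusteredWriter`, `sort_rows_by_partition`, and `write_all` are hypothetical names invented for this example. It assumes that a non-fanout writer keeps only one partition open at a time and rejects rows for a partition it has already closed, which is why each partition's rows must arrive contiguously.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a non-fanout (clustered) writer.
/// Assumption: it keeps exactly one partition open at a time and
/// rejects rows for a partition that has already been closed.
struct ClusteredWriter {
    current: Option<String>,
    closed: HashSet<String>,
    files: Vec<(String, Vec<i64>)>,
}

impl ClusteredWriter {
    fn new() -> Self {
        Self {
            current: None,
            closed: HashSet::new(),
            files: Vec::new(),
        }
    }

    fn write(&mut self, partition: &str, value: i64) -> Result<(), String> {
        if self.current.as_deref() != Some(partition) {
            // Switching partitions: refuse to reopen one that was closed.
            if self.closed.contains(partition) {
                return Err(format!("partition {partition} was already closed"));
            }
            if let Some(prev) = self.current.take() {
                self.closed.insert(prev);
            }
            self.current = Some(partition.to_string());
            self.files.push((partition.to_string(), Vec::new()));
        }
        self.files.last_mut().expect("a file is open").1.push(value);
        Ok(())
    }
}

/// Sort rows by their partition value (stable), standing in for the
/// sort that the PR pushes down to a sort node.
fn sort_rows_by_partition(rows: &mut [(String, i64)]) {
    rows.sort_by(|a, b| a.0.cmp(&b.0));
}

/// Feed every row to a fresh writer; return the number of files opened.
fn write_all(rows: &[(String, i64)]) -> Result<usize, String> {
    let mut writer = ClusteredWriter::new();
    for (partition, value) in rows {
        writer.write(partition, *value)?;
    }
    Ok(writer.files.len())
}

fn main() {
    let mut rows = vec![
        ("2025-08-19".to_string(), 1),
        ("2025-08-20".to_string(), 2),
        ("2025-08-19".to_string(), 3),
    ];
    // Unsorted input interleaves partitions, so the writer fails.
    assert!(write_all(&rows).is_err());
    // Sorted input makes each partition contiguous: one file per partition.
    sort_rows_by_partition(&mut rows);
    assert_eq!(write_all(&rows), Ok(2));
}
```

A fanout writer would avoid the error by keeping every partition open at once, at the cost of more open files; sorting first is the trade-off that lets the single-open-partition writer handle the same input.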

Are these changes tested?

Added unit tests.

@CTTY CTTY force-pushed the ctty/sort-partition branch from df31c62 to b15b2ea Compare November 18, 2025 21:34
@CTTY CTTY changed the title from "feat(datafusion): Add IcebergPartitionSortExec to sort data for partitioned table" to "feat(datafusion): Add sort_by_partition to sort the input partitioned data" Nov 18, 2025
@CTTY CTTY marked this pull request as ready for review November 18, 2025 21:47

Development

Successfully merging this pull request may close these issues.

Implement Sort Node: Sort the input data
