-
Notifications
You must be signed in to change notification settings - Fork 2.9k
Spark 4.0: Add truncate transform tests (non-SPJ only). #13907
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@huaxingao Could you please help review this PR? Many thanks! While checking the SPJ unit tests, I noticed a |
|
|
||
| // TODO: add tests for truncate transforms once SPARK-40295 is released | ||
| // TODO: Truncate is not supported by SPJ yet (even after SPARK-40295). | ||
| // Add tests for full support once SPARK-50593 is resolved. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we change to something like
// TODO: SPJ does not currently leverage truncate(...) partition transforms for partition alignment.
// SPARK-40295 improved related areas, but full truncate support is tracked in SPARK-50593.
// This test documents current behavior; update/extend once SPARK-50593 lands.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your suggestion! I will improve the code accordingly.
| sql( | ||
| "CREATE TABLE %s (id BIGINT, int_col INT, dep STRING) " | ||
| + "USING iceberg " | ||
| + "PARTITIONED BY (truncate(4, dep)) " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test name says ...IncompatibleTruncateSpecs but both tables use the same partitioning (truncate(4, dep)). Could you change the test name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with your suggestion and have renamed the method to testJoinWithTruncatePartitioning.
| sql("INSERT INTO %s VALUES (3L, 300, 'software')", tableName(OTHER_TABLE_NAME)); | ||
|
|
||
| assertPartitioningAwarePlan( | ||
| 3, /* expected num of shuffles with SPJ */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we add a comment
// TODO(SPARK-50593): Once truncate transforms are leveraged by SPJ, expected shuffles with SPJ should drop to 1. Update expectedNumShufflesWithSPJ accordingly.
- Renamed method from testJoinsWithIncompatibleTruncateSpecs to testJoinsWithTruncatePartitioning since both tables use the same truncate(4, dep). - Added TODO comment referencing SPARK-40295 and SPARK-50593 to document current limitations and indicate future improvements.
| // This test documents current behavior; update/extend once SPARK-50593 lands. | ||
| @TestTemplate | ||
| public void testJoinWithTruncatePartitioning() { | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: remove empty line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your suggestion, I will improve this code.
huaxingao
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
| sql("INSERT INTO %s VALUES (2L, 200, 'software')", tableName(OTHER_TABLE_NAME)); | ||
| sql("INSERT INTO %s VALUES (3L, 300, 'software')", tableName(OTHER_TABLE_NAME)); | ||
|
|
||
| // TODO(SPARK-50593): Once truncate transforms are leveraged by SPJ, expected shuffles with SPJ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what we gain from the tests prior to 50593 going in. I would probably hold off on this unless we think we are getting some other coverage here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your feedback! I understand your point and agree to some extent. However, I believe we should at least improve the comments to indicate that, even though SPARK-40295 is ready, we still need to wait for SPARK-50593 before we can proceed with this TODO.
Let me briefly explain the reason for this improvement:
When I first discovered this TODO, I checked SPARK-40295 and found that it was ready. I then tested it locally and found that, even with SPARK-40295 implemented, truncate still cannot utilize Spark's SPJ features. My initial thought was to directly modify the TODO, but I was concerned that a review might inquire about the issue I encountered. So, I decided to document my verification tests here. Currently, this test can only verify that, under SPARK-40295, truncate still does not support SPJ.
|
Thank you very much to @huaxingao , @singhpk234 , and @RussellSpitzer for reviewing this PR and for your time. The comment below is no longer accurate, and I personally think it’s necessary to modify it. iceberg/spark/v4.0/spark/src/test/java/org/apache/iceberg/spark/sql/TestStoragePartitionedJoins.java Line 136 in fdd04c9
However, I was also thinking that submitting a PR just to modify a single line of comment might seem a bit trivial. If I can find another relevant change, I’d be happy to include that as well. Additionally, I’ve added the JIRA link for |
|
After SPARK-50593 is addressed, it would be better to update and improve this unit test. Therefore, I will close this PR for now and follow up on SPARK-50593, with the aim of completing the support for the truncate function in SPJ scenarios. Thanks again to @huaxingao , @singhpk234 , and @RussellSpitzer for your attention and comments! |
What changes were proposed in this pull request?
This PR adds unit tests for
truncate(...)partition transforms after SPARK-40295. Currently, truncate-based partitioning is not yet supported by SPJ. Therefore, this PR only covers non-SPJ scenarios (e.g. correctness and query results).Why are the changes needed?
To validate truncate transforms are working as expected in Iceberg tables and Spark queries. Full SPJ support for truncate requires SPARK-50593. Once that issue is resolved, we can extend the tests to cover shuffle reduction with SPJ enabled.