Contributor

@sryza sryza commented Jul 15, 2025

This reverts commit 8b43757.

What changes were proposed in this pull request?

Reverts SPARK-52576. That is, materialized views and full refreshes go back to truncate + alter instead of drop + recreate.

Why are the changes needed?

Some pipeline runs result in wiping out and replacing all the data for a table:

  • Every run of a materialized view
  • Runs of streaming tables that have the "full refresh" flag

Prior to SPARK-52576, this "wipe out and replace" was implemented by:

  • Truncating the table
  • Altering the table to drop/update/add columns that don't match the columns in the DataFrame for the current run

However, we discovered that this didn't work on Hive, so we moved to drop + recreate, which did. Compared to truncate + alter, though, drop + recreate has some undesirable effects: for example, it interrupts readers of the table and wipes away things like ACLs.

This Hive behavior was fixed here: #51007.

So now we can switch back to truncate + alter.
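
For illustration only (this is not the code in this PR): a minimal sketch of what "truncate + alter" amounts to, assuming the existing table schema and the current run's DataFrame schema are each available as a plain name-to-type mapping. The function name `plan_truncate_and_alter` and the statement strings are hypothetical; the real implementation goes through Spark's catalog APIs rather than building SQL text.

```python
from typing import Dict, List


def plan_truncate_and_alter(
    table: str,
    existing: Dict[str, str],
    target: Dict[str, str],
) -> List[str]:
    """Plan the statements that reset `table` in place (hypothetical helper).

    `existing` maps column name -> type for the table as it is now.
    `target` maps column name -> type for the DataFrame of the current run.
    The table itself is never dropped, so table-level metadata is untouched.
    """
    # Step 1: wipe the data but keep the table object.
    statements = [f"TRUNCATE TABLE {table}"]

    # Step 2a: drop columns that the current run no longer produces.
    to_drop = [name for name in existing if name not in target]
    if to_drop:
        statements.append(
            f"ALTER TABLE {table} DROP COLUMNS ({', '.join(to_drop)})"
        )

    # Step 2b: update columns whose type differs from the current run.
    for name, dtype in target.items():
        if name in existing and existing[name] != dtype:
            statements.append(
                f"ALTER TABLE {table} ALTER COLUMN {name} TYPE {dtype}"
            )

    # Step 2c: add columns that are new in the current run.
    to_add = [f"{name} {dtype}" for name, dtype in target.items() if name not in existing]
    if to_add:
        statements.append(
            f"ALTER TABLE {table} ADD COLUMNS ({', '.join(to_add)})"
        )

    return statements


if __name__ == "__main__":
    for stmt in plan_truncate_and_alter(
        "sales_mv",
        existing={"id": "INT", "legacy_code": "STRING", "amount": "INT"},
        target={"id": "INT", "amount": "BIGINT", "region": "STRING"},
    ):
        print(stmt)
```

Drop + recreate would instead replace the whole plan with a `DROP TABLE` followed by a `CREATE TABLE`, which is what interrupts readers and discards ACLs.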

Does this PR introduce any user-facing change?

Yes, as described above.

How was this patch tested?

Existing tests

Was this patch authored or co-authored using generative AI tooling?

@sryza sryza requested review from gengliangwang and cloud-fan July 15, 2025 15:53
@github-actions github-actions bot added the SQL label Jul 15, 2025
@dongjoon-hyun
Member

cc @szehon-ho, too

Member

@szehon-ho szehon-ho left a comment

Great, it works.

@szehon-ho
Member

One minor note: the Hive fix that allows replacing columns is actually in #51373.

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in eaf2017 Jul 15, 2025
haoyangeng-db pushed a commit to haoyangeng-db/apache-spark that referenced this pull request Jul 22, 2025
Closes apache#51497 from sryza/revert-drop-recreate.

Authored-by: Sandy Ryza <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>