
[SDP] [SPARK-52576] Drop/recreate on full refresh and MV update #51280


Closed
wants to merge 4 commits

Conversation

@sryza sryza (Contributor) commented Jun 25, 2025

What changes were proposed in this pull request?

Some pipeline runs result in wiping out and replacing all the data for a table:

  • Every run of a materialized view
  • Runs of streaming tables that have the "full refresh" flag

In the current implementation, this "wipe out and replace" is implemented by:

  • Truncating the table
  • Altering the table to drop/update/add columns that don't match the columns in the DataFrame for the current run

The reason we originally wanted to truncate + alter instead of drop/recreate is that dropping has some undesirable effects: e.g., it interrupts readers of the table and wipes away things like ACLs.

However, we discovered that not all catalogs support dropping columns (e.g. Hive does not), and there’s no way to tell whether a catalog supports dropping columns or not. So this PR changes the implementation to drop/recreate the table instead of truncate/alter.
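The decision described above, dropping and recreating when the table already exists and the run is either a materialized-view update or a full-refresh streaming run, can be sketched as a standalone predicate. This is an illustrative sketch under assumptions, not the actual Spark implementation; the object and parameter names here are hypothetical.

```scala
// Illustrative sketch of the drop/recreate decision described above.
// Not the actual Spark implementation; the names are hypothetical.
object RefreshPolicy {
  // A table is dropped and recreated when it already exists and the run
  // either carries the "full refresh" flag or targets a materialized view
  // (i.e. a non-streaming table, which is fully recomputed on every run).
  def shouldDropAndRecreate(
      tableExists: Boolean,
      isStreamingTable: Boolean,
      isFullRefresh: Boolean): Boolean =
    tableExists && (isFullRefresh || !isStreamingTable)
}
```

When the predicate holds, the table is dropped via the catalog and recreated with the schema of the current run's DataFrame, rather than truncated and altered in place.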

Why are the changes needed?

See section above.

Does this PR introduce any user-facing change?

Yes, see section above. No releases contained the old behavior.

How was this patch tested?

  • Tests in MaterializeTablesSuite
  • Ran the tests in MaterializeTablesSuite with Hive instead of the default catalog

Was this patch authored or co-authored using generative AI tooling?

No

@sryza sryza requested review from gengliangwang and cloud-fan June 25, 2025 17:51
@github-actions github-actions bot added the SQL label Jun 25, 2025
@sryza sryza changed the title [SDP] Drop/recreate on full refresh and MV update [SDP] [SPARK-52576] Drop/recreate on full refresh and MV update Jun 25, 2025
@szehon-ho szehon-ho (Member) left a comment

Makes sense; we can change this back if we add support for dropping columns for HMS in the V2SessionCatalog.

val dropTable = (isFullRefresh || !table.isStreamingTableOpt.get) && existingTableOpt.isDefined
if (dropTable) {
catalog.dropTable(identifier)
// context.spark.sql(s"DROP TABLE ${table.identifier.quotedString}")
@szehon-ho szehon-ho (Member) commented Jun 25, 2025

nit: remove? Optionally add comment about why not truncate/alter?

@@ -446,8 +446,9 @@ class MaterializeTablesSuite extends BaseCoreExecutionTest {

       val table2 = catalog.loadTable(identifier)
       assert(
-        table2.columns() sameElements CatalogV2Util
-          .structTypeToV2Columns(new StructType().add("y", IntegerType).add("x", BooleanType))
+        table2.columns().toSet == CatalogV2Util
Member

why do we need this change?

Contributor Author

The ordering of columns does not appear to be deterministic (at least across different catalog implementations). Is that unexpected?

Contributor

For a table, the column order matters. I think we should keep the test as it is and fix the issues we found.

@sryza sryza (Contributor Author) commented Jun 26, 2025

Are we able to get some help with fixing this? What I'm observing is that, with Hive, when I create a table using the following:

      catalog.createTable(
        identifier,
        new TableInfo.Builder()
          .withProperties(mergedProperties.asJava)
          .withColumns(CatalogV2Util.structTypeToV2Columns(outputSchema))
          .withPartitions(partitioning.toArray)
          .build()
      )

and then later fetch the columns using

catalog.loadTable(identifier).columns()

the columns are returned in a different order than they appear in outputSchema.

This happens only with Hive, not the default catalog.

@szehon-ho szehon-ho (Member) commented Jun 26, 2025

Strange, I can take a look.

I ran the test case in HiveDDLSuite a few times and can't reproduce it.

    val catalog = spark.sessionState.catalogManager.currentCatalog.asInstanceOf[TableCatalog]
    withTable("t1") {
      val identifier = Identifier.of(Array("default"), "t1")
      val outputSchema = new StructType()
        .add("a", IntegerType, true, "comment1")
        .add("b", IntegerType, true, "comment2")
        .add("c", IntegerType, true, "comment3")
        .add("d", IntegerType, true, "comment4")
      catalog.createTable(
        identifier,
        new TableInfo.Builder()
          .withProperties(Map.empty.asJava)
          .withColumns(CatalogV2Util.structTypeToV2Columns(outputSchema))
          .withPartitions(Array.empty)
          .build()
      )
      val cols = catalog.loadTable(identifier).columns()
      assert(cols.length == 4)
      assert(cols(0).name() == "a")
      assert(cols(1).name() == "b")
      assert(cols(2).name() == "c")
      assert(cols(3).name() == "d")
    }

Is it reproducible with this PR itself?

Contributor

I think we should fix the Hive catalog to respect the user-specified column order. For now, I'm fine with temporarily ignoring some test cases in the upcoming Hive suite, either by overriding def excluded from SparkFunSuite, or by adding column-order methods and overriding them in the Hive suite.

@szehon-ho szehon-ho (Member) commented Jun 30, 2025

This is going to be a breaking change for HiveCatalog; should we make it off by default and enable it via a flag? Looks like some history: https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala#L820

@szehon-ho szehon-ho (Member) commented Jul 1, 2025

@cloud-fan @sryza @gengliangwang I made a patch here; does it look like what we want? #51342

Contributor Author

OK I've updated this PR to bring back the asserts on the ordering, and we can deal with the Hive tests in the PR that introduces them. @cloud-fan mind taking another look?

@szehon-ho szehon-ho (Member) commented Jul 7, 2025

> Hmm for some reason the original order is working now. Switched it back and now the tests are passing.

@sryza so we may not need #51342 after all? :) We can close it if it's not needed; let me know. At least, as part of it, we managed to make the underlying HiveExternalCatalog API better in #51373, which fixes a small issue (SPARK-52681). By the way, it also no longer prevents you from dropping columns and replacing them; can you give that a try as well after this PR?

@sryza sryza force-pushed the drop-on-full-refresh branch from 38c4d8d to c6280c4 Compare July 2, 2025 20:38
@cloud-fan (Contributor)

There is a test failure in MaterializeTablesSuite.

@sryza sryza force-pushed the drop-on-full-refresh branch from 4fbcae0 to 6b5d7b4 Compare July 7, 2025 15:21
@sryza sryza (Contributor Author) commented Jul 7, 2025

Hmm for some reason the original order is working now. Switched it back and now the tests are passing.

@cloud-fan (Contributor)

Thanks, merging to master!

@cloud-fan cloud-fan closed this in 8b43757 Jul 8, 2025
asl3 pushed a commit to asl3/spark that referenced this pull request Jul 14, 2025
Closes apache#51280 from sryza/drop-on-full-refresh.

Authored-by: Sandy Ryza <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
ksbeyer pushed a commit to ksbeyer/spark that referenced this pull request Jul 14, 2025
haoyangeng-db pushed a commit to haoyangeng-db/apache-spark that referenced this pull request Jul 22, 2025
4 participants