-
Notifications
You must be signed in to change notification settings - Fork 28.8k
[SPARK-53294][SS] Enable StateDataSource with state checkpoint v2 (only batchId option) #52047
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -617,6 +617,66 @@ StateDataSourceReadSuite { | |
} | ||
} | ||
|
||
class RocksDBWithCheckpointV2StateDataSourceReaderSuite extends StateDataSourceReadSuite { | ||
override protected def newStateStoreProvider(): RocksDBStateStoreProvider = | ||
new RocksDBStateStoreProvider | ||
|
||
import testImplicits._ | ||
|
||
override def beforeAll(): Unit = { | ||
super.beforeAll() | ||
spark.conf.set(SQLConf.STATE_STORE_CHECKPOINT_FORMAT_VERSION, 2) | ||
spark.conf.set(SQLConf.STATE_STORE_PROVIDER_CLASS.key, | ||
newStateStoreProvider().getClass.getName) | ||
spark.conf.set("spark.sql.streaming.stateStore.rocksdb.changelogCheckpointing.enabled", | ||
"true") | ||
} | ||
|
||
test("check unsupported modes with checkpoint v2") { | ||
withTempDir { tmpDir => | ||
val inputData = MemoryStream[(Int, Long)] | ||
val query = getStreamStreamJoinQuery(inputData) | ||
testStream(query)( | ||
StartStream(checkpointLocation = tmpDir.getCanonicalPath), | ||
AddData(inputData, (1, 1L), (2, 2L), (3, 3L), (4, 4L), (5, 5L)), | ||
ProcessAllAvailable(), | ||
Execute { _ => Thread.sleep(2000) }, | ||
StopStream | ||
) | ||
|
||
// Verify reading snapshot throws error with checkpoint v2 | ||
val exc1 = intercept[StateDataSourceInvalidOptionValue] { | ||
val stateSnapshotDf = spark.read.format("statestore") | ||
.option("snapshotPartitionId", 2) | ||
.option("snapshotStartBatchId", 0) | ||
.option("joinSide", "left") | ||
.load(tmpDir.getCanonicalPath) | ||
stateSnapshotDf.collect() | ||
} | ||
|
||
checkError(exc1, "STDS_INVALID_OPTION_VALUE.WITH_MESSAGE", "42616", | ||
Map( | ||
"optionName" -> StateSourceOptions.SNAPSHOT_START_BATCH_ID, | ||
"message" -> "Snapshot reading is currently not supported with checkpoint v2.")) | ||
|
||
// Verify reading change feed throws error with checkpoint v2 | ||
val exc2 = intercept[StateDataSourceInvalidOptionValue] { | ||
val stateDf = spark.read.format("statestore") | ||
.option(StateSourceOptions.READ_CHANGE_FEED, value = true) | ||
.option(StateSourceOptions.CHANGE_START_BATCH_ID, 0) | ||
.option(StateSourceOptions.CHANGE_END_BATCH_ID, 1) | ||
.load(tmpDir.getAbsolutePath) | ||
stateDf.collect() | ||
} | ||
|
||
checkError(exc2, "STDS_INVALID_OPTION_VALUE.WITH_MESSAGE", "42616", | ||
Map( | ||
"optionName" -> StateSourceOptions.READ_CHANGE_FEED, | ||
"message" -> "Read change feed is currently not supported with checkpoint v2.")) | ||
} | ||
} | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can we add a test that creates the checkpoint with checkpointv2 and tries to read from it with the config set to checkpoint v1 and vice-versa? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Currently, this will error out when reading the commit log. Should we do something like only checking this assertion if we are calling from the context of a streaming query? Or maybe just check for this assertion error? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Not sure I understand this. The reader should figure out the checkpoint format version it needs to use to read based on the commit log version right ? Which other config are we referring to here ? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It currently just trims the version number from the input then verifies it is equal to the expected version according to the sparkSession.conf There are a few different assertions in the methods we call that require the I see a few options:
What do you think? |
||
} | ||
|
||
abstract class StateDataSourceReadSuite extends StateDataSourceTestBase with Assertions { | ||
|
||
import testImplicits._ | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where do we add the operator specific tests ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extending
StateDataSourceReadSuite
will run all the tests in StateDataSourceReadSuite with the config set in beforeAll. This pattern already exists in this suite (example). I also attached the result of running this command to the PR description:There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the future when
changeStartBatchId
is added we will have to generate the golden files for those tests.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should you extend RocksDBStateDataSourceReadSuite here instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extending RocksDBStateDataSourceReadSuite would only add one test that tests for invalid options. It also sets:
I would prefer extending StateDataSourceReadSuite for now with plans to eventually extend RocksDBWithChangelogCheckpointStateDataSourceReaderSuite to include the tests with
changeStartBatchId
.I added:
To the test suite I added.