[Storage] Refactor insert chunk data pack #7939
Conversation
```go
// LockBootstrapping protects data that is *exclusively* written during bootstrapping.
LockBootstrapping = "lock_bootstrapping"
// LockInsertChunkDataPack protects the insertion of chunk data packs (not yet used anywhere
LockInsertChunkDataPack = "lock_insert_chunk_data_pack"
```
The chunk data pack is stored in its own Pebble database, so we can't reuse the InsertOwnReceipt lock.
Sorry, can you help by explaining this some more? As I understand it, a storage lock protects a set of keys ... more precisely the lock protects a set of storage codes, such as codeChunkDataPack. Even if those storage codes live in different databases, the locking would still ensure correctness (specifically prevent stale reads)?
I am asking because you write that we "cannot use InsertOwnReceipt"; to me that reads like "it is impossible to use InsertOwnReceipt". I just want to confirm my understanding that reusing InsertOwnReceipt would actually be possible, but we decided not to for performance reasons. Would you agree?
This lock will be removed in #7983
engine/execution/state/state.go
Outdated
```diff
  }

- err = s.chunkDataPacks.Store(chunks)
+ err = storage.WithLock(s.lockManager, storage.LockInsertChunkDataPack, func(lctx lockctx.Context) error {
```
This process acquires two locks; it's better to use the WithLock function instead of AcquireLock, because a misuse of defer lctx.Release() would hold the lock for too long.
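For illustration, a minimal sketch of the difference (signatures approximated from the snippets in this thread, not the exact PR code):

```go
// Manual acquisition: the deferred Release only runs when the *enclosing*
// function returns, so the lock can be held much longer than the write needs.
func saveManually(s *state) error {
	lctx := s.lockManager.NewContext()
	defer lctx.Release() // held until saveManually returns
	if err := lctx.AcquireLock(storage.LockInsertChunkDataPack); err != nil {
		return err
	}
	// ... perform the guarded writes ...
	return nil
}

// Scoped acquisition: WithLock releases the lock context as soon as the
// closure returns, so the lock is held only for the duration of the writes.
func saveScoped(s *state) error {
	return storage.WithLock(s.lockManager, storage.LockInsertChunkDataPack, func(lctx lockctx.Context) error {
		// ... perform the guarded writes ...
		return nil
	})
}
```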
494707f to ceb03e1
```go
// BatchStore stores ChunkDataPack c keyed by its ChunkID in provided batch.
// No errors are expected during normal operation, but it may return generic error
// if entity is not serializable or Badger unexpectedly fails to process request
func (ch *ChunkDataPacks) BatchStore(c *flow.ChunkDataPack, rw storage.ReaderBatchWriter) error {
```
BatchStore is not used anywhere, so I removed it.
engine/execution/state/state.go
Outdated
```go
// the number of database interactions. This is a large batch of data, which might not be
// committed within a single operation (e.g. if using Badger DB as storage backend, which has
// a size limit for its transactions).
```
```diff
- // the number of database interactions. This is a large batch of data, which might not be
- // committed within a single operation (e.g. if using Badger DB as storage backend, which has
- // a size limit for its transactions).
+ // the number of database interactions.
```
Assuming this is only true for Badger, I think we could remove this comment now.
```go
// Rollback if an error occurs during batch operations
if err != nil {
	chunkIDs := make([]flow.Identifier, 0, len(chunks))
	for _, chunk := range chunks {
		chunkIDs = append(chunkIDs, chunk.ChunkID)
	}
	_ = s.chunkDataPacks.Remove(chunkIDs)
```
This more complex rollback logic arises from storing chunk data packs in a separate database, and needing to coordinate commits between the two databases.
Rollback error handling can silently swallow errors indicating data corruption
The error handling flow introduces the possibility of silently corrupting the database: if s.chunkDataPacks.Remove fails, the error is swallowed and the database will be left in an unexpected state. The node will continue operating with incorrect data, which we never want to allow.
(Even if there is no error, the "my receipt" and chunk data packs will not be committed atomically from the perspective of other reader threads. I'm not sure if this could be a problem in practice.)
Concurrency safety of rollback mechanism
I'm also not convinced this function won't produce inconsistent database states when invoked concurrently. For example, suppose threads A and B are executing concurrently and storing chunk data pack P and auxiliary information (events, transaction results, etc.) X:
A: write P
B: write P
A: errors while writing X, invoking rollback logic and deleting P
B: successfully writes X
At the end we will have an inconsistent state across the two databases:
- P does not exist in the database
- X does exist in the database
Both threads have an incorrect view of what happened:
- A expects X to not exist, but it does
- B expects P to exist, but it doesn't
Neither this function nor its callers document (from what I can tell) whether it is concurrency-safe, although it is ultimately called from an engine using multiple worker threads, so I think it at least could be called concurrently. There is also this TODO gesturing at the idea that the component should be concurrency-safe.
I'm wondering if we still need to have separate databases now that the main database uses Pebble. My understanding is that we split out chunk data packs primarily so that we could use Pebble's "real" delete feature to prune and save disk space. Now that the primary database uses Pebble, can we return to storing chunk data packs in the primary database?
While we have the two databases, and these resulting problems, I think we need to:
- make sure the current error handling behaviour is acceptable given how this component is being used
- document the expectations this component makes of its callers, what guarantees it does/doesn't make
- or update the behaviour/usage to better work around the atomicity/consistency issues
Good catch. I moved the locks up higher to cover the entire save-execution-results operation, which makes it concurrency-safe. Also added a TODO to work on the chunk data pack removal issue in the future.
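Roughly, the widened lock scope looks like the sketch below (assembled from the diff excerpts in this PR, details approximated): both locks are held across the writes to both databases, so two concurrent saves of the same block are fully serialized.

```go
// Sketch: acquire both locks for the whole save, covering the chunk-data-pack
// database write and the main database batch.
return storage.WithLocks(s.lockManager,
	[]string{storage.LockInsertOwnReceipt, storage.LockInsertChunkDataPack},
	func(lctx lockctx.Context) error {
		// 1. persist chunk data packs in their own (Pebble) database
		if err := s.chunkDataPacks.StoreByChunkID(lctx, chunks); err != nil {
			return fmt.Errorf("failed to persist chunk data packs: %w", err)
		}
		// 2. persist the remaining execution data in one batch in the main database;
		//    a rollback callback removes the chunk data packs if this batch fails.
		return s.db.WithReaderBatchWriter(func(batch storage.ReaderBatchWriter) error {
			// ... index events, transaction results, own receipt, end state ...
			return nil
		})
	})
```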
Addressed in #7983
storage/chunk_data_packs.go
Outdated
```go
if !bytes.Equal(c.Proof, other.Proof) {
	return ErrDataMismatch
}
if !c.SystemChunk && c.CollectionID != other.CollectionID {
```
```diff
- if !c.SystemChunk && c.CollectionID != other.CollectionID {
+ if c.CollectionID != other.CollectionID {
```
Shouldn't they still be equal (zero) when it is a system chunk?
3fef499 to e671f10
```go
// Store stores multiple ChunkDataPacks cs keyed by their ChunkIDs in a batch.
// No errors are expected during normal operation, but it may return generic error
Store(cs []*flow.ChunkDataPack) error
StoreByChunkID(lctx lockctx.Proof, cs []*flow.ChunkDataPack) error
```
I feel this API here breaks with existing API patterns:
- We already have ReaderBatchWriter to batch database interactions.
- This function accepts a lock proof but no ReaderBatchWriter, unlike the vast majority of other storage APIs in this package.
- My understanding of the documentation in the caller seems to be at odds with the API here:
flow-go/engine/execution/state/state.go
Lines 415 to 422 in 48d0c08
```go
err := s.chunkDataPacks.StoreByChunkID(lctx, chunks)
if err != nil {
	return fmt.Errorf("can not store multiple chunk data pack: %w", err)
}
// Save entire execution result (including all chunk data packs) within one batch to minimize
// the number of database interactions.
return s.db.WithReaderBatchWriter(func(batch storage.ReaderBatchWriter) error {
```
The caller talks about saving the "entire execution result (including all chunk data packs) within one batch to minimize [..] database interactions." But don't we save them independently??
May I guess why you chose a different API? I am hypothesizing that this is because chunk data packs and results are stored in different databases? So we are worried about the same ReaderBatchWriter being accidentally used across the databases? If my speculation is correct:
- for now, could you please diligently document this in the interface and implementation
- [beyond the scope of this PR], I suspect we need a clearer encapsulation of database and storage abstractions served by that database. Maybe we can introduce different packages, like a protocol_store package and an execution_store package. Each could have its own type ReaderBatchWriter ... just a random thought to illustrate my thinking (see the sketch below). Nothing to implement now, but probably good to keep in mind and think about solutions, because I suspect this is not the last case where we have to make the API inconsistent to avoid accidentally storing an object in the wrong database.
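For illustration only, a hypothetical sketch of that package split (package and function names invented here, not part of the PR): giving each database its own batch-writer type would let the compiler reject passing a batch from one database to a storage API of the other.

```go
// Hypothetical sketch, not the PR's code: distinct batch types per database.
package storesketch

// ProtocolBatch can only ever wrap a write batch of the protocol database.
type ProtocolBatch struct{ /* protocol DB batch handle */ }

// ExecutionBatch can only ever wrap a write batch of the execution database.
type ExecutionBatch struct{ /* execution DB batch handle */ }

// A protocol-store API accepts only a ProtocolBatch ...
func IndexOwnReceipt(batch *ProtocolBatch /* , receipt ... */) error { return nil }

// ... while an execution-store API accepts only an ExecutionBatch, so mixing
// the two databases becomes a compile-time error rather than a silent bug.
func StoreChunkDataPack(batch *ExecutionBatch /* , pack ... */) error { return nil }
```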
The Store function is refactored in #7983
```go
err := s.chunkDataPacks.StoreByChunkID(lctx, chunks)
if err != nil {
	return fmt.Errorf("can not store multiple chunk data pack: %w", err)
}

lctx := s.lockManager.NewContext()
defer lctx.Release()
err = lctx.AcquireLock(storage.LockInsertOwnReceipt)
if err != nil {
	return err
}
// Save entire execution result (including all chunk data packs) within one batch to minimize
// the number of database interactions.
```
I think this comment is not reflecting the current implementation:
Save entire execution result (including all chunk data packs) within one batch to minimize the number of database interactions.
because we are not storing the execution result and chunk data packs in the same operation ... maybe I am misunderstanding some detail.
[entirely optional]
If it's easy, can we change StoredChunkDataPack into an internal datatype of the storage/store package? Conceptually, this would seem cleaner to me. But I suspect it is not easy due to this 👇
flow-go/storage/chunk_data_packs.go
Line 32 in bd219c4
```go
// Its prime difference is instead of an actual collection, it keeps a collection ID hence relying on maintaining
```
If you agree, but it's too big of a change for this PR, maybe you could create an issue and add it to the epic [Protocol] Misc improvements and tech debt. Also, if you don't agree, just feel free to ignore (no detailed response needed). Thanks 🙂
```go
Store(cs []*flow.ChunkDataPack) error
StoreByChunkID(lctx lockctx.Proof, cs []*flow.ChunkDataPack) error

// Remove removes multiple ChunkDataPacks cs keyed by their ChunkIDs in a batch.
```
```diff
- // Remove removes multiple ChunkDataPacks cs keyed by their ChunkIDs in a batch.
+ // Remove multiple ChunkDataPacks with the given chunk IDs (in a batch).
```
This method is refactored in https://github.com/onflow/flow-go/pull/7983/files
```go
// Store stores multiple ChunkDataPacks cs keyed by their ChunkIDs in a batch.
// No errors are expected during normal operation, but it may return generic error
```
For every function taking a storage lock, please make sure you document the lock diligently (ideally with some context).
```diff
- // Store stores multiple ChunkDataPacks cs keyed by their ChunkIDs in a batch.
- // No errors are expected during normal operation, but it may return generic error
+ // StoreByChunkID persists multiple [flow.ChunkDataPack]s keyed by their ChunkIDs in a batch.
+ //
+ // CAUTION:
+ //   - The chunk ID is _not_ a collision resistant hash of the chunk data pack. For any given chunk, an Execution Node should
+ //     always provide the _same_ chunk data pack (slashable protocol violation otherwise). To prevent bugs in higher-level logic
+ //     from accidentally overwriting an existing chunk data pack with a different one, we must atomically read and write data here.
+ //     Hence, the caller must acquire the [storage.LockInsertChunkDataPack] and hold it until the database write has been committed.
+ //
+ // No error returns are expected during normal operation.
```
This method is refactored in https://github.com/onflow/flow-go/pull/7983/files
```go
// Equals returns true if and only if receiver BlockExecutionDataRoot is equal to the `other`.
func (b BlockExecutionDataRoot) Equals(other BlockExecutionDataRoot) bool {
	// Compare BlockID fields
```
From my perspective, this is trivially apparent and doesn't need documentation (... didn’t think I’d ever say this 😅 )
```diff
- // Compare BlockID fields
```
Applied in c37af9c
```go
	return false
}

// Compare ChunkExecutionDataIDs slices
```
```diff
- // Compare ChunkExecutionDataIDs slices
```
Applied in c37af9c
```go
LockInsertCollection = "lock_insert_collection"
// LockBootstrapping protects data that is *exclusively* written during bootstrapping.
LockBootstrapping = "lock_bootstrapping"
// LockInsertChunkDataPack protects the insertion of chunk data packs (not yet used anywhere
```
```diff
- // LockInsertChunkDataPack protects the insertion of chunk data packs (not yet used anywhere
+ // LockInsertChunkDataPack protects the insertion of chunk data packs
```
This lock will be removed
```go
// saveExecutionResults saves all data related to the execution of a block.
// It is concurrent-safe
func (s *state) saveExecutionResults(
```
It would be great if you could document the error returns of this function as part of this PR, please.
Added in #7983
```go
return storage.WithLocks(s.lockManager, []string{storage.LockInsertOwnReceipt, storage.LockInsertChunkDataPack}, func(lctx lockctx.Context) error {
	err := s.chunkDataPacks.StoreByChunkID(lctx, chunks)
	if err != nil {
		return fmt.Errorf("can not store multiple chunk data pack: %w", err)
```
I think this error message is somewhat vague. When I first read it, I mistakenly understood that the attempt to store multiple chunk data packs was the reason for the failure.
| return fmt.Errorf("can not store multiple chunk data pack: %w", err) | |
| return fmt.Errorf("failed to persist batch of chunk data packs for block %v: %w", blockID, err) |
Updated in #7983
```go
// Rollback if an error occurs during batch operations
// Chunk data packs are saved in a separate database, there is a chance
// that execution result was failed to save, but chunk data packs was saved and
// didnt get removed.
```
```diff
- // Rollback if an error occurs during batch operations
- // Chunk data packs are saved in a separate database, there is a chance
- // that execution result was failed to save, but chunk data packs was saved and
- // didnt get removed.
+ // Rollback if an error occurs during batch operations. This is critical for data consistency,
+ // as chunk data packs and results are saved in separate databases. If we reach this code, storing chunk
+ // data packs has already succeeded, but persisting the execution result is failing. In this case, the
+ // chunk data packs that were previously saved need to be removed again to maintain data consistency.
```
Updated in #7983
```go
// TODO(leo): when retrieving chunk data packs, we need to add a check to ensure the block
// has been executed before returning chunk data packs
```
This is a subtle dependency that the logic for retrieving chunk data packs has on the logic storing the chunk data packs. In my experience, those dependencies are easily broken over time. I can totally see an engineer thinking: "why do I need to check? If the chunk data pack exists, the chunk clearly has been executed"
I think we can address this. From my perspective, the issue is that:
- ENs persist the information of whether they have executed a block in the protocol database (assumption on my end)
- in contrast, the mapping from chunk ID to chunk data pack is stored in the execution database.
Now imagine the following approach:
- In the protocol database, the EN persists:
  - the information whether it has executed a block (as before)
  - an index from chunk ID to chunk-data-pack hash
- the execution database only contains the chunk data packs, stored by their respective hashes.
With this approach, all queries for a chunk data pack by its chunk ID always go through the protocol database. (Alternatively, we could move all of the "my execution-result-related data", including "my receipts", to the execution database, so that only the execution database needs to be queried and it either has all or none of the information pertaining to some result.) Thereby we have complete atomicity in all practically relevant scenarios, including storage failures.
In this context, I want to emphasize again the paradigm of high-assurance engineering, which I previously brought up on slack:
- For high-assurance systems, if the caller is ignorant, the API should ideally error making the caller crash
- What we want to avoid is an API design, where an ignorant caller (forgetting to check whether the result exists, and just assuming it exists if the API call returns) will not notice and likely behave incorrectly.
Note that we are explicitly thinking about an ignorant caller in the spirit of defensive programming.
Hope that makes sense. If that explanation is too brief, don't hesitate to let me know and we can chat in person. If you agree, but it's too big of a change for this PR, maybe you could create an issue and add it to the epic [Protocol] Misc improvements and tech debt. Thanks 🙂
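If it helps, a hypothetical sketch of the retrieval path under that proposal (helper names invented here, not code from this PR): the protocol database owns the chunkID-to-chunk-data-pack-hash index, the execution database stores packs by hash, and a missing index entry surfaces as an error instead of silently returning a pack for an unexecuted block.

```go
// Hypothetical sketch of the proposed lookup flow.
func (s *store) ChunkDataPackByChunkID(chunkID flow.Identifier) (*flow.ChunkDataPack, error) {
	// 1. consult the protocol database's index (written atomically with the "block executed" info)
	packHash, err := s.protocolDB.LookupChunkDataPackHash(chunkID) // assumed helper
	if err != nil {
		// includes the not-found case: this node never executed the corresponding block
		return nil, fmt.Errorf("no chunk data pack indexed for chunk %v: %w", chunkID, err)
	}
	// 2. fetch the pack itself from the execution database, keyed by its hash
	return s.executionDB.ChunkDataPackByHash(packHash) // assumed helper
}
```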
Updated in 7fa1ef6
```go
if err != nil {
	return fmt.Errorf("can not store multiple chunk data pack: %w", err)
}
// Acquire both locks to ensure it's concurrent safe when inserting the execution results and chunk data packs.
```
```diff
- // Acquire both locks to ensure it's concurrent safe when inserting the execution results and chunk data packs.
+ // Acquire both locks because we want to persist the execution results *and* chunk data packs atomically in one write batch.
```
They're in different databases, so we can't commit them atomically in one write batch.
Will be refactored in #7983
Looks good.
```go
// Save entire execution result (including all chunk data packs) within one batch to minimize
// the number of database interactions.
return s.db.WithReaderBatchWriter(func(batch storage.ReaderBatchWriter) error {
	batch.AddCallback(func(err error) {
```
For better logical flow, I would move this to the end, right before line 477. Then, the logical flow reads:
- we index the events
- then we index the service events,
- ...
- at the end, we index the end state
- and unless all those operations ☝️ succeed, we remove the chunk data packs
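A rough sketch of that reordering (simplified; the method names are taken from the diff excerpts in this PR, the rest is approximated):

```go
// Sketch of the suggested ordering inside the main-database batch.
return s.db.WithReaderBatchWriter(func(batch storage.ReaderBatchWriter) error {
	// index the events
	if err := s.events.BatchStore(blockID, []flow.EventsList{result.AllEvents()}, batch); err != nil {
		return err
	}
	// index the service events, transaction results, ..., and finally the end state (omitted)

	// Only register the rollback at the very end: unless all operations above succeed,
	// remove the chunk data packs that were already written to the other database.
	batch.AddCallback(func(err error) {
		if err != nil {
			chunkIDs := make([]flow.Identifier, 0, len(chunks))
			for _, chunk := range chunks {
				chunkIDs = append(chunkIDs, chunk.ChunkID)
			}
			_ = s.chunkDataPacks.Remove(chunkIDs) // best-effort rollback
		}
	})
	return nil
})
```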
```go
return s.db.WithReaderBatchWriter(func(batch storage.ReaderBatchWriter) error {
	batch.AddCallback(func(err error) {
		// Rollback if an error occurs during batch operations
err = s.events.BatchStore(blockID, []flow.EventsList{result.AllEvents()}, batch)
```
All of the storage operations in lines 439 - 475 are indexing operations, where there is a risk of overwrites. As discussed in our meeting Continue storage API discussion (Friday, Sep 26), all of those should contain deduplication checks and hence locks.
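For context, a deduplication check of that kind could look roughly like the following (hypothetical helper names, assumed ReaderBatchWriter methods; only meant to illustrate the pattern mentioned above):

```go
// Hypothetical sketch: under the relevant lock, only write an index entry if it is
// absent, and reject attempts to overwrite it with a different value.
func indexChunkDataPackHash(lctx lockctx.Proof, rw storage.ReaderBatchWriter, chunkID, packHash flow.Identifier) error {
	var existing flow.Identifier
	err := lookupChunkDataPackHash(rw.GlobalReader(), chunkID, &existing) // assumed lookup helper
	if err == nil {
		if existing != packHash {
			return fmt.Errorf("chunk %v already indexed with a different value: %w", chunkID, storage.ErrDataMismatch)
		}
		return nil // identical value already indexed: nothing to do
	}
	if !errors.Is(err, storage.ErrNotFound) {
		return fmt.Errorf("checking existing index entry for chunk %v failed: %w", chunkID, err)
	}
	return insertChunkDataPackHash(rw.Writer(), chunkID, packHash) // assumed insert helper
}
```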
yes. This will be done in separate PRs.
```go
// StoreByChunkID stores multiple ChunkDataPacks cs keyed by their ChunkIDs in a batch.
// No errors are expected during normal operation, but it may return generic error
```
```diff
- // StoreByChunkID stores multiple ChunkDataPacks cs keyed by their ChunkIDs in a batch.
- // No errors are expected during normal operation, but it may return generic error
+ // StoreByChunkID persists multiple [flow.ChunkDataPack]s keyed by their ChunkIDs in a batch.
+ //
+ // CAUTION:
+ //   - The chunk ID is _not_ a collision resistant hash of the chunk data pack. For any given chunk, an Execution Node should
+ //     always provide the _same_ chunk data pack (slashable protocol violation otherwise). To prevent bugs in higher-level logic
+ //     from accidentally overwriting an existing chunk data pack with a different one, we must atomically read and write data here.
+ //     Hence, the caller must acquire the [storage.LockInsertChunkDataPack] and hold it until the database write has been committed.
+ //
+ // No error returns are expected during normal operation.
```
```go
	return sc
}

func (c StoredChunkDataPack) Equals(other StoredChunkDataPack) error {
```
⚠️ breaking with coding conventions
- We are returning a sentinel error here, which is used here. No error documentation is provided. This is violating our coding conventions.
- I find that this signature is breaking with widely-used patterns: Equals typically returns a boolean.
Options:
- change the return type to boolean
- if you want a convenience function that returns an error, I would suggest naming it EnsureEquals
```diff
- func (c StoredChunkDataPack) Equals(other StoredChunkDataPack) error {
+ // EnsureEquals passes if and only if `other` is equal to the receiver.
+ // Otherwise, a `storage.ErrDataMismatch` is returned.
+ func (c StoredChunkDataPack) EnsureEquals(other StoredChunkDataPack) error {
+ 	if !c.Equals(other) {
+ 		return ErrDataMismatch
+ 	}
+ 	return nil
+ }
+
+ // Equals returns true if and only if `other` is equal to the receiver.
+ func (c StoredChunkDataPack) Equals(other StoredChunkDataPack) bool {
```
I see the reason for using error here: conceptually we want to return a boolean and some context to include in an error message if necessary. Using a boolean only strips all context from the resulting error.
The problem with error is that it usually means something else. What about returning bool, string? If you return false, you can optionally include some extra context in the string. Otherwise it can be ignored.
```go
err := c.Equals(existing)
if err != nil {
```
⚠️
We are interpreting all errors without further differentiation as mismatch -- without any supporting documentation of Equals. This is violating our coding conventions.
I would suggest changing Equals to the established signature returning a boolean.
Agree with improving docs.
The reason I return an error instead of a boolean is so that the error can include the actual field that is mismatching, including both values, which is useful for debugging.
Otherwise, a boolean would only tell us that they are different without telling us why. We might miss the chance to print the difference if we didn't save both values, and even if we print both values, it can be hard to find the actual mismatching field, especially for something like a chunk data pack, which is usually very large.
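As a sketch of that trade-off (field names taken from the diff above, everything else approximate), an error-returning comparison can wrap the sentinel while naming the first mismatching field and both values:

```go
// Sketch: keep ErrDataMismatch as the sentinel (so errors.Is still works),
// but wrap it with the field that differs to aid debugging.
func (c StoredChunkDataPack) EnsureEquals(other StoredChunkDataPack) error {
	if !bytes.Equal(c.Proof, other.Proof) {
		return fmt.Errorf("proof mismatch (%x != %x): %w", c.Proof, other.Proof, ErrDataMismatch)
	}
	if c.CollectionID != other.CollectionID {
		return fmt.Errorf("collection ID mismatch (%v != %v): %w", c.CollectionID, other.CollectionID, ErrDataMismatch)
	}
	// ... remaining fields ...
	return nil
}
```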
```diff
  func TestChunkDataPacks_StoreTwice(t *testing.T) {
- 	WithChunkDataPacks(t, 2, func(t *testing.T, chunkDataPacks []*flow.ChunkDataPack, chunkDataPackStore *store.ChunkDataPacks, pdb *pebble.DB) {
+ 	WithChunkDataPacks(t, 2, func(t *testing.T, chunkDataPacks []*flow.ChunkDataPack, chunkDataPackStore *store.ChunkDataPacks, pdb *pebble.DB, lockManager storage.LockManager) {
  		db := pebbleimpl.ToDB(pdb)
```
I would move this line into WithChunkDataPacks
Done in #7939
Co-authored-by: Alexander Hentschel <[email protected]>
Co-authored-by: Jordan Schalm <[email protected]> Co-authored-by: Alexander Hentschel <[email protected]>
Work towards #7912
storage.WithLock can be reused in the future.