
Conversation

zhangchiqing
Member

@zhangchiqing zhangchiqing commented Oct 8, 2025

Work towards #7912

  • Made BatchInsertTransactionResultErrorMessage and BatchIndexTransactionResultErrorMessage private and merged them into one function, BatchInsertAndIndexTransactionResultErrorMessage, so that the existence check can be included there. A sketch of the resulting shape follows below.
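For illustration, a rough sketch of the merged function's shape. The existence-check helper (keyExists) and the exact parameter list are assumptions made for this sketch, not the verified code in this PR; the insert/index helper names match the ones quoted later in the review.

// Sketch only: keyExists is a hypothetical existence lookup; signatures are illustrative.
func BatchInsertAndIndexTransactionResultErrorMessage(
	r storage.Reader,
	w storage.Writer,
	blockID flow.Identifier,
	result *flow.TransactionResultErrorMessage,
) error {
	exists, err := keyExists(r, blockID, result.TransactionID) // hypothetical helper
	if err != nil {
		return fmt.Errorf("could not check for existing error message: %w", err)
	}
	if exists {
		// callers can treat this as a no-op via storage.SkipAlreadyExistsError
		return storage.ErrAlreadyExists
	}
	if err := insertTransactionResultErrorMessageByTxID(w, blockID, result); err != nil {
		return fmt.Errorf("cannot batch insert tx result error message: %w", err)
	}
	return indexTransactionResultErrorMessageBlockIDTxIndex(w, blockID, result)
}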


if len(results) > 0 {
if err := r.persistedResults.BatchStore(r.blockID, results, batch); err != nil {
// Use storage.WithLock to acquire the necessary lock and store the results
Member Author

@peterargue could you confirm this behavior?

I'm not sure whether, in the past, we let the AN overwrite any data during recovery or backfilling.

But the current behavior is that it will never overwrite. Is that OK?

Contributor

we generally do not allow rewriting, but we should make attempts to index the most recently indexed block idempotent. I think that is/should be handled by the indexer though, not storage.

if err := t.persistedTxResultErrMsg.BatchStore(t.blockID, txResultErrMsgs, batch); err != nil {
return fmt.Errorf("could not add transaction result error messages to batch: %w", err)
}
err = storage.SkipAlreadyExistsError( // Note: if the data already exists, we will not overwrite
Member Author

@peterargue could you confirm this behavior?

I'm not sure whether, in the past, we let the AN overwrite any data during recovery or backfilling.

But the current behavior is that it will never overwrite. Is that OK?

Contributor

we explicitly check before looking up the data, so the intention is that we will skip existing entries.
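For readers unfamiliar with the wrapper in the snippet above: presumably storage.SkipAlreadyExistsError simply turns storage.ErrAlreadyExists into a no-op. A minimal sketch of that assumption (not the verified implementation):

// Assumed behavior: swallow ErrAlreadyExists, pass every other error through unchanged.
func skipAlreadyExistsError(err error) error {
	if errors.Is(err, storage.ErrAlreadyExists) {
		return nil
	}
	return err
}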

if len(results) > 0 {
if err := r.persistedResults.BatchStore(r.blockID, results, batch); err != nil {
// Use storage.WithLock to acquire the necessary lock and store the results
err := storage.WithLock(r.lockManager, storage.LockInsertLightTransactionResult, func(lctx lockctx.Context) error {
Contributor

won't this release the lock after it returns? shouldn't it hold the lock for the entire batch?
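To make the concern concrete, a sketch of the alternative being suggested. It assumes the batch can be committed inside the locked section and that BatchStore accepts the lock proof as its first argument; both are assumptions for illustration, not verified against this PR.

// Illustrative only: hold the lock until the batch commit returns, so the existence
// check, the staged writes, and the commit are atomic w.r.t. other writers of this table.
err := storage.WithLock(r.lockManager, storage.LockInsertLightTransactionResult, func(lctx lockctx.Context) error {
	if err := r.persistedResults.BatchStore(lctx, r.blockID, results, batch); err != nil {
		return fmt.Errorf("could not stage results: %w", err)
	}
	return batch.Commit() // hypothetical: commit while the lock is still held
})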

return fmt.Errorf("could not add transaction result error messages to batch: %w", err)
}
err = storage.SkipAlreadyExistsError( // Note: if the data already exists, we will not overwrite
storage.WithLock(t.lockManager, storage.LockInsertTransactionResultErrMessage, func(lctx lockctx.Context) error {
Contributor

same here.


return err
err := storage.WithLocks(p.lockManager, []string{
storage.LockInsertCollection,
storage.LockInsertLightTransactionResult,
Member Author

I don't like this usage here: the block persister has to know which locks need to be acquired. The block persister is supposed to be ignorant of which DB operations the underlying individual persisters run, and therefore of which locks to acquire; it is only supposed to create a batch object and ensure it is committed.

Maybe we can revisit this when working on #7910. My idea is that we could let BatchStore return the required lock ID together with the functor, so that the caller doesn't need to remember which lock to acquire.
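A purely hypothetical sketch of that idea, just to make the proposal concrete (none of these types exist in the codebase):

// Hypothetical shape: each persister returns the lock it needs alongside the
// functor, so the block persister never hard-codes concrete lock IDs.
type lockedBatchStore struct {
	lockID string // e.g. storage.LockInsertLightTransactionResult
	store  func(lctx lockctx.Context, batch storage.ReaderBatchWriter) error
}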

Member

I agree it's slightly awkward, but I think returning the expected lock alongside the functor could cause other problems:

  • what if many functors return the same lock to the ignorant caller? The caller would need to be able to ensure it acquires each lock only once
  • if there are many different locks to acquire, we still need to carefully order persisterStores to make sure the locks are acquired in the right order

Fundamentally, the layer at which locks are acquired does need to know something about what locks are being acquired. I think just having the upper layer explicitly acquire the needed locks is the simplest way to deal with this pattern.
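For concreteness, a sketch of what the "ignorant" caller would still have to do if persisters returned their locks, which is roughly the two problems listed above (hypothetical code, not part of this PR):

// Hypothetical: dedupe and order the returned lock IDs before acquiring them.
seen := make(map[string]struct{})
var lockIDs []string
for _, ps := range persisterStores { // assumed slice of lock-returning persisters
	if _, ok := seen[ps.lockID]; ok {
		continue // each lock may only be acquired once per context
	}
	seen[ps.lockID] = struct{}{}
	lockIDs = append(lockIDs, ps.lockID)
}
sort.Strings(lockIDs) // a fixed global order is still needed to avoid deadlocks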

Member

@AlexHentschel AlexHentschel Oct 11, 2025

[Leo] I don't like this usage here, that the block persister has to know what locks needs to be acquired.

my 10 cents on this conversation:

  • On the one hand, acquiring the locks one by one looks verbose. But I don't think this is a problem of the lock-proof pattern. Here is my reasoning:

    • The current BlockPersister implementation already very precisely documents that it is for persisting an execution result

      // BlockPersister stores execution data for a single execution result into the database.
      Therefore, it is not surprising that BlockPersister also has to know which locks to acquire (everything that persisting a result requires)

    • With the move to Pebble, the lower-level storage layer no longer has the ability to self-sufficiently protect against illegal data changes. Now the business logic has to be involved by acquiring and holding locks.

      Components that require different locks simply don't satisfy the same API anymore. Higher-level business logic has to be aware of which locks to acquire; that's simply a consequence of the storage layer no longer having snapshot isolation for reads + writes.

  • From my perspective, you can try to hide the reality that the type of lock that must be held is conceptually part of the interface (even though the compiler only enforces that some lock proof is given and is oblivious to which lock it is). I think you thereby inadvertently create problems in other parts that then no longer have the information they need (a sketch follows after this list):

    • order of locks cannot be guaranteed if no single party is responsible for acquiring the locks in one place
    • more complicated implementation of checks, when locks are no longer required to be held at the time of call, but only at the time the batch is committed
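A sketch of the pattern being described: the compiler only sees that some proof was passed, so the storage method itself has to verify the right lock is held. The receiver and signature below are assumed shapes for illustration, not the exact flow-go code:

// Sketch: verify the specific lock before staging any writes.
func (t *TransactionResultErrorMessages) BatchStore(
	lctx lockctx.Proof,
	blockID flow.Identifier,
	msgs []flow.TransactionResultErrorMessage,
	batch storage.ReaderBatchWriter,
) error {
	if !lctx.HoldsLock(storage.LockInsertTransactionResultErrMessage) {
		return fmt.Errorf("missing required lock %q", storage.LockInsertTransactionResultErrMessage)
	}
	// ... existence check and staged writes go here ...
	return nil
}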

@zhangchiqing zhangchiqing marked this pull request as ready for review October 8, 2025 20:04
@zhangchiqing zhangchiqing requested a review from a team as a code owner October 8, 2025 20:04
Comment on lines 46 to 47
// No errors are expected during normal operation, but it may return generic error
// if badger fails to process request
Member

still have "badger" in comment

Suggested change
// No errors are expected during normal operation, but it may return generic error
// if badger fails to process request
// No error returns are expected during normal operations.

Comment on lines 87 to 95
err := insertLightTransactionResult(w, blockID, &result)
if err != nil {
	return fmt.Errorf("cannot batch insert light tx result: %w", err)
}

err = indexLightTransactionResultByBlockIDAndTxIndex(w, blockID, uint32(i), &result)
if err != nil {
	return fmt.Errorf("cannot batch index light tx result: %w", err)
}
Member

I would suggest just inlining the UpsertByKey calls and the respective lines of documentation here.

Comment on lines 161 to 169
err := insertTransactionResultErrorMessageByTxID(w, blockID, &result)
if err != nil {
	return fmt.Errorf("cannot batch insert tx result error message: %w", err)
}

err = indexTransactionResultErrorMessageBlockIDTxIndex(w, blockID, &result)
if err != nil {
	return fmt.Errorf("cannot batch index tx result error message: %w", err)
}
Member

Also here, I would suggest inlining insertTransactionResultErrorMessageByTxID and indexTransactionResultErrorMessageBlockIDTxIndex (including the GoDoc, unless it is trivially apparent).

LockBootstrapping,
LockInsertChunkDataPack,
LockInsertTransactionResultErrMessage,
LockInsertLightTransactionResult,
Member

Whether we persist the transaction result in its light representation or some other representation is, in my opinion, irrelevant to the name of the lock. I would consider "light" an implementation detail of the storage layer and hence suggest removing this word from the lock name:

Suggested change
LockInsertLightTransactionResult,
LockInsertTransactionResult,

Member Author

I'd like to keep it as is. IMO, the lock is table-based: each table should have its own lock in order to synchronize writes to that table. Different tables technically don't use the same lock; if they do, the synchronization might be done in the application layer instead of the storage layer.

Comment on lines +34 to +35
// LockInsertLightTransactionResult protects the insertion of light transaction results
LockInsertLightTransactionResult = "lock_insert_light_transaction_result"
Member

Follow-up on removing the word "light":

Suggested change
// LockInsertLightTransactionResult protects the insertion of light transaction results
LockInsertLightTransactionResult = "lock_insert_light_transaction_result"
// LockInsertTransactionResult protects the insertion of transaction results
LockInsertTransactionResult = "lock_insert_transaction_result"

if err := e.persistedEvents.BatchStore(e.blockID, []flow.EventsList{e.data}, batch); err != nil {
err := e.persistedEvents.BatchStore(e.blockID, []flow.EventsList{e.data}, batch)
if err != nil {
if errors.Is(err, storage.ErrAlreadyExists) {
Member

Question

I assume this is anticipating future changes? If I am not mistaken, at the moment, Events.BatchStore does not error. It just ignorantly overwrites the data.

In addition, the storage API is lacking error documentation, so I added the following TODO:

// BatchStore will store events for the given block ID in a given batch
// TODO: error documentation
BatchStore(blockID flow.Identifier, events []flow.EventsList, batch ReaderBatchWriter) error

Overall, I think we need to come back to the events and add overwrite protection checks. After adding these checks, the code in the EventsStore here will probably be correct. Created issue #8034 ... maybe worthwhile to link the issue here in the code.
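A rough sketch of what such an overwrite-protection check could look like once #8034 is addressed. The helper names existsEventsForBlock and storeEvents are assumptions for this sketch only:

// Hypothetical guard, mirroring the existence checks added for transaction results.
func batchStoreEventsGuarded(r storage.Reader, w storage.Writer, blockID flow.Identifier, events []flow.EventsList) error {
	exists, err := existsEventsForBlock(r, blockID) // assumed existence lookup
	if err != nil {
		return fmt.Errorf("could not check for existing events: %w", err)
	}
	if exists {
		return storage.ErrAlreadyExists // lets EventsStore (above) skip instead of overwriting
	}
	return storeEvents(w, blockID, events) // assumed write helper
}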

Member Author

Yes, the refactor of the implementation has been addressed in this PR:
https://github.com/onflow/flow-go/pull/8005/files#diff-db1f3e68113391168d7f6bf516cfc5b9c2ac5c0d0f106751171f48b4da7d7678R19

So the error handling is added here first.

// Mock the storage of the fetched error messages into the protocol database.
s.txErrorMessages.On("Store", blockId, expectedStoreTxErrorMessages).
s.txErrorMessages.On("Store", mock.Anything, blockId, expectedStoreTxErrorMessages).
Return(nil).Once()
Member

should check that the required lock is held here for any TransactionResultErrorMessages.Store

Suggested change
Return(nil).Once()
Return(func(lctx lockctx.Proof, blockID flow.Identifier, transactionResultErrorMessages []flow.TransactionResultErrorMessage) error {
	require.True(s.T(), lctx.HoldsLock(storage.LockInsertTransactionResultErrMessage))
	return nil
}).Once()

expectedStoreTxErrorMessages := createExpectedTxErrorMessages(resultsByBlockID, s.enNodeIDs.NodeIDs()[0])
s.txErrorMessages.On("Store", blockId, expectedStoreTxErrorMessages).
s.txErrorMessages.On("Store", mock.Anything, blockId, expectedStoreTxErrorMessages).
Return(fmt.Errorf("storage error")).Once()
Member

Suggested change
Return(fmt.Errorf("storage error")).Once()
Return(func(lctx lockctx.Proof, blockID flow.Identifier, transactionResultErrorMessages []flow.TransactionResultErrorMessage) error {
	require.True(s.T(), lctx.HoldsLock(storage.LockInsertTransactionResultErrMessage))
	return fmt.Errorf("storage error")
}).Once()

Comment on lines 255 to 257
Run(func(args mock.Arguments) {
// Ensure the test does not complete its work faster than necessary
wg.Done()
Member

let's check that the correct lock is held:

Suggested change
Run(func(args mock.Arguments) {
	// Ensure the test does not complete its work faster than necessary
	wg.Done()
Run(func(args mock.Arguments) {
	lctx, ok := args[0].(lockctx.Proof)
	require.True(s.T(), ok, "expecting lock proof, but cast failed")
	require.True(s.T(), lctx.HoldsLock(storage.LockInsertTransactionResultErrMessage))
	wg.Done() // Ensure the test does not complete its work faster than necessary

Comment on lines 449 to 451
c.persistentCollections.On("BatchStoreAndIndexByTransaction", mock.Anything, mock.Anything, mock.Anything, mock.Anything).Return(nil, nil)
c.persistentResults.On("BatchStore", mock.Anything, mock.Anything, blockID, indexerData.Results).Return(nil)
c.persistentTxResultErrMsg.On("BatchStore", mock.Anything, mock.Anything, blockID, core.workingData.txResultErrMsgsData).Return(nil)
Member

I assume most of the newly added mock.Anything arguments here are for lockctx.Proof(?). Could you add checks that the respective locks are held, please? Thanks.
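One way to do that, mirroring the Run-based checks suggested earlier. This is a sketch only: it assumes the lock proof is the first argument of the new BatchStore signature and that a *testing.T is in scope as t.

// Sketch: verify the proof passed to the mocked BatchStore holds the expected lock.
c.persistentResults.On("BatchStore", mock.Anything, mock.Anything, blockID, indexerData.Results).
	Run(func(args mock.Arguments) {
		lctx, ok := args[0].(lockctx.Proof)
		require.True(t, ok, "expecting lock proof as first argument")
		require.True(t, lctx.HoldsLock(storage.LockInsertLightTransactionResult))
	}).
	Return(nil)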

@zhangchiqing zhangchiqing added this pull request to the merge queue Oct 15, 2025
Merged via the queue into master with commit a745d4e Oct 15, 2025
108 of 114 checks passed
@zhangchiqing zhangchiqing deleted the leo/refactor-index-tx-err-msg branch October 15, 2025 17:32