
Conversation

@martintomazic
Contributor

@martintomazic martintomazic commented Aug 19, 2025

Motivation:
Whilst working on #6239, it became clear that the runtime state sync could be optimized. More importantly, the worker in its current form is very hard to read, maintain, and reason about.

What was done:
This PR addresses some of the existing issues, such as panicking and unlikely goroutine leaks upon termination/cleanup. More importantly, it sets the stage for further refactors.

Please consider this an incremental improvement. The code after this refactor is still far from optimal in terms of readability and maintainability.

Follow-up:

  1. Make diff sync an independent worker that can be tested and maintained in isolation.
  2. Optimize state sync (if possible and sensible).

Update: I combined the two follow-ups in #6242.

@netlify

netlify bot commented Aug 19, 2025

Deploy Preview for oasisprotocol-oasis-core canceled.

🔨 Latest commit: dca1f4b
🔍 Latest deploy log: https://app.netlify.com/projects/oasisprotocol-oasis-core/deploys/68a5bf210e24a50008ea797c

@martintomazic martintomazic force-pushed the martin/trivial/state-sync-refactor-1 branch from bd53717 to 329dfe3 on August 19, 2025 12:23
Comment on lines 362 to 363
// The request relies on the default timeout of the underlying p2p protocol clients.
//
Contributor Author

How about we move this constant out of the p2p protocol client and make the timeout the responsibility of the client?

Collaborator

Which constant? This comment doesn't belong here, as a method should not rely on the internal implementation of its parameters, only on their interface.

Contributor Author

rpc.WithMaxPeerResponseTime(MaxGetDiffResponseTime),

I suggest removing MaxGetDiffResponseTime from the p2p package and making it the client's responsibility to define the context timeout. This is how it is currently done for fetching chunks and is imo also idiomatic/correct.

Then I can remove all the comments about this constant that you correctly pointed out were off?

Collaborator

Yes, I guess you could remove it. But the current solution is also fine in general; the p2p layer could define its own timeout and state sync could lower it with a context if needed.

Talking about the internal implementation of another struct in a comment is probably not the best, so I would remove it. Or instead of the comment, I would use a context with a deadline and a similar timeout.
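For illustration, the caller-owned deadline could look roughly like this (getDiffTimeout and the wrapper are placeholders, not part of the codebase):

// getDiffTimeout is an illustrative value; the point is that the caller, not
// the p2p protocol client, decides how long a single request may take.
const getDiffTimeout = 15 * time.Second

// callWithGetDiffTimeout runs one GetDiff-style request under a bounded context,
// so the p2p package no longer needs to export MaxGetDiffResponseTime.
func callWithGetDiffTimeout(ctx context.Context, do func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(ctx, getDiffTimeout)
	defer cancel()
	return do(ctx)
}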

Also rename node to worker, to avoid confusion.

Ideally, the parent package (storage) would have runtime
as a prefix to make it clearer this is a runtime worker.
@martintomazic martintomazic force-pushed the martin/trivial/state-sync-refactor-1 branch from 329dfe3 to 58d1c6a on August 19, 2025 12:34
@codecov

codecov bot commented Aug 19, 2025

Codecov Report

❌ Patch coverage is 78.75000% with 170 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.88%. Comparing base (26b367e) to head (58d1c6a).

Files with missing lines Patch % Lines
go/worker/storage/statesync/state_sync.go 80.55% 45 Missing and 18 partials ⚠️
go/worker/storage/statesync/checkpointer.go 65.89% 34 Missing and 10 partials ⚠️
go/worker/storage/statesync/diff_sync.go 87.79% 25 Missing and 6 partials ⚠️
go/worker/storage/statesync/checkpoint_sync.go 59.52% 12 Missing and 5 partials ⚠️
go/worker/storage/statesync/prune.go 36.36% 13 Missing and 1 partial ⚠️
go/oasis-node/cmd/node/node_control.go 66.66% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6299      +/-   ##
==========================================
+ Coverage   64.63%   64.88%   +0.25%     
==========================================
  Files         696      699       +3     
  Lines       67803    67765      -38     
==========================================
+ Hits        43824    43969     +145     
+ Misses      19013    18778     -235     
- Partials     4966     5018      +52     

☔ View full report in Codecov by Sentry.

@martintomazic martintomazic marked this pull request as ready for review August 19, 2025 13:03
The logic was preserved; the only things that changed are that the
context is passed explicitly and the worker for creating checkpoints
was renamed.
In addition, the state sync worker should return an error and it
should be the caller's responsibility to act accordingly. See e.g.
new workers such as the stateless client.

Note that the semantics changed slightly: previously the storage
worker would wait for all state sync workers to finish. Now it
terminates when the first one finishes. Notice that this was not
100% true before either, as a state sync worker could panic (which
would in that case shut down the whole node).
Probably the timeout should be the client's responsibility.
Additionally, observe that the parent (storage worker) is
registered as a background service, thus upon an error inside the
state sync worker there is no need to manually request the node
shutdown.
The code was broken into smaller functions. Also the scope of
variables (including channels) has been reduced.

Semantics as well as performance should stay the same.
The logic was preserved. Ideally, diff sync would only accept a
context, a local storage backend, and a client/interface to fetch
diffs. This would make it testable in isolation.

Finally, the use of the undefined round should be moved out of it.
Previously, if the worker returned an error it would exit the main
for loop and wait for the wait group to be emptied. However, this
is not possible as there is no one reading the fetched diffs.
In case of termination due to an error exiting the main for loop or
a canceled context, there is no point in waiting for the goroutines
to finish fetching/doing the cleanup. As long as we cancel their
context and use it properly in the select statements, this should
be safe and better.
@martintomazic martintomazic force-pushed the martin/trivial/state-sync-refactor-1 branch from 58d1c6a to dca1f4b on August 20, 2025 12:27
// 5. Registering node availability when it has synced sufficiently close to
// the latest known block header.
//
// Suggestion: This worker should not be responsible for creating and advertising p2p related stuff.
Collaborator

I would remove this since it’s already obvious from a "good programming" perspective. If this is a TODO, then the comment should be a bit different.

}

n.logger.Info("starting committee node")
w.logger.Info("starting state sycne worker")
Collaborator

Suggested change
w.logger.Info("starting state sycne worker")
w.logger.Info("starting state sync worker")

I prefer only starting, as the logger should in general contain the name of the worker.
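For context, a sketch of what is meant: if the logger is created with the worker's module name, the message itself can stay short (the module string here is illustrative; logging refers to the common logging package):

// The logger name already identifies the worker, so the log line can be just "starting".
var logger = logging.GetLogger("worker/storage/statesync")

func logStart() {
	logger.Info("starting")
}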

}

// Start storage node for every runtime.
// Start state sync worker for every runtime.
Collaborator

Can you improve this comment, as it doesn't make sense since we only register the runtime.

}
}

// Suggestion: Limit the max time for restoring checkpoint.
Collaborator

Isn't this a TODO and could be done now?

Contributor Author

I thought about this, yes.

Any idea what would be a good context timeout, so that we don't shoot ourselves in the foot if the state becomes bigger and bigger and restoration thus takes longer...

I would definitely prefer not to make this configurable, but we should leave sufficient extra time. I think it currently takes around 10-20 min on my machine to restore from the checkpoint...

Collaborator

I have no idea. You can start with a loose timeout and we can tighten it later.
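As a sketch of the loose-timeout idea (the constant and helper name are illustrative, not agreed values):

// checkpointRestoreTimeout is deliberately generous: restoring currently takes
// on the order of 10-20 minutes, and the state will keep growing.
const checkpointRestoreTimeout = 2 * time.Hour

// restoreWithTimeout bounds checkpoint restoration with a loose deadline that
// can be tightened later.
func restoreWithTimeout(ctx context.Context, restore func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(ctx, checkpointRestoreTimeout)
	defer cancel()
	return restore(ctx)
}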

"current_round", blk.Header.Round,
)
panic("can't get block in storage worker")
return fmt.Errorf("getting block for round %d (current round: %d): %w", i, blk.Header.Round, err)
Collaborator

Should this be failed to ...

func (w *Worker) worker() { // nolint: gocyclo
defer close(w.quitCh)
// Run runs state sync worker.
func (w *Worker) Run(ctx context.Context) error { // nolint: gocyclo
Collaborator

Rename to Serve to implement Service interface.

for _, r := range w.runtimes {
<-r.Quit()
}
_ = w.Serve() // error logged as part of Serve already.
Collaborator

You should move everything from Start and Stop to Serve, so that we can later replace the service manager with one that accepts Services.
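To illustrate the direction (the interface shape below is an assumption, not necessarily the exact one the new service manager will use):

// Service sketches the kind of interface a context-aware service manager could accept.
type Service interface {
	// Name returns the service name.
	Name() string
	// Serve runs the service until the context is canceled or a fatal error occurs.
	Serve(ctx context.Context) error
}

Serve would then absorb what Start, Stop and the old worker() did, and its return value becomes the quit/error signal the caller acts on.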

Contributor Author

I see. I focused on making the common committee node stateless with regards to context; I will also do explicit context passing for the storage worker here. +1.

Collaborator
@peternose peternose left a comment

Even though you refactored the state sync worker, I think there are still a lot of things to be done, as the code is still very unclear and functions are way too long. I would try to break the worker into sub-workers, to break the code down into smaller pieces which are easier to understand.

return
}

// Wait for the common node to be initialized.
Collaborator

This can be removed.

summaryCache := make(map[uint64]*blockSummary)
// Create the fetcher pool.
pendingApply := &minRoundQueue{}
pendingFinalize := &minRoundQueue{} // Suggestion: slice would suffice given that application must happen in order.
Collaborator
@peternose peternose Aug 21, 2025

This comment appears to make sense only from the refactorer's perspective, and may be confusing to others. I recommend removing it or addressing the slice change in a separate commit.

}
}
}
heartbeat := heartbeat{}
Collaborator

Suggested change
heartbeat := heartbeat{}
var heartbeat heartbeat

lastDiff.pf.RecordSuccess()
}
}
err := w.apply(ctx, lastDiff)
Collaborator

Move it one line down and into the if.
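That is, scoping the error to the if, roughly:

if err := w.apply(ctx, lastDiff); err != nil {
	// handle the error as before
}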

delete(summaryCache, lastDiff.round-1)
lastFullyAppliedRound = lastDiff.round

// Suggestion: Rename to lastAppliedRoundMetric, as synced is often a synonym for finalized in this code.
Collaborator

These suggestions are the same as TODO. Either you do them or not.

func (w *Worker) fetchDiff(ctx context.Context, round uint64, prevRoot, thisRoot storageApi.Root) {
func (w *Worker) triggerRoundFetches(
ctx context.Context,
wg *sync.WaitGroup,
Collaborator

I avoid passing wait groups and channels around because it makes it unclear which parts of the code are blocking/using them.

I would try to create two fetchers, a rounder and a differ, which would accept tasks and do their job. The rounder would internally use a wait group, while the differ would expose a diff channel.

Contributor Author

I would try to create two fetchers, a rounder and a differ, which would accept tasks and do their job. The rounder would internally use a wait group, while the differ would expose a diff channel.

Can you elaborate a bit more?

The final refactor I had in mind was done here: c7f230b. See 3x g.Go(...). Arguably could be improved further.

If you can write a few sentences about how many workers there would be and the high-level responsibilities of each, that would be awesome! :)

Collaborator

I was thinking something like this, plus maybe some additional structs for Apply and Finalize. Simple structs, with as few fields as possible, easy to test.

Note that this code might have errors.

type Worker struct { 
	nudger *availabilityNudger
	header *headerFetcher
	differ *diffFetcher
	...
}    
// availabilityNudger tracks the progress of last and last synced rounds
// and “nudges” role providers to mark themselves available or unavailable
// based on how closely the node is keeping up with consensus.
type availabilityNudger struct {
	roleProvider    registration.RoleProvider
	rpcRoleProvider registration.RoleProvider
	roleAvailable   bool

	lastRound       uint64
	lastSyncedRound uint64
}

// newAvailabilityNudger creates a new availability nudger.
func newAvailabilityNudger(localProvider, rpcProvider registration.RoleProvider) *availabilityNudger {
	return &availabilityNudger{
		roleProvider:    localProvider,
		rpcRoleProvider: rpcProvider,
		lastRound:       math.MaxUint64,
		lastSyncedRound: math.MaxUint64,
	}
}

// setLastRound updates the last round number.
func (m *availabilityNudger) setLastRound(round uint64) {
	m.lastRound = round
}

// setLastSyncedRound updates the most recently synced round number.
func (m *availabilityNudger) setLastSyncedRound(round uint64) {
	m.lastSyncedRound = round
}

// updateAvailability updates the role's availability based on the gap
// between the last round and the last synced round.
func (m *availabilityNudger) updateAvailability() {
	if m.lastRound == math.MaxUint64 || m.lastSyncedRound == math.MaxUint64 {
		return
	}
	// The last synced round should never be ahead of the last known round;
	// guard against underflow when computing the lag below.
	if m.lastRound < m.lastSyncedRound {
		return
	}

	switch roundLag := m.lastRound - m.lastSyncedRound; {
	case roundLag < maximumRoundDelayForAvailability:
		m.markAvailable()
	case roundLag > minimumRoundDelayForUnavailability:
		m.markUnavailable()
	}
}

// markAvailable sets the role as available if it is not already.
func (m *availabilityNudger) markAvailable() {
	if m.roleAvailable {
		return
	}
	m.roleAvailable = true

	m.roleProvider.SetAvailable(func(*node.Node) error { return nil })
	if m.rpcRoleProvider != nil {
		m.rpcRoleProvider.SetAvailable(func(*node.Node) error { return nil })
	}
}

// markUnavailable sets the role as unavailable if it is currently available.
func (m *availabilityNudger) markUnavailable() {
	if !m.roleAvailable {
		return
	}
	m.roleAvailable = false

	m.roleProvider.SetUnavailable()
	if m.rpcRoleProvider != nil {
		m.rpcRoleProvider.SetUnavailable()
	}
}
// summaryCache is a concurrent-safe cache for block summaries.
type summaryCache struct {
	mu    sync.Mutex
	cache map[uint64]*blockSummary
}

// newSummaryCache creates a new summary cache.
func newSummaryCache() *summaryCache {
	return &summaryCache{
		cache: make(map[uint64]*blockSummary),
	}
}

// set adds the given summary to the cache.
func (s *summaryCache) set(round uint64, summary *blockSummary) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.cache[round] = summary
}

// get returns a summary from the cache.
func (s *summaryCache) get(round uint64) (*blockSummary, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	summary, ok := s.cache[round]
	return summary, ok
}

// delete removes a summary from the cache.
func (s *summaryCache) delete(round uint64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.cache, round)
}
// headerFetcher is responsible for fetching block headers and populating the summary cache.
type headerFetcher struct {
	history   history.History
	summaries *summaryCache
}

func newHeaderFetcher(history history.History, summaries *summaryCache) *headerFetcher {
	return &headerFetcher{
		history:   history,
		summaries: summaries,
	}
}

// fetch fetches the block header for the given round and populates the summary cache.
func (f *headerFetcher) fetch(ctx context.Context, round uint64, blk *block.Block) error {
	if _, ok := f.summaries.get(round); !ok && round == math.MaxUint64 {
		dummy := blockSummary{
			Namespace: blk.Header.Namespace,
			Round:     round + 1,
			Roots: []api.Root{
				{
					Version: round + 1,
					Type:    api.RootTypeIO,
				},
				{
					Version: round + 1,
					Type:    api.RootTypeState,
				},
			},
		}
		dummy.Roots[0].Empty()
		dummy.Roots[1].Empty()
		f.summaries.set(round, &dummy)
	}
	// Determine if we need to fetch any old block summaries. In case the first
	// round is an undefined round, we need to start with the following round
	// since the undefined round may be unsigned -1 and in this case the loop
	// would not do any iterations.
	startSummaryRound := round
	if startSummaryRound == math.MaxUint64 {
		startSummaryRound++
	}
	for i := startSummaryRound; i < blk.Header.Round; i++ {
		if _, ok := f.summaries.get(i); ok {
			continue
		}
		oldBlock, err := f.history.GetCommittedBlock(ctx, i)
		if err != nil {
			return fmt.Errorf("getting block for round %d (current round: %d): %w", i, blk.Header.Round, err)
		}
		summary := summaryFromBlock(oldBlock)
		f.summaries.set(i, summary)
	}
	if _, ok := f.summaries.get(blk.Header.Round); !ok {
		summary := summaryFromBlock(blk)
		f.summaries.set(blk.Header.Round, summary)
	}
	return nil
}
// diffFetcher is responsible for fetching storage diffs.
type diffFetcher struct {
	diffSync          diffsync.Client
	legacyStorageSync synclegacy.Client

	localStorage storageApi.LocalBackend

	syncingRounds map[uint64]*inFlight
	summaries     *summaryCache

	pool *workerpool.Pool

	ch chan *fetchedDiff
}

// fetch fetches the storage diff for the given rounds.
func (f *diffFetcher) fetch(ctx context.Context, start uint64, end uint64) {
	for round := start; round <= end; round++ {
		f.fetchRound(ctx, round)
	}
}

// fetchRound fetches the storage diff for the given round.
func (f *diffFetcher) fetchRound(ctx context.Context, round uint64) {
	syncing, ok := f.syncingRounds[round]
	if !ok {
		if len(f.syncingRounds) >= maxInFlightRounds {
			return
		}

		syncing = &inFlight{
			startedAt:     time.Now(),
			awaitingRetry: outstandingMaskFull,
		}
		f.syncingRounds[round] = syncing
	}
	if syncing.outstanding.hasAll() {
		return
	}

	prev, _ := f.summaries.get(round - 1)
	curr, _ := f.summaries.get(round)

	prevRoots := make([]api.Root, len(prev.Roots))
	copy(prevRoots, prev.Roots)

	for i := range prevRoots {
		if prevRoots[i].Type == api.RootTypeIO {
			// IO roots aren't chained, so clear it (but leave cache intact).
			prevRoots[i] = api.Root{
				Namespace: curr.Namespace,
				Version:   curr.Round,
				Type:      api.RootTypeIO,
			}
			prevRoots[i].Hash.Empty()
			break
		}
	}

	for i := range prevRoots {
		rootType := prevRoots[i].Type
		if syncing.outstanding.contains(rootType) {
			continue
		}
		if !syncing.awaitingRetry.contains(rootType) {
			continue
		}
		syncing.scheduleDiff(rootType)

		f.pool.Submit(func() {
			f.retryDiff(ctx, curr.Round, prevRoots[i], curr.Roots[i])
		})
	}
}

// retryDiff fetches the storage diff for the given round and schedules a retry on failure.
func (f *diffFetcher) retryDiff(ctx context.Context, round uint64, prevRoot, thisRoot api.Root) {
	diff, err := f.getDiff(ctx, round, prevRoot, thisRoot)
	if err != nil {
		f.syncingRounds[round].retry(thisRoot.Type)
		return
	}

	select {
	case f.ch <- diff:
	case <-ctx.Done():
	}
}

// getDiff fetches the storage diff from local storage or remote peers.
func (f *diffFetcher) getDiff(ctx context.Context, round uint64, prevRoot, thisRoot api.Root) (*fetchedDiff, error) {
	result := &fetchedDiff{
		pf:       rpc.NewNopPeerFeedback(),
		round:    round,
		prevRoot: prevRoot,
		thisRoot: thisRoot,
	}

	// If the new root already exists locally, there is nothing to fetch.
	if f.localStorage.NodeDB().HasRoot(thisRoot) {
		return result, nil
	}

	result.fetched = true

	// Even if HasRoot returns false the root can still exist if it is equal
	// to the previous root and the root was emitted by the consensus committee
	// directly (e.g., during an epoch transition).
	if thisRoot.Hash.Equal(&prevRoot.Hash) {
		result.writeLog = api.WriteLog{}
		return result, nil
	}

	wl, pf, err := f.fetchDiff(ctx, prevRoot, thisRoot)
	if err != nil {
		return nil, err
	}

	result.pf = pf
	result.writeLog = wl

	return result, nil
}

// fetchDiff fetches the write log using the diff sync p2p protocol client.
//
// In case of no peers or an error, it falls back to the legacy storage sync protocol.
func (f *diffFetcher) fetchDiff(ctx context.Context, start, end api.Root) (api.WriteLog, rpc.PeerFeedback, error) {
	rsp1, pf, err := f.diffSync.GetDiff(ctx, &diffsync.GetDiffRequest{
		StartRoot: start,
		EndRoot:   end,
	})
	if err == nil {
		return rsp1.WriteLog, pf, nil
	}

	rsp2, pf, err := f.legacyStorageSync.GetDiff(ctx, &synclegacy.GetDiffRequest{
		StartRoot: start,
		EndRoot:   end,
	})
	if err != nil {
		return nil, nil, err
	}
	return rsp2.WriteLog, pf, nil
}

Contributor Author

Nice! Indeed this approach is more readable compared to what I tried in 9a14979 (i.e. I tried avoiding a mutex at any cost - bad).

Anyways, I will prepare one PR in front and then co-author you for this part :)

Contributor Author
@martintomazic martintomazic Aug 24, 2025

We should also create the checkpointer struct, which requires its last finalized round to be updated, just like the nudger.

But should the state sync worker really be responsible for this orchestration and know about role providers, consensus checkpoints, etc.?

How about we make it only responsible for the state initialization, checkpoint sync, and diff sync? Moreover, the diff sync is another worker, consisting of a few structs like those you provided above.

The nudger, checkpointer and pruner should instead be moved out (possibly each into an independent package under /storage/*, like is the case with statesync right now). Instead of pushing the updates to them, they can watchFinalizedRounds.

Then inside storage.Worker, where we registerRuntime, we could have a per-runtime orchestration worker (nudger, checkpointer, statesync) that also implements the hook interface + fan-out of new blocks coming from the hook subscription.

Over-engineering? Code-wise it looks even simpler to me; I am just a bit unsure how to organize/name the packages if we go this way...
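A rough sketch of the watchFinalizedRounds / fan-out part, just to make the shape concrete (names and the channel-based form are illustrative, not an agreed design):

// roundFanOut fans finalized rounds out to per-runtime sub-workers
// (nudger, checkpointer, pruner, ...).
type roundFanOut struct {
	mu   sync.Mutex
	subs []chan uint64
}

// watchFinalizedRounds returns a channel on which a sub-worker receives finalized rounds.
func (f *roundFanOut) watchFinalizedRounds() <-chan uint64 {
	f.mu.Lock()
	defer f.mu.Unlock()
	ch := make(chan uint64, 1)
	f.subs = append(f.subs, ch)
	return ch
}

// notify distributes a newly finalized round without blocking on slow subscribers;
// a subscriber that has not yet consumed the previous round simply misses this one.
func (f *roundFanOut) notify(round uint64) {
	f.mu.Lock()
	defer f.mu.Unlock()
	for _, ch := range f.subs {
		select {
		case ch <- round:
		default:
		}
	}
}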

Collaborator

We should also create the checkpointer struct, which requires its last finalized round to be updated, just like the nudger.

Probably; I just gave you a few examples.

But should the state sync worker really be responsible for this orchestration and know about role providers, consensus checkpoints, etc.?

Maybe not, unless we rename it.

How about we make it only responsible for the state initialization, checkpoint sync, and diff sync? Moreover, the diff sync is another worker, consisting of a few structs like those you provided above.

Can we do this in 2 steps, where the second step would be what you have written above?

//
// Suggestion: Ideally syncDiffs is refactored into independent worker and made only
// responsible for the syncing.
func (w *Worker) syncDiffs(
Collaborator
@peternose peternose Aug 21, 2025

While it is nice to separate diff and state, I don't prefer having worker functions scattered across multiple files. Having sub-workers for this would solve the problem, and the main worker would just delegate tasks.

Contributor Author
@martintomazic martintomazic Aug 21, 2025

Even though you refactored the state sync worker, I think there are still a lot of things to be done, as the code is still very unclear and functions are way too long. I would try to break the worker into sub-workers, to break the code down into smaller pieces which are easier to understand.

I guess it's mostly about getting there incrementally. Anyways, I see your point, so there is a proposal below.

While it is nice to separate diff and state, I don't prefer having worker functions scattered across multiple files.

Not sure I agree with this; otherwise, why do we already have checkpoint_sync.go and checkpointer.go files? I find the code organized this way (even if part of the same worker) much easier to navigate and reason about.

Ideally, we should also factor state initialization out of the main worker into a separate file (or at least a function). Finally, CheckpointSyncRetry could be moved to checkpoint_sync.go. But this would really be out of scope, as I am mostly focusing on the diff sync here.

Having sub-workers for this would solve the problem, and the main worker would just delegate tasks.

I tried to do this in the follow-up, as I think it would be way too much for one PR. See 9a14979, where this becomes an independent worker.

How about I cut this PR into two:

  1. All commits from the start all the way to avoiding panic (inclusive).
  2. Make the storage worker stateless with regards to context / use the adapter you suggested (new).
  3. Add a timeout to checkpoint restoration (new).
  4. Avoid the potential deadlock on the clean-up.
  5. Make checkpoint.Checkpointer not require taking a ctx - go/worker/storage: Refactor state sync worker #6299 (comment)

The follow-up can then be creating the independent diff sync worker, which on a high level consists of:

  1. Refactor - 213b17f.
  2. Move it to a separate file.
  3. Make it an independent worker - 9a14979.

Possibly the refactor - 213b17f - could be done as part of the first PR already, given that it makes the code easier to reason about and is useful in itself.

Maybe you would prefer that?

Collaborator

Not sure I agree with this; otherwise, why do we already have checkpoint_sync.go and checkpointer.go files? I find the code organized this way (even if part of the same worker) much easier to navigate and reason about.

If you want to split a struct in two files, this means that it has at least two responsibilities and could probably be refactored.

Ideally, we should also factor state initialization out of the main worker into a separate file (or at least a function). Finally, CheckpointSyncRetry could be moved to checkpoint_sync.go. But this would really be out of scope, as I am mostly focusing on the diff sync here.

Probably yes.

How about I cut this PR into two:

Yes, that would be nice and we could review and merge quicker.

fetchPool := workerpool.New("storage_fetch/" + w.commonNode.Runtime.ID().String())
fetchPool.Resize(config.GlobalConfig.Storage.FetcherCount)
defer fetchPool.Stop()
fetchCtx, cancel := context.WithCancel(ctx)
Collaborator

Shouldn't we just do this at the top: ctx, cancel := context.WithCancel(ctx)?

rootType := prevRoots[i].Type
if !syncing.outstanding.contains(rootType) && syncing.awaitingRetry.contains(rootType) {
syncing.scheduleDiff(rootType)
wg.Add(1)
Collaborator

I don't like leaving goroutines to be killed once the program exits, as they might corrupt something.
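For reference, the usual pattern that avoids this (a sketch, not the PR's code): workers exit on context cancellation and the owner waits for them, so nothing is cut off mid-write on shutdown.

// runFetchers starts a fixed number of fetch workers and blocks until all of
// them have returned, either because the task channel was closed or the
// context was canceled. Nothing is left running when this function returns.
func runFetchers(ctx context.Context, tasks <-chan uint64, fetch func(context.Context, uint64)) {
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ { // worker count is illustrative
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				select {
				case <-ctx.Done():
					return
				case round, ok := <-tasks:
					if !ok {
						return
					}
					fetch(ctx, round)
				}
			}
		}()
	}
	wg.Wait()
}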



Development

Successfully merging this pull request may close these issues.

Refactor runtime storage committee worker into smaller and independent workers

3 participants