Skip to content

Conversation

@martintomazic
Copy link
Contributor

@martintomazic martintomazic commented Aug 25, 2025

What was done:

  1. Pass context explicitly.
  2. Fix:
    • Panicking.
    • Deadlocking on the clean-up.
    • Triggering genesis checkpoint to early.
  3. Minor refactors.

Follow-up:

I propose factoring out checkpointer and availability nudger (#6308) into separate workers. Could be one PR for each.
This will make diff sync part easier to reason about and thus optimize.

@netlify
Copy link

netlify bot commented Aug 25, 2025

Deploy Preview for oasisprotocol-oasis-core canceled.

Name Link
🔨 Latest commit 91ed0da
🔍 Latest deploy log https://app.netlify.com/projects/oasisprotocol-oasis-core/deploys/68cc0275a754f500080a9776

@martintomazic martintomazic force-pushed the martin/trivial/state-sync-refactor branch 4 times, most recently from a25e3c2 to c22279e Compare August 27, 2025 10:23
@martintomazic
Copy link
Contributor Author

For the non-synchronized documentation I should prepare a PR and merge intermediately after?

@martintomazic martintomazic marked this pull request as ready for review August 27, 2025 10:29
@martintomazic martintomazic marked this pull request as draft August 27, 2025 10:33
@martintomazic
Copy link
Contributor Author

Let me fix the linting first and then I open it again. :)

@martintomazic martintomazic force-pushed the martin/trivial/state-sync-refactor branch from c22279e to ce420f1 Compare August 27, 2025 11:08
@martintomazic martintomazic force-pushed the martin/trivial/state-sync-refactor branch 2 times, most recently from 619ca25 to b65ffef Compare August 30, 2025 07:26
@martintomazic martintomazic force-pushed the martin/trivial/state-sync-refactor branch 2 times, most recently from 1f62446 to b6a785c Compare September 12, 2025 08:59
)

ctx, cancel := context.WithCancel(w.ctx)
ctx, cancel := context.WithTimeout(w.ctx, diffResponseTimeout)
Copy link
Contributor Author

@martintomazic martintomazic Sep 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added this one as we discussed in the other threads by simply copying the client internal timeout (defensive).

But this should probably be reset per call or increased 2x given that is now used for calling legacy and new protocol internally...

update: tempted to drop this commit alltogether :).

Copy link
Contributor Author

@martintomazic martintomazic Sep 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved this down to getDiff now and reset it.

@martintomazic martintomazic force-pushed the martin/trivial/state-sync-refactor branch from b6a785c to 2d5d1e0 Compare September 12, 2025 09:13
Copy link
Collaborator

@peternose peternose left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

ctx, cancel := context.WithCancel(ctx)
defer cancel()

go func() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not wg.Go?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also removed redundant select in this go routine (captured by static check not part of our CI lint).

@martintomazic martintomazic force-pushed the martin/trivial/state-sync-refactor branch 2 times, most recently from 08b1526 to 406b4ff Compare September 12, 2025 14:43
Also rename node to worker, to avoid confusion.
Previously, genesis checkpoint was created right after the state
may be initialized from the runtime descriptor. If this is not the
case it is first fetched from the peers.

Thus, we should force the genesis checkpoint only after checkpoint
sync finishes.

Finally, redundant nil check was removed.
This is desirable, so that worker that initilize a new
checkpointer don't require accepting context, but instead the
lifetime and initialization of checkpointer is handled by the
worker's Serve method.
Previously if the context was not canceled the fetcher might
be sending the diffs on a channel that cannot be emptied,
since we are already out of the main for loop, resulting in
wg.Wait to never complete.
This also serves as step towards passing the context explicitly.
In addition, committee storage worker now implements the
Service interface. The corresponding BackgroundService
methods (already not used) have been removed.

Similarly, the storage worker was internally refactored
to Service interface to ease eventual removal of the
BackgroundService.

Additionally, observe that the parent (storage worker) is
registered as background service, thus upon error inside
committee worker there is no need to manually request the
node shutdown.

Finally, panicking has been replaced with error.

Semantic changed slightly: Previously storage worker
would wait for all committee workers to finish. Now it will
terminate when the first one finishes. This was already the
case if the committee worker panicked.
@martintomazic martintomazic force-pushed the martin/trivial/state-sync-refactor branch from 406b4ff to 91ed0da Compare September 18, 2025 13:00
@codecov
Copy link

codecov bot commented Sep 18, 2025

Codecov Report

❌ Patch coverage is 82.71028% with 74 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.56%. Comparing base (0a57136) to head (91ed0da).
⚠️ Report is 16 commits behind head on master.

Files with missing lines Patch % Lines
go/worker/storage/committee/worker.go 88.08% 26 Missing and 2 partials ⚠️
go/worker/storage/committee/checkpoint_sync.go 59.52% 12 Missing and 5 partials ⚠️
go/worker/storage/committee/prune.go 36.36% 13 Missing and 1 partial ⚠️
go/storage/mkvs/checkpoint/checkpointer.go 87.35% 9 Missing and 2 partials ⚠️
go/worker/storage/service_internal.go 0.00% 3 Missing ⚠️
go/worker/storage/worker.go 96.66% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6306      +/-   ##
==========================================
+ Coverage   64.51%   64.56%   +0.05%     
==========================================
  Files         697      698       +1     
  Lines       67847    67826      -21     
==========================================
+ Hits        43770    43792      +22     
+ Misses      19089    19046      -43     
  Partials     4988     4988              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@martintomazic martintomazic merged commit 6079b2a into master Sep 19, 2025
7 checks passed
@martintomazic martintomazic deleted the martin/trivial/state-sync-refactor branch September 19, 2025 08:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Refactor runtime storage committee worker into smaller and independent workers

3 participants