Conversation

@PastaPastaPasta
Member

Issue being fixed or feature implemented

I don't have good data on whether this results in an improvement. However, if sigShare BLS signing ever takes more than a full thread, we will be bottlenecked.

This PR addresses two concerns:

  1. Separate out the cleanup and network-sending logic into its own thread

This should prevent the network sending or cleanup logic from slowing down the creation of sigShares.

  2. Use a thread pool and dispatch jobs to that pool
    This should enable scaling sigShare creation past one full thread; it can currently use up to half of the available cores for sigShare signing (see the sizing sketch below).
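
A minimal, self-contained sizing sketch, assuming only the ctpl::thread_pool interface (resize/push/stop) from the ctpl_stl.h added in this PR; the 1–4 worker clamp mirrors the description above, everything else is illustrative:

#include <ctpl_stl.h>

#include <algorithm>
#include <thread>

// Roughly half of the hardware threads, but always between 1 and 4 workers.
static int WorkerCount()
{
    return std::clamp(static_cast<int>(std::thread::hardware_concurrency() / 2), 1, 4);
}

int main()
{
    ctpl::thread_pool workerPool;
    workerPool.resize(WorkerCount());

    // Each signing request becomes one task; ctpl passes the worker id
    // as the first argument of the callable.
    for (int job = 0; job < 16; ++job) {
        workerPool.push([job](int /*worker_id*/) {
            (void)job; // ... create and process a single sigShare here ...
        });
    }

    workerPool.stop(true); // drain the queued tasks, then join the workers
}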

How Has This Been Tested?

This hasn't been tested yet; please review, but I think I'll want to deploy this on testnet before merging.

Breaking Changes

Checklist:

Go over all the following points, and put an x in all the boxes that apply.

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have added or updated relevant unit/integration/functional/e2e tests
  • I have made corresponding changes to the documentation
  • I have assigned this pull request to a milestone (for repository code-owners and collaborators only)

@PastaPastaPasta PastaPastaPasta added this to the 23.1 milestone Nov 22, 2025
@coderabbitai

coderabbitai bot commented Nov 22, 2025

Walkthrough

The single dedicated worker thread in CSigSharesManager was replaced with a capped ctpl::thread_pool plus two dedicated threads: a WorkDispatcher thread and a Housekeeping thread. Lifecycle methods StartWorkerThread/StopWorkerThread were renamed to Start/Stop. A new public AsyncSign(quorum, id, msgHash) enqueues signing tasks. Internal flow was refactored into DispatchPendingSigns, DispatchPendingProcessing, ProcessPendingSigSharesLoop, and SignAndProcessSingleShare worker tasks. ctpl_stl.h was added and the base listener type changed to llmq::CRecoveredSigsListener.
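
For orientation, a caller-side sketch of the new entry point; the AsyncSign name and arguments come from the walkthrough above, while the include paths and exact parameter types (CQuorumCPtr, uint256) are assumptions:

#include <llmq/signing_shares.h> // include path assumed
#include <uint256.h>

// Hypothetical call site: instead of signing inline, enqueue the request and
// let the dispatcher and worker pool produce the sigShare later.
void RequestSigShareSketch(llmq::CSigSharesManager& shareman,
                           const llmq::CQuorumCPtr& quorum,
                           const uint256& id,
                           const uint256& msgHash)
{
    shareman.AsyncSign(quorum, id, msgHash);
}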

Sequence Diagram

sequenceDiagram
    participant Caller
    participant Manager as CSigSharesManager
    participant Dispatcher as WorkDispatcherThread
    participant Pool as WorkerPool
    participant Housekeeping as HousekeepingThread

    Caller->>Manager: AsyncSign(quorum,id,msgHash)
    activate Manager
    Manager->>Manager: enqueue PendingSignatureData
    deactivate Manager

    Dispatcher->>Manager: WorkDispatcherThreadMain()
    activate Dispatcher
    Dispatcher->>Manager: DispatchPendingSigns()
    activate Manager
    Manager->>Pool: enqueue SignAndProcessSingleShare tasks
    deactivate Manager
    Dispatcher->>Manager: DispatchPendingProcessing()
    activate Manager
    Manager->>Pool: maybe spawn ProcessPendingSigSharesLoop
    deactivate Manager
    deactivate Dispatcher

    Pool->>Pool: SignAndProcessSingleShare()
    activate Pool
    Note over Pool: create/process single sigShare\nupdate session state when ready
    deactivate Pool

    Pool->>Pool: ProcessPendingSigSharesLoop()
    activate Pool
    Note over Pool: loop ProcessPendingSigShares until none remain
    deactivate Pool

    Housekeeping->>Housekeeping: HousekeepingThreadMain()
    activate Housekeeping
    Note over Housekeeping: periodic RemoveBannedNodeStates,\nSendMessages, Cleanup
    deactivate Housekeeping

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20–30 minutes

  • Verify workerPool initialization/resizing and graceful shutdown (clearing pending tasks).
  • Review Start()/Stop() lifecycle ordering and thread joins to avoid races with registration/unregistration.
  • Inspect DispatchPendingSigns/DispatchPendingProcessing handoff and task enqueue correctness.
  • Review SignAndProcessSingleShare and ProcessPendingSigSharesLoop for concurrency correctness and state updates.
  • Check inclusion/usage of ctpl_stl.h and thread naming/management.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main change: refactoring CSigSharesManager to introduce multi-threaded worker pool and dispatcher architecture.
Description check ✅ Passed The description clearly explains the motivation, implementation approach, and testing plan for the multi-threaded worker pool refactor.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/llmq/signing_shares.cpp (1)

1-1: Critical: Fix code formatting issues

The CI pipeline reports clang-format differences in this file. Please run clang-format to fix the formatting issues before merging.

#!/bin/bash
# Run clang-format on the file to see differences
git diff -U0 --no-color HEAD^ HEAD -- src/llmq/signing_shares.cpp | clang-format-diff.py -p1
🧹 Nitpick comments (1)
src/llmq/signing_shares.cpp (1)

1665-1683: Consider spawn frequency for processing workers

The dispatcher calls DispatchPendingProcessing() every 10ms. If pending work exists and processing takes longer than 10ms, this could queue multiple ProcessPendingSigSharesLoop() tasks in the worker pool.

While the fixed-size worker pool (1-4 workers) prevents unbounded thread creation, queuing many identical loop tasks could lead to redundant work since ProcessPendingSigSharesLoop() will exit once it finds no more work.

Consider adding a flag to prevent queuing additional processing tasks when one is already queued or running, or verify this pattern is intentional for load distribution.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c9faf42 and f892f79.

📒 Files selected for processing (2)
  • src/llmq/signing_shares.cpp (2 hunks)
  • src/llmq/signing_shares.h (5 hunks)
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: kwvg
Repo: dashpay/dash PR: 6543
File: src/wallet/receive.cpp:240-251
Timestamp: 2025-02-06T14:34:30.466Z
Learning: Pull request #6543 is focused on move-only changes and refactoring, specifically backporting from Bitcoin. Behavior changes should be proposed in separate PRs.
Learnt from: kwvg
Repo: dashpay/dash PR: 6761
File: src/chainlock/signing.cpp:247-250
Timestamp: 2025-07-29T14:32:48.369Z
Learning: In PR #6761, kwvg acknowledged a null pointer check issue in ChainLockSigner::Cleanup() method but deferred it to follow-up, consistent with the pattern of avoiding scope creep in refactoring PRs.
📚 Learning: 2025-09-09T21:36:11.833Z
Learnt from: kwvg
Repo: dashpay/dash PR: 6837
File: src/stats/rawsender.cpp:0-0
Timestamp: 2025-09-09T21:36:11.833Z
Learning: In RawSender class (src/stats/rawsender.cpp), cs_net is the appropriate mutex for protecting socket access (m_sock) and network operations, not additional custom locks. The implementation correctly uses cs_net with GUARDED_BY annotations and EXCLUSIVE_LOCKS_REQUIRED to synchronize socket access between SendDirectly() and ReconnectThread().

Applied to files:

  • src/llmq/signing_shares.h
📚 Learning: 2025-10-02T18:29:54.756Z
Learnt from: kwvg
Repo: dashpay/dash PR: 6840
File: src/net_processing.cpp:2882-2886
Timestamp: 2025-10-02T18:29:54.756Z
Learning: Across net_processing.cpp, once LLMQContext (m_llmq_ctx) is asserted non-null, its subcomponents (e.g., isman, qdkgsman, quorum_block_processor) are treated as initialized and used without extra null checks.

Applied to files:

  • src/llmq/signing_shares.h
🧬 Code graph analysis (2)
src/llmq/signing_shares.h (2)
src/llmq/signing.h (1)
  • CRecoveredSigsListener (158-164)
src/masternode/active/context.cpp (4)
  • Start (54-59)
  • Start (54-54)
  • Stop (61-66)
  • Stop (61-61)
src/llmq/signing_shares.cpp (3)
src/llmq/context.cpp (4)
  • Start (50-55)
  • Start (50-50)
  • Stop (57-62)
  • Stop (57-57)
src/llmq/quorums.cpp (4)
  • Start (237-243)
  • Start (237-237)
  • Stop (245-250)
  • Stop (245-245)
src/masternode/active/context.cpp (4)
  • Start (54-59)
  • Start (54-54)
  • Stop (61-66)
  • Stop (61-61)
🪛 GitHub Actions: Clang Diff Format Check
src/llmq/signing_shares.cpp

[error] 1-1: Clang-format differences found in src/llmq/signing_shares.cpp. Run the clang-format-diff.py tool (or clang-format) to fix code formatting. Command diff step exited with code 1 due to differences.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
  • GitHub Check: linux64_tsan-build / Build source
  • GitHub Check: x86_64-w64-mingw32 / Build depends
  • GitHub Check: x86_64-apple-darwin / Build depends
  • GitHub Check: Lint / Run linters
  • GitHub Check: arm-linux-gnueabihf / Build depends
  • GitHub Check: x86_64-pc-linux-gnu / Build depends
  • GitHub Check: x86_64-pc-linux-gnu_nowallet / Build depends
🔇 Additional comments (14)
src/llmq/signing_shares.h (6)

9-9: LGTM: Thread pool library inclusion

The ctpl_stl.h include is necessary for the worker pool architecture introduced in this refactor.


364-364: LGTM: Explicit namespace qualification

Making the namespace explicit improves code clarity without changing functionality.


383-386: LGTM: Thread pool architecture

The shift from a single workThread to a workerPool with dedicated housekeepingThread and dispatcherThread aligns with the PR objectives to enable parallel sigShare signing.


439-440: LGTM: New public API for async signing

The new AsyncSign() method provides a cleaner interface for enqueueing signing tasks to the worker pool. The lock annotations correctly specify that callers must not hold either cs_pendingSigns or cs.
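
As a rough illustration of what such annotations look like, a hypothetical excerpt using the Clang thread-safety macros from sync.h; the class name, member types, and simplified signature are stand-ins, only the negative-capability requirement comes from the review:

#include <sync.h>    // Mutex, EXCLUSIVE_LOCKS_REQUIRED
#include <uint256.h>

class SigSharesManagerSketch
{
    mutable Mutex cs_pendingSigns;
    mutable Mutex cs;

public:
    // Negative capabilities: callers must NOT already hold either mutex,
    // because this call (and the worker it feeds) acquires them itself.
    void AsyncSign(const uint256& id, const uint256& msgHash)
        EXCLUSIVE_LOCKS_REQUIRED(!cs_pendingSigns, !cs);
};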


398-404: LGTM: Data structure for worker tasks

The PendingSignatureData struct appropriately captures the data needed for each signing task. Using const members helps ensure thread safety.


503-513: LGTM: Well-structured dispatcher and worker methods

The new thread entry points and dispatcher methods are properly declared with correct lock annotations. The separation of concerns (housekeeping, dispatching, and worker tasks) is clear and maintainable.

src/llmq/signing_shares.cpp (8)

220-239: LGTM: Proper thread pool shutdown sequence

The Stop() method correctly:

  • Clears the worker pool queue to prevent new tasks from executing
  • Stops the pool with wait=true to allow in-flight tasks to complete
  • Joins threads only if they're joinable

The requirement to call InterruptWorkerThread() before Stop() is documented and enforced (line 222-225).


1614-1630: LGTM: Housekeeping thread implementation

The housekeeping thread correctly handles the following (a minimal loop sketch follows the list):

  • Cleaning up banned node states
  • Rate-limiting message sending to every 100ms
  • Running periodic cleanup
  • Clean shutdown via workInterrupt
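
A minimal sketch of that loop shape, assuming the CThreadInterrupt-style workInterrupt used elsewhere in the codebase (boolean conversion plus an interruptible sleep_for); the body calls are left as comments because only their names come from the review:

#include <chrono>
#include <util/threadinterrupt.h> // header path assumed; may differ in this tree

void HousekeepingLoopSketch(CThreadInterrupt& workInterrupt)
{
    while (!workInterrupt) {
        // RemoveBannedNodeStates();
        // SendMessages(); // effectively rate-limited by the 100ms sleep below
        // Cleanup();
        if (!workInterrupt.sleep_for(std::chrono::milliseconds(100))) {
            return; // interrupted while sleeping -> clean shutdown
        }
    }
}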

1632-1644: LGTM: Dispatcher thread implementation

The dispatcher thread appropriately:

  • Dispatches all pending sign tasks to the worker pool
  • Conditionally spawns workers when there's incoming signature share processing work
  • Uses a 10ms sleep interval for responsive dispatching
  • Handles clean shutdown via workInterrupt

1646-1663: LGTM: Efficient pending sign dispatch

The method correctly drains all pending signs from the queue and dispatches them to the worker pool. The use of std::optional and move semantics ensures efficient data transfer without copies.


1685-1694: LGTM: Efficient processing loop

The loop correctly processes batches of pending signature shares until no work remains, then exits immediately to free the worker for other tasks.
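
The drain-loop shape as a small sketch; processOneBatch stands in for ProcessPendingSigShares(), which is assumed to report whether more work may remain, and the interrupt flag is likewise a stand-in:

#include <atomic>
#include <functional>

void ProcessPendingSigSharesLoopSketch(const std::function<bool()>& processOneBatch,
                                       const std::atomic<bool>& interrupted)
{
    // Keep pulling batches so a single queued pool task can drain a backlog
    // without waiting for the dispatcher's next tick.
    while (!interrupted.load()) {
        if (!processOneBatch()) {
            return; // nothing left: free this worker for other tasks
        }
    }
}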


1696-1713: LGTM: Single share signing and processing

The method correctly:

  • Creates a signature share
  • Validates it before processing
  • Updates the signed sessions when all-members-connected is enabled
  • Acquires appropriate locks for shared state updates

1715-1719: LGTM: Clean async signing API

The AsyncSign() method provides a simple, thread-safe interface for enqueueing signing tasks. The use of move semantics for the quorum pointer is appropriate.


199-218: Remove incorrect comparison and clarify that this pattern is not widely established

The cited comparison to CQuorumManager::Start() in src/llmq/quorums.cpp:236-242 is incorrect—that function does not contain a joinable/assert(false) pattern. While similar assert(false) checks on joinable() conditions exist in src/instantsend/net_instantsend.cpp and src/llmq/signing.cpp, this pattern is not a widespread established practice in the codebase.

The underlying concern about assert(false) having no effect in release builds with NDEBUG remains valid, but the justification claiming this follows an established pattern should be removed.

Likely an incorrect or invalid review comment.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (4)
src/llmq/signing_shares.cpp (4)

199-239: Start/Stop lifecycle for worker pool and helper threads is reasonable; consider documenting/guarding single-use assumption

The new Start()/Stop() correctly:

  • Prevent starting when housekeeping/dispatcher threads are already joinable.
  • Size and name the worker pool (1–4 threads, ~half of hardware_concurrency()).
  • Launch a housekeeping thread and a dispatcher thread wrapped in util::TraceThread.
  • On Stop(), require workInterrupt to have been triggered, then clear the queue, stop the pool, and join both helper threads.

This looks sound given the existing contract that InterruptWorkerThread() is called before Stop(). One subtle point: Start() doesn’t reset workInterrupt, so this instance is effectively intended to have a single Start/Interrupt/Stop lifecycle. If you ever expect multiple Start/Stop cycles on the same CSigSharesManager instance, you’ll need either a workInterrupt.reset() (plus appropriate asserts) in Start() or an explicit comment asserting single-use semantics.
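
A tiny self-contained sketch of why that reset matters, with a plain std::atomic<bool> standing in for the real interrupt object (CThreadInterrupt exposes reset() for this purpose); the PR itself keeps the single-use lifecycle and does not do this:

#include <atomic>
#include <cassert>

std::atomic<bool> workInterrupt{false}; // stand-in for the interrupt flag

void Start()
{
    // Without this reset, a second Start() after Interrupt()+Stop() would
    // launch loops whose `while (!workInterrupt)` condition is already false.
    workInterrupt.store(false);
    // ... resize the pool and launch housekeeping/dispatcher threads ...
}

void InterruptAndStop()
{
    workInterrupt.store(true); // InterruptWorkerThread() equivalent
    // ... join threads and stop the pool ...
}

int main()
{
    Start();
    InterruptAndStop();
    Start(); // a restart only works because Start() clears the flag
    assert(!workInterrupt.load());
    InterruptAndStop();
}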


1639-1687: DispatchPendingSigns / DispatchPendingProcessing: concurrency model is safe; optional guard to limit helper loop fan‑out

The new scheduling looks correct:

  • DispatchPendingSigns() atomically swaps pendingSigns under cs_pendingSigns and enqueues one SignAndProcessSingleShare task per pending item into workerPool, breaking early if workInterrupt is set.
  • DispatchPendingProcessing() checks under cs whether any pendingIncomingSigShares remain and, when true, enqueues a ProcessPendingSigSharesLoop() helper task.
  • ProcessPendingSigSharesLoop() repeatedly calls ProcessPendingSigShares() until it reports no more work or workInterrupt becomes true; ProcessPendingSigShares() itself is internally synchronized with cs.

Functionally this allows multiple workers to help drain the verification/processing backlog without data races. The only potential concern is that DispatchPendingProcessing() can enqueue additional ProcessPendingSigSharesLoop() tasks every 10 ms while work exists, even if one or more such helpers are already running; extra helpers will quickly exit when they find no more work, but they still consume queue entries.

If you see worker-pool saturation or unnecessary task churn in profiling, consider adding a lightweight guard (e.g., an atomic “processing loop in flight” flag cleared when ProcessPendingSigSharesLoop() exits) so at most one such loop runs at a time, relying on that loop to drain the backlog. This is an optional optimization; the current design is correct but may be slightly more chatty in very busy scenarios.


1689-1712: AsyncSign + SignAndProcessSingleShare look correct; minor perf and thread‑safety considerations

The end-to-end path:

  • AsyncSign() queues PendingSignatureData under cs_pendingSigns.
  • The dispatcher swaps pendingSigns out and enqueues SignAndProcessSingleShare() tasks into workerPool.
  • Each worker calls CreateSigShare(*work.quorum, work.id, work.msgHash) and, on success, feeds the result into ProcessSigShare(), updating signedSessions under cs for all-members-connected mode.

This is logically sound and respects existing locking (cs_pendingSigns and cs). Two small points:

  1. CreateSigShare() already verifies the produced signature and only returns a std::optional<CSigShare> on success, so the extra opt_sigShare->sigShare.Get().IsValid() check redoes expensive BLS work. You could rely on CreateSigShare()’s contract (or move the validity check there exclusively) to avoid this redundant verification on a hot path.

  2. BLS signing is now happening concurrently across multiple worker threads. That’s the goal of this refactor, but it does assume the underlying BLS implementation (including m_mn_activeman.Sign() and CBLSSecretKey::Sign()) is thread-safe for concurrent use. It likely is, but it’s worth double-checking with library documentation or stress tests.


1716-1754: CreateSigShare membership checks and single‑member path are solid; logging format could be clarified

The updated CreateSigShare():

  • Fetches activeMasterNodeProTxHash once and immediately returns std::nullopt if quorum.IsValidMember() is false, preventing accidental signing when this node isn’t a valid member.
  • Handles the is_single_member() case explicitly by:
    • Deriving memberIdx consistently from GetMemberIndex().
    • Signing with m_mn_activeman.Sign() and validating the resulting lazy BLS signature before returning.
  • Keeps the multi-member path using GetSkShare() with a validity check, then signs and validates similarly.

This tightens correctness around membership and makes the single-member quorum behavior explicit. One minor nit: both failure and success logs here use time=%s with t.count() as the corresponding argument, while other timers in this file use an integer format specifier (e.g., %d). For consistency and to avoid confusion about the underlying type, consider switching these format specifiers to match the rest of the file (e.g., %d or another numeric specifier appropriate for cxxtimer::Timer::count()).

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f892f79 and 228e568.

📒 Files selected for processing (2)
  • src/llmq/signing_shares.cpp (2 hunks)
  • src/masternode/active/context.cpp (1 hunks)
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: kwvg
Repo: dashpay/dash PR: 6761
File: src/chainlock/signing.cpp:247-250
Timestamp: 2025-07-29T14:32:48.369Z
Learning: In PR #6761, kwvg acknowledged a null pointer check issue in ChainLockSigner::Cleanup() method but deferred it to follow-up, consistent with the pattern of avoiding scope creep in refactoring PRs.
Learnt from: kwvg
Repo: dashpay/dash PR: 6543
File: src/wallet/receive.cpp:240-251
Timestamp: 2025-02-06T14:34:30.466Z
Learning: Pull request #6543 is focused on move-only changes and refactoring, specifically backporting from Bitcoin. Behavior changes should be proposed in separate PRs.
📚 Learning: 2025-10-02T18:29:54.756Z
Learnt from: kwvg
Repo: dashpay/dash PR: 6840
File: src/net_processing.cpp:2882-2886
Timestamp: 2025-10-02T18:29:54.756Z
Learning: Across net_processing.cpp, once LLMQContext (m_llmq_ctx) is asserted non-null, its subcomponents (e.g., isman, qdkgsman, quorum_block_processor) are treated as initialized and used without extra null checks.

Applied to files:

  • src/masternode/active/context.cpp
📚 Learning: 2025-10-02T18:29:54.756Z
Learnt from: kwvg
Repo: dashpay/dash PR: 6840
File: src/net_processing.cpp:2882-2886
Timestamp: 2025-10-02T18:29:54.756Z
Learning: In src/net_processing.cpp, if ActiveContext (m_active_ctx) is non-null, its members (including cj_server) are guaranteed to be initialized; call sites can safely dereference m_active_ctx->cj_server without an additional null-check.

Applied to files:

  • src/masternode/active/context.cpp
🧬 Code graph analysis (2)
src/masternode/active/context.cpp (2)
src/llmq/signing_shares.cpp (2)
  • Stop (220-239)
  • Stop (220-220)
src/llmq/context.cpp (2)
  • Stop (57-62)
  • Stop (57-57)
src/llmq/signing_shares.cpp (1)
src/llmq/context.cpp (4)
  • Start (50-55)
  • Start (50-50)
  • Stop (57-62)
  • Stop (57-57)
🪛 GitHub Actions: Clang Diff Format Check
src/llmq/signing_shares.cpp

[error] 1-1: Clang format differences detected by clang-format-diff.py. Please run the formatting script to update src/llmq/signing_shares.cpp (diff shown in logs) and re-run CI.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
  • GitHub Check: linux64_tsan-build / Build source
  • GitHub Check: Lint / Run linters
  • GitHub Check: x86_64-w64-mingw32 / Build depends
  • GitHub Check: x86_64-apple-darwin / Build depends
  • GitHub Check: x86_64-pc-linux-gnu_nowallet / Build depends
  • GitHub Check: x86_64-pc-linux-gnu / Build depends
  • GitHub Check: arm-linux-gnueabihf / Build depends
🔇 Additional comments (3)
src/masternode/active/context.cpp (1)

54-66: CSigSharesManager Start/Stop usage and ordering look correct

Swapping StartWorkerThread()/StopWorkerThread() for Start()/Stop() here keeps the lifecycle and ordering intact: you still start QDKGS threads, then register as a recovered-sigs listener, then start shareman, and reverse that on shutdown. With Interrupt() still delegating to shareman->InterruptWorkerThread(), the new multi-threaded manager should integrate cleanly with the existing ActiveContext shutdown flow.

src/llmq/signing_shares.cpp (2)

1614-1637: Housekeeping and dispatcher threads cleanly separate concerns

Moving RemoveBannedNodeStates(), SendMessages(), and Cleanup() into HousekeepingThreadMain() and using a separate WorkDispatcherThreadMain() to drive signing/processing work cleanly decouples network/cleanup duties from BLS signing. Both loops respect workInterrupt and use sleep_for to avoid busy-waiting, which should behave similarly to the old single-thread loop but without blocking signing on slow networking or cleanup.

Assuming there are no other call sites invoking these methods concurrently (which matches what’s visible here), this separation looks correct.


199-239: Verify formatting by running clang-format locally on this file

The file src/llmq/signing_shares.cpp is subject to the clang-format check (matched by src/llmq/*.cpp in the non-backported.txt list). The code as currently shown appears properly formatted, but to confirm the CI formatting check will pass, please run the project's clang-format tooling locally and verify no changes are reported:

git diff origin/develop -- src/llmq/signing_shares.cpp | ./contrib/devtools/clang-format-diff.py -p1

If this produces no output, the formatting is correct and the comment can be resolved.

@UdjinM6

UdjinM6 commented Nov 22, 2025

pls consider fd621a3 and 3d6fd57

@PastaPastaPasta
Member Author

@UdjinM6

It was intentional to loop here: it avoids forcing the 10ms wait and allows multiple workers to process batches simultaneously / in parallel. Without the loop we would in practice be limited to just one worker at a time.

@UdjinM6

UdjinM6 commented Nov 22, 2025

Makes sense. That's not how we usually use the worker pool (we usually provide specific pieces to work on), but it should work I guess 👍

UdjinM6 previously approved these changes Nov 22, 2025

@UdjinM6 UdjinM6 left a comment


utACK 228e568

@UdjinM6

UdjinM6 commented Nov 22, 2025

nit: could maybe apply clang-format suggestions

@github-actions

This pull request has conflicts, please rebase.

… thread safety

- Removed unnecessary lastSendTime variable and simplified message sending in HousekeepingThreadMain.
- Enhanced DispatchPendingSigns by swapping the entire vector of pending signs to reduce lock contention and improve performance.
- Updated ActiveContext to start and stop the share manager more efficiently.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
src/llmq/signing_shares.cpp (1)

1679-1683: Consider tracking in-flight processing workers to avoid redundant tasks.

The current implementation spawns a new worker every 10ms if work exists, even if a previously spawned worker is still processing. While the fixed-size pool prevents unbounded resource usage, this could queue many redundant tasks if work persists.

Consider adding a flag to track whether a processing worker is already active:

+ std::atomic<bool> processingActive{false};
+
  void CSigSharesManager::DispatchPendingProcessing()
  {
+     if (processingActive.load()) {
+         return;  // Already processing
+     }
+
      // Check if there's work, spawn a helper if so
      bool hasWork = false;
      {
          LOCK(cs);
          hasWork = std::any_of(nodeStates.begin(), nodeStates.end(),
              [](const auto& entry) {
                  return !entry.second.pendingIncomingSigShares.Empty();
              });
      }

      if (hasWork) {
+         processingActive.store(true);
          // Work exists - spawn a worker to help!
          workerPool.push([this](int) {
              ProcessPendingSigSharesLoop();
+             processingActive.store(false);
          });
      }
  }

Note: This optimization was discussed in past review comments and deemed acceptable as-is.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 228e568 and b8e0762.

📒 Files selected for processing (3)
  • src/llmq/signing_shares.cpp (2 hunks)
  • src/llmq/signing_shares.h (5 hunks)
  • src/masternode/active/context.cpp (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/masternode/active/context.cpp
🧰 Additional context used
📓 Path-based instructions (5)
src/**/*.{cpp,h,hpp,cc}

📄 CodeRabbit inference engine (CLAUDE.md)

Dash Core implementation must be written in C++20, requiring at least Clang 16 or GCC 11.1

Files:

  • src/llmq/signing_shares.cpp
  • src/llmq/signing_shares.h
src/{masternode,evo,llmq,governance,coinjoin}/**/*.{cpp,h}

📄 CodeRabbit inference engine (CLAUDE.md)

Use Dash-specific database implementations: CFlatDB for persistent storage (MasternodeMetaStore, GovernanceStore, SporkStore, NetFulfilledRequestStore) and CDBWrapper extensions for Evolution/DKG/InstantSend/Quorum/RecoveredSigs data

Files:

  • src/llmq/signing_shares.cpp
  • src/llmq/signing_shares.h
src/{masternode,llmq}/**/*.{cpp,h}

📄 CodeRabbit inference engine (CLAUDE.md)

BLS integration must be used for cryptographic foundation of advanced masternode features

Files:

  • src/llmq/signing_shares.cpp
  • src/llmq/signing_shares.h
src/llmq/**/*.{cpp,h}

📄 CodeRabbit inference engine (CLAUDE.md)

src/llmq/**/*.{cpp,h}: LLMQ quorums must support multiple configurations (50/60, 400/60, 400/85) for different services (ChainLocks, InstantSend, governance voting)
InstantSend implementation must provide distributed key generation for secure transaction locking with quorum consensus
ChainLocks implementation must prevent reorganizations and provide block finality through 51% attack prevention

Files:

  • src/llmq/signing_shares.cpp
  • src/llmq/signing_shares.h
src/{masternode,llmq,evo,coinjoin,governance}/**/*.{cpp,h}

📄 CodeRabbit inference engine (CLAUDE.md)

Use unordered_lru_cache for efficient caching with LRU eviction in Dash-specific data structures

Files:

  • src/llmq/signing_shares.cpp
  • src/llmq/signing_shares.h
🧠 Learnings (7)
📓 Common learnings
Learnt from: kwvg
Repo: dashpay/dash PR: 6761
File: src/chainlock/signing.cpp:247-250
Timestamp: 2025-07-29T14:32:48.369Z
Learning: In PR #6761, kwvg acknowledged a null pointer check issue in ChainLockSigner::Cleanup() method but deferred it to follow-up, consistent with the pattern of avoiding scope creep in refactoring PRs.
Learnt from: kwvg
Repo: dashpay/dash PR: 6543
File: src/wallet/receive.cpp:240-251
Timestamp: 2025-02-06T14:34:30.466Z
Learning: Pull request #6543 is focused on move-only changes and refactoring, specifically backporting from Bitcoin. Behavior changes should be proposed in separate PRs.
Learnt from: UdjinM6
Repo: dashpay/dash PR: 6786
File: ci/test/04_install.sh:99-101
Timestamp: 2025-08-01T07:46:37.840Z
Learning: In backport PRs like #6786, UdjinM6 prefers to defer non-critical fixes (such as shell command expansion issues) to separate commits/PRs to maintain focus on the primary backport objectives, consistent with the project's pattern of avoiding scope creep.
Learnt from: CR
Repo: dashpay/dash PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-11-24T16:41:22.445Z
Learning: Applies to src/llmq/**/*.{cpp,h} : LLMQ quorums must support multiple configurations (50/60, 400/60, 400/85) for different services (ChainLocks, InstantSend, governance voting)
Learnt from: CR
Repo: dashpay/dash PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-11-24T16:41:22.445Z
Learning: Applies to src/llmq/**/*.{cpp,h} : InstantSend implementation must provide distributed key generation for secure transaction locking with quorum consensus
📚 Learning: 2025-11-24T16:41:22.445Z
Learnt from: CR
Repo: dashpay/dash PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-11-24T16:41:22.445Z
Learning: Applies to src/llmq/**/*.{cpp,h} : LLMQ quorums must support multiple configurations (50/60, 400/60, 400/85) for different services (ChainLocks, InstantSend, governance voting)

Applied to files:

  • src/llmq/signing_shares.cpp
  • src/llmq/signing_shares.h
📚 Learning: 2025-11-24T16:41:22.445Z
Learnt from: CR
Repo: dashpay/dash PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-11-24T16:41:22.445Z
Learning: Applies to src/llmq/**/*.{cpp,h} : InstantSend implementation must provide distributed key generation for secure transaction locking with quorum consensus

Applied to files:

  • src/llmq/signing_shares.cpp
  • src/llmq/signing_shares.h
📚 Learning: 2025-11-24T16:41:22.445Z
Learnt from: CR
Repo: dashpay/dash PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-11-24T16:41:22.445Z
Learning: Applies to src/{masternode,llmq}/**/*.{cpp,h} : BLS integration must be used for cryptographic foundation of advanced masternode features

Applied to files:

  • src/llmq/signing_shares.h
📚 Learning: 2025-11-24T16:41:22.445Z
Learnt from: CR
Repo: dashpay/dash PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-11-24T16:41:22.445Z
Learning: Applies to src/llmq/**/*.{cpp,h} : ChainLocks implementation must prevent reorganizations and provide block finality through 51% attack prevention

Applied to files:

  • src/llmq/signing_shares.h
📚 Learning: 2025-09-09T21:36:11.833Z
Learnt from: kwvg
Repo: dashpay/dash PR: 6837
File: src/stats/rawsender.cpp:0-0
Timestamp: 2025-09-09T21:36:11.833Z
Learning: In RawSender class (src/stats/rawsender.cpp), cs_net is the appropriate mutex for protecting socket access (m_sock) and network operations, not additional custom locks. The implementation correctly uses cs_net with GUARDED_BY annotations and EXCLUSIVE_LOCKS_REQUIRED to synchronize socket access between SendDirectly() and ReconnectThread().

Applied to files:

  • src/llmq/signing_shares.h
📚 Learning: 2025-10-02T18:29:54.756Z
Learnt from: kwvg
Repo: dashpay/dash PR: 6840
File: src/net_processing.cpp:2882-2886
Timestamp: 2025-10-02T18:29:54.756Z
Learning: Across net_processing.cpp, once LLMQContext (m_llmq_ctx) is asserted non-null, its subcomponents (e.g., isman, qdkgsman, quorum_block_processor) are treated as initialized and used without extra null checks.

Applied to files:

  • src/llmq/signing_shares.h
🧬 Code graph analysis (2)
src/llmq/signing_shares.cpp (2)
src/llmq/context.cpp (4)
  • Start (50-55)
  • Start (50-50)
  • Stop (57-62)
  • Stop (57-57)
src/llmq/quorums.cpp (4)
  • Start (237-243)
  • Start (237-237)
  • Stop (245-250)
  • Stop (245-245)
src/llmq/signing_shares.h (2)
src/llmq/context.h (1)
  • llmq (22-31)
src/llmq/signing.h (1)
  • CRecoveredSigsListener (158-164)
🔇 Additional comments (9)
src/llmq/signing_shares.h (3)

364-364: LGTM: Explicit namespace qualification improves clarity.

Making the base class namespace explicit is a good practice and aligns with the class definition in src/llmq/signing.h.


431-432: Lifecycle method renaming aligns with codebase patterns.

The renaming from StartWorkerThread()/StopWorkerThread() to Start()/Stop() is consistent with CQuorumManager's lifecycle methods and was already addressed per past review comments.


9-9: ✓ ctpl_stl.h is correctly available and integrated in the repository.

The CTPL thread pool library is integrated into Dash Core, and the file src/ctpl_stl.h exists in the repository. The inclusion of #include <ctpl_stl.h> is a standard and valid approach for this header-only thread pool library.

src/llmq/signing_shares.cpp (6)

207-209: LGTM: Worker pool sizing matches codebase patterns.

The worker pool initialization using hardware_concurrency() / 2 clamped to [1, 4] workers is consistent with CQuorumManager::Start() implementation.


1623-1632: LGTM: Housekeeping thread performs periodic maintenance correctly.

The housekeeping thread properly loops with interrupt checking and reasonable 100ms sleep interval for maintenance tasks.


1634-1646: LGTM: Dispatcher thread enables parallel batch processing.

The 10ms sleep interval is intentionally short (vs. housekeeping's 100ms) to allow multiple workers to process batches in parallel, as explained by the PR author.


1648-1665: LGTM: Efficient lock contention reduction through vector swap.

The swap pattern minimizes lock hold time and the move semantics avoid unnecessary copies when dispatching work to the pool.
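
The swap pattern in isolation, as a self-contained sketch; PendingSig, the mutex, and the free function are stand-ins, only the swap-under-lock idea and the one-task-per-item dispatch come from the review:

#include <ctpl_stl.h>

#include <mutex>
#include <utility>
#include <vector>

struct PendingSig { int placeholder; }; // stand-in for PendingSignatureData

std::mutex cs_pendingSigns;
std::vector<PendingSig> pendingSigns; // producers append under cs_pendingSigns

void DispatchPendingSignsSketch(ctpl::thread_pool& workerPool)
{
    std::vector<PendingSig> work;
    {
        std::lock_guard<std::mutex> lk(cs_pendingSigns);
        work.swap(pendingSigns); // O(1) under the lock: take everything at once
    }
    // Hand items to workers without holding any lock; one task per item.
    for (auto& item : work) {
        workerPool.push([item = std::move(item)](int /*worker_id*/) {
            (void)item; // ... sign and process this single share ...
        });
    }
}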


1687-1696: LGTM: Work-stealing loop correctly processes batches until completion.

The loop continues processing batches while work remains and exits immediately when none is found, avoiding unnecessary spinning.


1698-1721: LGTM: Sign and async enqueue implementations are correct.

Both SignAndProcessSingleShare() and AsyncSign() properly handle signature creation, validation, and thread-safe enqueueing with appropriate lock usage.

UdjinM6 previously approved these changes Nov 24, 2025

@UdjinM6 UdjinM6 left a comment


utACK b8e0762

[this] { HousekeepingThreadMain(); });

// Start dispatcher thread
dispatcherThread = std::thread(&util::TraceThread, "sigsh-dispatch",
Collaborator


nit: I am not sure, but thread names are likely limited to 12 bytes, aren't they?
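
If there is such a limit, a compile-time guard along these lines could catch overlong names early; the 12-character budget below is simply the number from this comment and is an assumption, as is the whole check:

#include <cstddef>
#include <string>

constexpr std::size_t kMaxThreadNameLen = 12; // assumed budget, see the comment above

static_assert(std::char_traits<char>::length("sigsh-maint") <= kMaxThreadNameLen,
              "thread name too long");
static_assert(std::char_traits<char>::length("sigsh-dispat") <= kMaxThreadNameLen,
              "thread name too long");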


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
src/llmq/signing_shares.cpp (1)

1698-1721: Remove redundant BLS validity check in SignAndProcessSingleShare

Both CreateSigShare() (lines 1789) and CreateSigShareForSingleMember() (lines 1742) validate the signature with Get().IsValid() and return std::nullopt if invalid. When they return a non-empty optional, the signature is already guaranteed valid. Re-checking with Get().IsValid() in SignAndProcessSingleShare() (lines 1702–1703) is redundant and incurs an unnecessary mutex lock and Get() call.

Simplify to:

     auto opt_sigShare = CreateSigShare(*work.quorum, work.id, work.msgHash);
-    if (opt_sigShare.has_value() && opt_sigShare->sigShare.Get().IsValid()) {
+    if (opt_sigShare) {
         auto sigShare = std::move(*opt_sigShare);

This preserves behavior while eliminating the second validation check.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b8e0762 and 43635ef.

📒 Files selected for processing (1)
  • src/llmq/signing_shares.cpp (2 hunks)
🧰 Additional context used
📓 Path-based instructions (5)
src/**/*.{cpp,h,hpp,cc}

📄 CodeRabbit inference engine (CLAUDE.md)

Dash Core implementation must be written in C++20, requiring at least Clang 16 or GCC 11.1

Files:

  • src/llmq/signing_shares.cpp
src/{masternode,evo,llmq,governance,coinjoin}/**/*.{cpp,h}

📄 CodeRabbit inference engine (CLAUDE.md)

Use Dash-specific database implementations: CFlatDB for persistent storage (MasternodeMetaStore, GovernanceStore, SporkStore, NetFulfilledRequestStore) and CDBWrapper extensions for Evolution/DKG/InstantSend/Quorum/RecoveredSigs data

Files:

  • src/llmq/signing_shares.cpp
src/{masternode,llmq}/**/*.{cpp,h}

📄 CodeRabbit inference engine (CLAUDE.md)

BLS integration must be used for cryptographic foundation of advanced masternode features

Files:

  • src/llmq/signing_shares.cpp
src/llmq/**/*.{cpp,h}

📄 CodeRabbit inference engine (CLAUDE.md)

src/llmq/**/*.{cpp,h}: LLMQ quorums must support multiple configurations (50/60, 400/60, 400/85) for different services (ChainLocks, InstantSend, governance voting)
InstantSend implementation must provide distributed key generation for secure transaction locking with quorum consensus
ChainLocks implementation must prevent reorganizations and provide block finality through 51% attack prevention

Files:

  • src/llmq/signing_shares.cpp
src/{masternode,llmq,evo,coinjoin,governance}/**/*.{cpp,h}

📄 CodeRabbit inference engine (CLAUDE.md)

Use unordered_lru_cache for efficient caching with LRU eviction in Dash-specific data structures

Files:

  • src/llmq/signing_shares.cpp
🧠 Learnings (14)
📓 Common learnings
Learnt from: kwvg
Repo: dashpay/dash PR: 6761
File: src/chainlock/signing.cpp:247-250
Timestamp: 2025-07-29T14:32:48.369Z
Learning: In PR #6761, kwvg acknowledged a null pointer check issue in ChainLockSigner::Cleanup() method but deferred it to follow-up, consistent with the pattern of avoiding scope creep in refactoring PRs.
Learnt from: kwvg
Repo: dashpay/dash PR: 6543
File: src/wallet/receive.cpp:240-251
Timestamp: 2025-02-06T14:34:30.466Z
Learning: Pull request #6543 is focused on move-only changes and refactoring, specifically backporting from Bitcoin. Behavior changes should be proposed in separate PRs.
Learnt from: UdjinM6
Repo: dashpay/dash PR: 6786
File: ci/test/04_install.sh:99-101
Timestamp: 2025-08-01T07:46:37.840Z
Learning: In backport PRs like #6786, UdjinM6 prefers to defer non-critical fixes (such as shell command expansion issues) to separate commits/PRs to maintain focus on the primary backport objectives, consistent with the project's pattern of avoiding scope creep.
📚 Learning: 2025-11-24T16:41:22.457Z
Learnt from: CR
Repo: dashpay/dash PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-11-24T16:41:22.457Z
Learning: Applies to src/llmq/**/*.{cpp,h} : LLMQ quorums must support multiple configurations (50/60, 400/60, 400/85) for different services (ChainLocks, InstantSend, governance voting)

Applied to files:

  • src/llmq/signing_shares.cpp
📚 Learning: 2025-07-29T14:32:48.369Z
Learnt from: kwvg
Repo: dashpay/dash PR: 6761
File: src/chainlock/signing.cpp:247-250
Timestamp: 2025-07-29T14:32:48.369Z
Learning: In PR #6761, kwvg acknowledged a null pointer check issue in ChainLockSigner::Cleanup() method but deferred it to follow-up, consistent with the pattern of avoiding scope creep in refactoring PRs.

Applied to files:

  • src/llmq/signing_shares.cpp
📚 Learning: 2025-11-24T19:52:02.686Z
Learnt from: UdjinM6
Repo: dashpay/dash PR: 7004
File: src/llmq/signing_shares.cpp:201-204
Timestamp: 2025-11-24T19:52:02.686Z
Learning: In Dash Core, assertions cannot be disabled during compilation. The codebase always has assertions enabled, so concerns about release builds continuing past assert statements are not applicable.

Applied to files:

  • src/llmq/signing_shares.cpp
📚 Learning: 2025-07-17T15:48:29.418Z
Learnt from: kwvg
Repo: dashpay/dash PR: 6752
File: src/wallet/load.cpp:164-164
Timestamp: 2025-07-17T15:48:29.418Z
Learning: In Dash Core, kwvg prefers using assert() statements to document expected behavior in code paths, especially when the called function also has internal asserts. This is used as a defensive programming pattern rather than requiring conditional error handling.

Applied to files:

  • src/llmq/signing_shares.cpp
📚 Learning: 2025-11-24T16:41:22.457Z
Learnt from: CR
Repo: dashpay/dash PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-11-24T16:41:22.457Z
Learning: Dash extends Bitcoin Core through composition with minimal changes to the Bitcoin Core foundation

Applied to files:

  • src/llmq/signing_shares.cpp
📚 Learning: 2025-11-24T16:41:22.457Z
Learnt from: CR
Repo: dashpay/dash PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-11-24T16:41:22.457Z
Learning: Applies to src/**/*.{cpp,h,hpp,cc} : Dash Core implementation must be written in C++20, requiring at least Clang 16 or GCC 11.1

Applied to files:

  • src/llmq/signing_shares.cpp
📚 Learning: 2025-11-04T18:24:27.241Z
Learnt from: UdjinM6
Repo: dashpay/dash PR: 6933
File: src/llmq/utils.cpp:284-298
Timestamp: 2025-11-04T18:24:27.241Z
Learning: In consensus-critical code (such as quorum formation, block validation, or deployment activation logic), do not suggest changes to the logic itself even if the implementation appears theoretically incorrect or off-by-one. Consensus rules, once deployed on the Dash network, must be preserved exactly to avoid network forks. Refactoring PRs should maintain perfect behavioral equivalence. Only suggest logic changes if explicitly accompanied by a DIP (Dash Improvement Proposal) or if the maintainer indicates the consensus rule needs to be changed with appropriate activation logic.

Applied to files:

  • src/llmq/signing_shares.cpp
📚 Learning: 2025-10-05T20:38:28.457Z
Learnt from: knst
Repo: dashpay/dash PR: 6871
File: contrib/guix/libexec/build.sh:358-360
Timestamp: 2025-10-05T20:38:28.457Z
Learning: In the Dash repository, when backporting code from Bitcoin Core, typos and minor issues in comments should be kept as-is to reduce merge conflicts in future backports, even if they remain unfixed in Bitcoin Core's master branch.

Applied to files:

  • src/llmq/signing_shares.cpp
📚 Learning: 2025-08-01T07:46:37.840Z
Learnt from: UdjinM6
Repo: dashpay/dash PR: 6786
File: ci/test/04_install.sh:99-101
Timestamp: 2025-08-01T07:46:37.840Z
Learning: In backport PRs like #6786, UdjinM6 prefers to defer non-critical fixes (such as shell command expansion issues) to separate commits/PRs to maintain focus on the primary backport objectives, consistent with the project's pattern of avoiding scope creep.

Applied to files:

  • src/llmq/signing_shares.cpp
📚 Learning: 2025-07-15T14:53:04.819Z
Learnt from: knst
Repo: dashpay/dash PR: 6691
File: src/test/llmq_params_tests.cpp:148-151
Timestamp: 2025-07-15T14:53:04.819Z
Learning: In the Dash Core LLMQ implementation, signingActiveQuorumCount is never 0 in the actual parameters defined in params.h, making division by zero scenarios unrealistic in the max_cycles() function.

Applied to files:

  • src/llmq/signing_shares.cpp
📚 Learning: 2025-06-09T16:43:20.996Z
Learnt from: kwvg
Repo: dashpay/dash PR: 6718
File: test/functional/test_framework/test_framework.py:2102-2102
Timestamp: 2025-06-09T16:43:20.996Z
Learning: In the test framework consolidation PR (#6718), user kwvg prefers to limit functional changes to those directly related to MasternodeInfo, avoiding scope creep even for minor improvements like error handling consistency.

Applied to files:

  • src/llmq/signing_shares.cpp
📚 Learning: 2025-08-19T14:57:31.801Z
Learnt from: knst
Repo: dashpay/dash PR: 6692
File: src/llmq/blockprocessor.cpp:217-224
Timestamp: 2025-08-19T14:57:31.801Z
Learning: In PR #6692, knst acknowledged a null pointer dereference issue in ProcessBlock() method where LookupBlockIndex may return nullptr but is passed to gsl::not_null, and created follow-up PR #6789 to address it, consistent with avoiding scope creep in performance-focused PRs.

Applied to files:

  • src/llmq/signing_shares.cpp
📚 Learning: 2025-10-13T12:37:12.357Z
Learnt from: knst
Repo: dashpay/dash PR: 6883
File: src/rpc/rawtransaction.cpp:1088-1125
Timestamp: 2025-10-13T12:37:12.357Z
Learning: In backport pull requests (especially from Bitcoin Core), treat "moved" or refactored code as out-of-scope for content-level review. Focus validation on verifying that code is moved correctly: no fields added, no fields removed, no fields reordered, and no unexpected changes beyond whitespace adjustments. Pre-existing issues in the upstream code should be preserved to maintain fidelity to the original implementation.

Applied to files:

  • src/llmq/signing_shares.cpp
🧬 Code graph analysis (1)
src/llmq/signing_shares.cpp (1)
src/llmq/context.cpp (4)
  • Start (50-55)
  • Start (50-50)
  • Stop (57-62)
  • Stop (57-57)
🔇 Additional comments (1)
src/llmq/signing_shares.cpp (1)

1623-1695: Housekeeping and dispatcher threading model looks sound and race‑free

The split into:

  • HousekeepingThreadMain() doing RemoveBannedNodeStates(), SendMessages(), and Cleanup() with a 100 ms sleep (Lines 1623‑1631), and
  • WorkDispatcherThreadMain() looping over DispatchPendingSigns() and DispatchPendingProcessing() with a 10 ms sleep (Lines 1634‑1645),

is consistent with the existing locking scheme:

  • All accesses to nodeStates, pendingIncomingSigShares, and related maps remain protected by cs inside DispatchPendingProcessing() and the ProcessPendingSigShares* helpers (Lines 1669‑1677, 1689‑1693).
  • The signing queue pendingSigns is only touched under cs_pendingSigns in DispatchPendingSigns() and AsyncSign(), avoiding races between producers and the dispatcher (Lines 1651‑1655, 1719‑1721).
  • ProcessPendingSigSharesLoop() loops on ProcessPendingSigShares() until it reports no additional work (Lines 1687‑1693), which matches the existing batching semantics while allowing multiple workers to process batches in parallel.

Given the existing lock usage and the one‑way signaling via workInterrupt, the new thread/dispatcher wiring does not introduce obvious data races and should give the intended parallelism for BLS verification and signing.

Comment on lines +199 to 239
+void CSigSharesManager::Start()
 {
-    // can't start new thread if we have one running already
-    if (workThread.joinable()) {
+    // can't start if threads are already running
+    if (housekeepingThread.joinable() || dispatcherThread.joinable()) {
         assert(false);
     }
 
-    workThread = std::thread(&util::TraceThread, "sigshares", [this] { WorkThreadMain(); });
+    // Initialize worker pool
+    int workerCount = std::clamp(static_cast<int>(std::thread::hardware_concurrency() / 2), 1, 4);
+    workerPool.resize(workerCount);
+    RenameThreadPool(workerPool, "sigsh-work");
+
+    // Start housekeeping thread
+    housekeepingThread = std::thread(&util::TraceThread, "sigsh-maint",
+                                     [this] { HousekeepingThreadMain(); });
+
+    // Start dispatcher thread
+    dispatcherThread = std::thread(&util::TraceThread, "sigsh-dispat",
+                                   [this] { WorkDispatcherThreadMain(); });
 }
 
-void CSigSharesManager::StopWorkerThread()
+void CSigSharesManager::Stop()
 {
     // make sure to call InterruptWorkerThread() first
     if (!workInterrupt) {
         assert(false);
     }
 
-    if (workThread.joinable()) {
-        workThread.join();
+    // Stop worker pool
+    workerPool.clear_queue();
+    workerPool.stop(true);
+
+    // Join threads
+    if (housekeepingThread.joinable()) {
+        housekeepingThread.join();
+    }
+
+    if (dispatcherThread.joinable()) {
+        dispatcherThread.join();
     }
 }

@coderabbitai coderabbitai bot Nov 25, 2025


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Clarify Start/Stop one‑shot semantics and consider workerPool shutdown ordering

  • Using assert(false) when threads are already running (Line 201) or when Stop() is called without a prior InterruptWorkerThread() (Line 223) is fine in Dash Core, since assertions are always enabled and will terminate execution on misuse. No additional return is needed here. Based on learnings, this is an accepted defensive pattern rather than recoverable error handling.
  • workInterrupt is reset only in the constructor. After a normal InterruptWorkerThread() + Stop() cycle, its flag remains set. A subsequent Start() would launch housekeeping/dispatcher threads whose loops immediately exit because while (!workInterrupt) is false. If Start/Stop are intended to be one‑shot per lifetime (which matches typical usage via LLMQContext), it could help future readers to either:
    • Document this explicitly in a comment, or
    • Add an assertion that Start() is not called after Stop(), or
    • (If restarts are desired later) reset workInterrupt in Start() or after joining in Stop().
  • Stop() currently clears and stops the worker pool before joining housekeepingThread / dispatcherThread. While this is probably safe with the current ctpl::thread_pool implementation, joining the dispatcher/housekeeping threads first and then stopping the pool would make the lifetime relationship explicit (no more push() calls can race with stop(true)).

This is not a correctness blocker, but tightening/documenting the lifecycle would make the multithreaded setup easier to reason about and safer to evolve.


🌐 Web query:

Is `ctpl::thread_pool` documented as safe to call `push()` on concurrently with `stop(true)` and after `stop(true)` has been invoked?

💡 Result:

Short answer: No — CTPL does not document push() as safe to call concurrently with or after stop(true). The implementation shows no check that rejects pushes after stop/isDone, so pushes made while/after stop(true) can be lost or never executed.

Evidence:

  • The push() implementation unconditionally q.push(...) then notifies threads (no isStop/isDone check). [1].
  • stop(true) sets isDone, notifies, joins and then clears the queue and destroys threads; after stop(true) returns there are no worker threads to run later pushes. [1].
  • The README/README.md contains no API guarantee about thread-safety of push vs stop. [2].

Conclusion: avoid calling push() after or concurrently with stop(true). If you need that behavior, protect calls with your own synchronization or modify CTPL to reject pushes when stopping.

Sources:
[1] ctpl.h (implementation) — vit-vit/CTPL.
[2] CTPL README — vit-vit/CTPL.


I need to verify whether the housekeeping or dispatcher threads actually call push() on the worker pool. This will determine if the current shutdown order in Stop() poses a real race condition.




🏁 Script executed:

#!/bin/bash
# Search for HousekeepingThreadMain and WorkDispatcherThreadMain definitions,
# and check if they or called functions invoke push() on workerPool
rg -n "HousekeepingThreadMain|WorkDispatcherThreadMain" src/llmq/signing_shares.cpp -A 30

Length of output: 2229


🏁 Script executed:

#!/bin/bash
# Search for all push() calls within DispatchPendingSigns, ProcessPendingSigSharesLoop,
# and related dispatch/housekeeping functions to confirm if workerPool.push is used
rg -nP "workerPool\s*\.\s*push" src/llmq/signing_shares.cpp -B 2 -A 2

Length of output: 378


🏁 Script executed:

#!/bin/bash
# Check what functions are called from within ProcessPendingSigSharesLoop
# to trace if any of them indirectly call push()
rg -n "ProcessPendingSigSharesLoop|ProcessPendingSigShares" src/llmq/signing_shares.cpp -A 20

Length of output: 3231


🏁 Script executed:

#!/bin/bash
# Look for the complete DispatchPendingSigns function and any push() calls within it
rg -n "DispatchPendingSigns" src/llmq/signing_shares.cpp -A 40

Length of output: 1557


🏁 Script executed:

#!/bin/bash
# Check Dispatch* helper functions that may call push()
rg -n "Dispatch" src/llmq/signing_shares.cpp -B 1 -A 15 | head -100

Length of output: 2096


🏁 Script executed:

#!/bin/bash
# Get function signatures and context around workerPool usage
ast-grep --pattern 'workerPool.$_' src/llmq/signing_shares.cpp

Length of output: 354


🏁 Script executed:

#!/bin/bash
# Search CTPL documentation or source for any thread-safety notes
# First, check if CTPL is vendored or included in the repo
find . -name "ctpl*" -o -name "CTPL*" 2>/dev/null | head -20

Length of output: 16


Critical: Unsafe thread shutdown order can lose queued work

The current Stop() order relies on behavior that CTPL does not guarantee and creates a race condition where queued tasks can be lost:

  • DispatchPendingProcessing() (line 1681) calls workerPool.push() without checking workInterrupt, unlike DispatchPendingSigns().
  • Stop() calls clear_queue() + stop(true) (lines 228–229) before joining the dispatcher thread (line 237).
  • When InterruptWorkerThread() is called, the dispatcher thread may still be in DispatchPendingProcessing() and call push(). Since the pool is already stopped, that work is lost.

Fix: Join the threads before stopping the pool to ensure no push() calls can race with stop(true):

void CSigSharesManager::Stop()
{
    if (!workInterrupt) {
        assert(false);
    }

    // Join threads FIRST to stop any pending push() calls
    if (housekeepingThread.joinable()) {
        housekeepingThread.join();
    }
    if (dispatcherThread.joinable()) {
        dispatcherThread.join();
    }

    // Then stop worker pool (now safe, no more push() calls)
    workerPool.clear_queue();
    workerPool.stop(true);
}

Additionally, add if (workInterrupt) return; to DispatchPendingProcessing() before the push() call (line 1681) to match DispatchPendingSigns() and catch the interrupt earlier.
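
That early-out, sketched as a free function with stand-ins for the members involved; only the placement of the if (workInterrupt) return; check comes from the comment above:

#include <ctpl_stl.h>

#include <atomic>
#include <functional>
#include <utility>

void DispatchPendingProcessingSketch(const std::atomic<bool>& workInterrupt,
                                     ctpl::thread_pool& workerPool,
                                     bool hasWork,
                                     std::function<void()> processLoop)
{
    if (workInterrupt.load()) {
        return; // mirror DispatchPendingSigns(): never push() once interrupted
    }
    if (!hasWork) {
        return;
    }
    workerPool.push([processLoop = std::move(processLoop)](int /*worker_id*/) {
        processLoop();
    });
}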

🤖 Prompt for AI Agents
In src/llmq/signing_shares.cpp around lines 199–239, Stop() currently clears and
stops the workerPool before joining the dispatcher/housekeeping threads which
can race with dispatcher calls to workerPool.push() and drop tasks; change
Stop() to join housekeepingThread and dispatcherThread first (if joinable) and
only after both are joined call workerPool.clear_queue() and
workerPool.stop(true). Also update DispatchPendingProcessing() (around line
~1681) to check if (workInterrupt) return; immediately before calling
workerPool.push() so it mirrors DispatchPendingSigns() and prevents pushing
after interrupt.



I guess it makes sense... Thoughts @PastaPastaPasta ?



Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

@github-actions

This pull request has conflicts, please rebase.

