refactor: enhance CSigSharesManager with multi-threaded worker pool and dispatcher #7004
base: develop
Conversation
Walkthrough

The single dedicated worker thread in CSigSharesManager was replaced with a capped `ctpl::thread_pool` plus two dedicated threads: a WorkDispatcher thread and a Housekeeping thread. Lifecycle methods `StartWorkerThread`/`StopWorkerThread` were renamed to `Start`/`Stop`. A new public `AsyncSign(quorum, id, msgHash)` enqueues signing tasks. Internal flow was refactored into `DispatchPendingSigns`, `DispatchPendingProcessing`, `ProcessPendingSigSharesLoop`, and `SignAndProcessSingleShare` worker tasks. `ctpl_stl.h` was added and the base listener type changed to `llmq::CRecoveredSigsListener`.

Sequence Diagram

```mermaid
sequenceDiagram
    participant Caller
    participant Manager as CSigSharesManager
    participant Dispatcher as WorkDispatcherThread
    participant Pool as WorkerPool
    participant Housekeeping as HousekeepingThread
    Caller->>Manager: AsyncSign(quorum,id,msgHash)
    activate Manager
    Manager->>Manager: enqueue PendingSignatureData
    deactivate Manager
    Dispatcher->>Manager: WorkDispatcherThreadMain()
    activate Dispatcher
    Dispatcher->>Manager: DispatchPendingSigns()
    activate Manager
    Manager->>Pool: enqueue SignAndProcessSingleShare tasks
    deactivate Manager
    Dispatcher->>Manager: DispatchPendingProcessing()
    activate Manager
    Manager->>Pool: maybe spawn ProcessPendingSigSharesLoop
    deactivate Manager
    deactivate Dispatcher
    Pool->>Pool: SignAndProcessSingleShare()
    activate Pool
    Note over Pool: create/process single sigShare<br/>update session state when ready
    deactivate Pool
    Pool->>Pool: ProcessPendingSigSharesLoop()
    activate Pool
    Note over Pool: loop ProcessPendingSigShares until none remain
    deactivate Pool
    Housekeeping->>Housekeeping: HousekeepingThreadMain()
    activate Housekeeping
    Note over Housekeeping: periodic RemoveBannedNodeStates,<br/>SendMessages, Cleanup
    deactivate Housekeeping
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20–30 minutes

Pre-merge checks and finishing touches: ❌ Failed checks (1 warning), ✅ Passed checks (2 passed)
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/llmq/signing_shares.cpp (1)
1-1: Critical: Fix code formatting issues

The CI pipeline reports clang-format differences in this file. Please run `clang-format` to fix the formatting issues before merging.

```shell
#!/bin/bash
# Run clang-format on the diff to see differences
git diff -U0 --no-color HEAD^ HEAD -- src/llmq/signing_shares.cpp | clang-format-diff.py -p1
```
🧹 Nitpick comments (1)
src/llmq/signing_shares.cpp (1)
1665-1683: Consider spawn frequency for processing workers

The dispatcher calls `DispatchPendingProcessing()` every 10ms. If pending work exists and processing takes longer than 10ms, this could queue multiple `ProcessPendingSigSharesLoop()` tasks in the worker pool.

While the fixed-size worker pool (1-4 workers) prevents unbounded thread creation, queuing many identical loop tasks could lead to redundant work, since `ProcessPendingSigSharesLoop()` will exit once it finds no more work.

Consider adding a flag to prevent queuing additional processing tasks when one is already queued or running, or verify this pattern is intentional for load distribution.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- src/llmq/signing_shares.cpp (2 hunks)
- src/llmq/signing_shares.h (5 hunks)
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: kwvg
Repo: dashpay/dash PR: 6543
File: src/wallet/receive.cpp:240-251
Timestamp: 2025-02-06T14:34:30.466Z
Learning: Pull request #6543 is focused on move-only changes and refactoring, specifically backporting from Bitcoin. Behavior changes should be proposed in separate PRs.
Learnt from: kwvg
Repo: dashpay/dash PR: 6761
File: src/chainlock/signing.cpp:247-250
Timestamp: 2025-07-29T14:32:48.369Z
Learning: In PR #6761, kwvg acknowledged a null pointer check issue in ChainLockSigner::Cleanup() method but deferred it to follow-up, consistent with the pattern of avoiding scope creep in refactoring PRs.
📚 Learning: 2025-09-09T21:36:11.833Z
Learnt from: kwvg
Repo: dashpay/dash PR: 6837
File: src/stats/rawsender.cpp:0-0
Timestamp: 2025-09-09T21:36:11.833Z
Learning: In RawSender class (src/stats/rawsender.cpp), cs_net is the appropriate mutex for protecting socket access (m_sock) and network operations, not additional custom locks. The implementation correctly uses cs_net with GUARDED_BY annotations and EXCLUSIVE_LOCKS_REQUIRED to synchronize socket access between SendDirectly() and ReconnectThread().
Applied to files:
src/llmq/signing_shares.h
📚 Learning: 2025-10-02T18:29:54.756Z
Learnt from: kwvg
Repo: dashpay/dash PR: 6840
File: src/net_processing.cpp:2882-2886
Timestamp: 2025-10-02T18:29:54.756Z
Learning: Across net_processing.cpp, once LLMQContext (m_llmq_ctx) is asserted non-null, its subcomponents (e.g., isman, qdkgsman, quorum_block_processor) are treated as initialized and used without extra null checks.
Applied to files:
src/llmq/signing_shares.h
🧬 Code graph analysis (2)
src/llmq/signing_shares.h (2)
- src/llmq/signing.h (1): CRecoveredSigsListener (158-164)
- src/masternode/active/context.cpp (4): Start (54-59), Start (54-54), Stop (61-66), Stop (61-61)

src/llmq/signing_shares.cpp (3)
- src/llmq/context.cpp (4): Start (50-55), Start (50-50), Stop (57-62), Stop (57-57)
- src/llmq/quorums.cpp (4): Start (237-243), Start (237-237), Stop (245-250), Stop (245-245)
- src/masternode/active/context.cpp (4): Start (54-59), Start (54-54), Stop (61-66), Stop (61-61)
🪛 GitHub Actions: Clang Diff Format Check
src/llmq/signing_shares.cpp
[error] 1-1: Clang-format differences found in src/llmq/signing_shares.cpp. Run the clang-format-diff.py tool (or clang-format) to fix code formatting. Command diff step exited with code 1 due to differences.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
- GitHub Check: linux64_tsan-build / Build source
- GitHub Check: x86_64-w64-mingw32 / Build depends
- GitHub Check: x86_64-apple-darwin / Build depends
- GitHub Check: Lint / Run linters
- GitHub Check: arm-linux-gnueabihf / Build depends
- GitHub Check: x86_64-pc-linux-gnu / Build depends
- GitHub Check: x86_64-pc-linux-gnu_nowallet / Build depends
🔇 Additional comments (14)
src/llmq/signing_shares.h (6)
9-9: LGTM: Thread pool library inclusion
The `ctpl_stl.h` include is necessary for the worker pool architecture introduced in this refactor.

364-364: LGTM: Explicit namespace qualification
Making the namespace explicit improves code clarity without changing functionality.

383-386: LGTM: Thread pool architecture
The shift from a single `workThread` to a `workerPool` with dedicated `housekeepingThread` and `dispatcherThread` aligns with the PR objectives to enable parallel sigShare signing.

439-440: LGTM: New public API for async signing
The new `AsyncSign()` method provides a cleaner interface for enqueueing signing tasks to the worker pool. The lock annotations correctly specify that callers must not hold either `cs_pendingSigns` or `cs`.

398-404: LGTM: Data structure for worker tasks
The `PendingSignatureData` struct appropriately captures the data needed for each signing task. Using const members helps ensure thread safety.

503-513: LGTM: Well-structured dispatcher and worker methods
The new thread entry points and dispatcher methods are properly declared with correct lock annotations. The separation of concerns (housekeeping, dispatching, and worker tasks) is clear and maintainable.
src/llmq/signing_shares.cpp (8)

220-239: LGTM: Proper thread pool shutdown sequence
The `Stop()` method correctly:
- Clears the worker pool queue to prevent new tasks from executing
- Stops the pool with `wait=true` to allow in-flight tasks to complete
- Joins threads only if they're joinable

The requirement to call `InterruptWorkerThread()` before `Stop()` is documented and enforced (lines 222-225).
1614-1630: LGTM: Housekeeping thread implementation
The housekeeping thread correctly handles:
- Cleaning up banned node states
- Rate-limiting message sending to every 100ms
- Running periodic cleanup
- Clean shutdown via `workInterrupt`

1632-1644: LGTM: Dispatcher thread implementation
The dispatcher thread appropriately:
- Dispatches all pending sign tasks to the worker pool
- Conditionally spawns workers when there's incoming signature share processing work
- Uses a 10ms sleep interval for responsive dispatching
- Handles clean shutdown via `workInterrupt`

1646-1663: LGTM: Efficient pending sign dispatch
The method correctly drains all pending signs from the queue and dispatches them to the worker pool. The use of `std::optional` and move semantics ensures efficient data transfer without copies.
1685-1694: LGTM: Efficient processing loop
The loop correctly processes batches of pending signature shares until no work remains, then exits immediately to free the worker for other tasks.

1696-1713: LGTM: Single share signing and processing
The method correctly:
- Creates a signature share
- Validates it before processing
- Updates the signed sessions when all-members-connected is enabled
- Acquires appropriate locks for shared state updates

1715-1719: LGTM: Clean async signing API
The `AsyncSign()` method provides a simple, thread-safe interface for enqueueing signing tasks. The use of move semantics for the quorum pointer is appropriate.
199-218: Remove incorrect comparison and clarify that this pattern is not widely established
The cited comparison to `CQuorumManager::Start()` in src/llmq/quorums.cpp:236-242 is incorrect: that function does not contain a joinable/assert(false) pattern. While similar assert(false) checks on joinable() conditions exist in src/instantsend/net_instantsend.cpp and src/llmq/signing.cpp, this pattern is not a widespread established practice in the codebase.

The underlying concern about assert(false) having no effect in release builds with NDEBUG remains valid, but the justification claiming this follows an established pattern should be removed.

Likely an incorrect or invalid review comment.
Actionable comments posted: 0
🧹 Nitpick comments (4)
src/llmq/signing_shares.cpp (4)
199-239: Start/Stop lifecycle for worker pool and helper threads is reasonable; consider documenting/guarding single-use assumption

The new `Start()`/`Stop()` correctly:
- Prevent starting when housekeeping/dispatcher threads are already joinable.
- Size and name the worker pool (1-4 threads, ~half of `hardware_concurrency()`).
- Launch a housekeeping thread and a dispatcher thread wrapped in `util::TraceThread`.
- On `Stop()`, require `workInterrupt` to have been triggered, then clear the queue, stop the pool, and join both helper threads.

This looks sound given the existing contract that `InterruptWorkerThread()` is called before `Stop()`. One subtle point: `Start()` doesn't reset `workInterrupt`, so this instance is effectively intended to have a single Start/Interrupt/Stop lifecycle. If you ever expect multiple Start/Stop cycles on the same `CSigSharesManager` instance, you'll need either a `workInterrupt.reset()` (plus appropriate asserts) in `Start()` or an explicit comment asserting single-use semantics.
1639-1687: DispatchPendingSigns / DispatchPendingProcessing: concurrency model is safe; optional guard to limit helper loop fan-out

The new scheduling looks correct:
- `DispatchPendingSigns()` atomically swaps `pendingSigns` under `cs_pendingSigns` and enqueues one `SignAndProcessSingleShare` task per pending item into `workerPool`, breaking early if `workInterrupt` is set.
- `DispatchPendingProcessing()` checks under `cs` whether any `pendingIncomingSigShares` remain and, when true, enqueues a `ProcessPendingSigSharesLoop()` helper task.
- `ProcessPendingSigSharesLoop()` repeatedly calls `ProcessPendingSigShares()` until it reports no more work or `workInterrupt` becomes true; `ProcessPendingSigShares()` itself is internally synchronized with `cs`.

Functionally this allows multiple workers to help drain the verification/processing backlog without data races. The only potential concern is that `DispatchPendingProcessing()` can enqueue additional `ProcessPendingSigSharesLoop()` tasks every 10 ms while work exists, even if one or more such helpers are already running; extra helpers will quickly exit when they find no more work, but they still consume queue entries.

If you see worker-pool saturation or unnecessary task churn in profiling, consider adding a lightweight guard (e.g., an atomic "processing loop in flight" flag cleared when `ProcessPendingSigSharesLoop()` exits) so at most one such loop runs at a time, relying on that loop to drain the backlog. This is an optional optimization; the current design is correct but may be slightly more chatty in very busy scenarios.
1689-1712: AsyncSign + SignAndProcessSingleShare look correct; minor perf and thread-safety considerations

The end-to-end path:
- `AsyncSign()` queues `PendingSignatureData` under `cs_pendingSigns`.
- The dispatcher swaps `pendingSigns` out and enqueues `SignAndProcessSingleShare()` tasks into `workerPool`.
- Each worker calls `CreateSigShare(*work.quorum, work.id, work.msgHash)` and, on success, feeds the result into `ProcessSigShare()`, updating `signedSessions` under `cs` for all-members-connected mode.

This is logically sound and respects existing locking (`cs_pendingSigns` and `cs`). Two small points:
- `CreateSigShare()` already verifies the produced signature and only returns a `std::optional<CSigShare>` on success, so the extra `opt_sigShare->sigShare.Get().IsValid()` check redoes expensive BLS work. You could rely on `CreateSigShare()`'s contract (or move the validity check there exclusively) to avoid this redundant verification on a hot path.
- BLS signing is now happening concurrently across multiple worker threads. That's the goal of this refactor, but it does assume the underlying BLS implementation (including `m_mn_activeman.Sign()` and `CBLSSecretKey::Sign()`) is thread-safe for concurrent use. It likely is, but it's worth double-checking with library documentation or stress tests.
1716-1754: CreateSigShare membership checks and single-member path are solid; logging format could be clarified

The updated `CreateSigShare()`:
- Fetches `activeMasterNodeProTxHash` once and immediately returns `std::nullopt` if `quorum.IsValidMember()` is false, preventing accidental signing when this node isn't a valid member.
- Handles the `is_single_member()` case explicitly by:
  - Deriving `memberIdx` consistently from `GetMemberIndex()`.
  - Signing with `m_mn_activeman.Sign()` and validating the resulting lazy BLS signature before returning.
- Keeps the multi-member path using `GetSkShare()` with a validity check, then signs and validates similarly.

This tightens correctness around membership and makes the single-member quorum behavior explicit. One minor nit: both failure and success logs here use `time=%s` with `t.count()` as the corresponding argument, while other timers in this file use an integer format specifier (e.g., `%d`). For consistency and to avoid confusion about the underlying type, consider switching these format specifiers to match the rest of the file (e.g., `%d` or another numeric specifier appropriate for `cxxtimer::Timer::count()`).
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- src/llmq/signing_shares.cpp (2 hunks)
- src/masternode/active/context.cpp (1 hunks)
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: kwvg
Repo: dashpay/dash PR: 6761
File: src/chainlock/signing.cpp:247-250
Timestamp: 2025-07-29T14:32:48.369Z
Learning: In PR #6761, kwvg acknowledged a null pointer check issue in ChainLockSigner::Cleanup() method but deferred it to follow-up, consistent with the pattern of avoiding scope creep in refactoring PRs.
Learnt from: kwvg
Repo: dashpay/dash PR: 6543
File: src/wallet/receive.cpp:240-251
Timestamp: 2025-02-06T14:34:30.466Z
Learning: Pull request #6543 is focused on move-only changes and refactoring, specifically backporting from Bitcoin. Behavior changes should be proposed in separate PRs.
📚 Learning: 2025-10-02T18:29:54.756Z
Learnt from: kwvg
Repo: dashpay/dash PR: 6840
File: src/net_processing.cpp:2882-2886
Timestamp: 2025-10-02T18:29:54.756Z
Learning: Across net_processing.cpp, once LLMQContext (m_llmq_ctx) is asserted non-null, its subcomponents (e.g., isman, qdkgsman, quorum_block_processor) are treated as initialized and used without extra null checks.
Applied to files:
src/masternode/active/context.cpp
📚 Learning: 2025-10-02T18:29:54.756Z
Learnt from: kwvg
Repo: dashpay/dash PR: 6840
File: src/net_processing.cpp:2882-2886
Timestamp: 2025-10-02T18:29:54.756Z
Learning: In src/net_processing.cpp, if ActiveContext (m_active_ctx) is non-null, its members (including cj_server) are guaranteed to be initialized; call sites can safely dereference m_active_ctx->cj_server without an additional null-check.
Applied to files:
src/masternode/active/context.cpp
🧬 Code graph analysis (2)
src/masternode/active/context.cpp (2)
- src/llmq/signing_shares.cpp (2): Stop (220-239), Stop (220-220)
- src/llmq/context.cpp (2): Stop (57-62), Stop (57-57)

src/llmq/signing_shares.cpp (1)
- src/llmq/context.cpp (4): Start (50-55), Start (50-50), Stop (57-62), Stop (57-57)
🪛 GitHub Actions: Clang Diff Format Check
src/llmq/signing_shares.cpp
[error] 1-1: Clang format differences detected by clang-format-diff.py. Please run the formatting script to update src/llmq/signing_shares.cpp (diff shown in logs) and re-run CI.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
- GitHub Check: linux64_tsan-build / Build source
- GitHub Check: Lint / Run linters
- GitHub Check: x86_64-w64-mingw32 / Build depends
- GitHub Check: x86_64-apple-darwin / Build depends
- GitHub Check: x86_64-pc-linux-gnu_nowallet / Build depends
- GitHub Check: x86_64-pc-linux-gnu / Build depends
- GitHub Check: arm-linux-gnueabihf / Build depends
🔇 Additional comments (3)
src/masternode/active/context.cpp (1)

54-66: CSigSharesManager Start/Stop usage and ordering look correct
Swapping `StartWorkerThread()`/`StopWorkerThread()` for `Start()`/`Stop()` here keeps the lifecycle and ordering intact: you still start QDKGS threads, then register as a recovered-sigs listener, then start `shareman`, and reverse that on shutdown. With `Interrupt()` still delegating to `shareman->InterruptWorkerThread()`, the new multi-threaded manager should integrate cleanly with the existing ActiveContext shutdown flow.

src/llmq/signing_shares.cpp (2)

1614-1637: Housekeeping and dispatcher threads cleanly separate concerns
Moving `RemoveBannedNodeStates()`, `SendMessages()`, and `Cleanup()` into `HousekeepingThreadMain()` and using a separate `WorkDispatcherThreadMain()` to drive signing/processing work cleanly decouples network/cleanup duties from BLS signing. Both loops respect `workInterrupt` and use `sleep_for` to avoid busy-waiting, which should behave similarly to the old single-thread loop but without blocking signing on slow networking or cleanup.

Assuming there are no other call sites invoking these methods concurrently (which matches what's visible here), this separation looks correct.

199-239: Verify formatting by running clang-format locally on this file
The file src/llmq/signing_shares.cpp is subject to the clang-format check (matched by src/llmq/*.cpp in the non-backported.txt list). The code as currently shown appears properly formatted, but to confirm the CI formatting check will pass, please run the project's clang-format tooling locally and verify no changes are reported:

```shell
git diff origin/develop -- src/llmq/signing_shares.cpp | ./contrib/devtools/clang-format-diff.py -p1
```

If this produces no output, the formatting is correct and the comment can be resolved.
It was intentional to loop here, to avoid forcing the 10ms wait and to allow multiple workers to process batches simultaneously / in parallel. A guard like that would in practice limit us to just one worker at a time.
Makes sense. That's not how we usually use a worker pool (providing specific pieces to work on) but it should work I guess 👍
UdjinM6
left a comment
utACK 228e568
nit: could maybe apply clang-format suggestions

This pull request has conflicts, please rebase.
… thread safety
- Removed unnecessary lastSendTime variable and simplified message sending in HousekeepingThreadMain.
- Enhanced DispatchPendingSigns by swapping the entire vector of pending signs to reduce lock contention and improve performance.
- Updated ActiveContext to start and stop the share manager more efficiently.

Force-pushed 228e568 to b8e0762 (Compare)
Actionable comments posted: 2
🧹 Nitpick comments (1)
src/llmq/signing_shares.cpp (1)
1679-1683: Consider tracking in-flight processing workers to avoid redundant tasks.

The current implementation spawns a new worker every 10ms if work exists, even if a previously spawned worker is still processing. While the fixed-size pool prevents unbounded resource usage, this could queue many redundant tasks if work persists.

Consider adding a flag to track whether a processing worker is already active:

```diff
+ std::atomic<bool> processingActive{false};
+
  void CSigSharesManager::DispatchPendingProcessing()
  {
+     if (processingActive.load()) {
+         return; // Already processing
+     }
      // Check if there's work, spawn a helper if so
      bool hasWork = false;
      {
          LOCK(cs);
          hasWork = std::any_of(nodeStates.begin(), nodeStates.end(),
                                [](const auto& entry) { return !entry.second.pendingIncomingSigShares.Empty(); });
      }
      if (hasWork) {
+         processingActive.store(true);
          // Work exists - spawn a worker to help!
          workerPool.push([this](int) {
              ProcessPendingSigSharesLoop();
+             processingActive.store(false);
          });
      }
  }
```

Note: This optimization was discussed in past review comments and deemed acceptable as-is.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- src/llmq/signing_shares.cpp (2 hunks)
- src/llmq/signing_shares.h (5 hunks)
- src/masternode/active/context.cpp (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- src/masternode/active/context.cpp
🧰 Additional context used
📓 Path-based instructions (5)
src/**/*.{cpp,h,hpp,cc}
📄 CodeRabbit inference engine (CLAUDE.md)
Dash Core implementation must be written in C++20, requiring at least Clang 16 or GCC 11.1
Files:
src/llmq/signing_shares.cpp, src/llmq/signing_shares.h
src/{masternode,evo,llmq,governance,coinjoin}/**/*.{cpp,h}
📄 CodeRabbit inference engine (CLAUDE.md)
Use Dash-specific database implementations: CFlatDB for persistent storage (MasternodeMetaStore, GovernanceStore, SporkStore, NetFulfilledRequestStore) and CDBWrapper extensions for Evolution/DKG/InstantSend/Quorum/RecoveredSigs data
Files:
src/llmq/signing_shares.cpp, src/llmq/signing_shares.h
src/{masternode,llmq}/**/*.{cpp,h}
📄 CodeRabbit inference engine (CLAUDE.md)
BLS integration must be used for cryptographic foundation of advanced masternode features
Files:
src/llmq/signing_shares.cpp, src/llmq/signing_shares.h
src/llmq/**/*.{cpp,h}
📄 CodeRabbit inference engine (CLAUDE.md)
src/llmq/**/*.{cpp,h}: LLMQ quorums must support multiple configurations (50/60, 400/60, 400/85) for different services (ChainLocks, InstantSend, governance voting)
InstantSend implementation must provide distributed key generation for secure transaction locking with quorum consensus
ChainLocks implementation must prevent reorganizations and provide block finality through 51% attack prevention
Files:
src/llmq/signing_shares.cpp, src/llmq/signing_shares.h
src/{masternode,llmq,evo,coinjoin,governance}/**/*.{cpp,h}
📄 CodeRabbit inference engine (CLAUDE.md)
Use unordered_lru_cache for efficient caching with LRU eviction in Dash-specific data structures
Files:
src/llmq/signing_shares.cpp, src/llmq/signing_shares.h
🧠 Learnings (7)
📓 Common learnings
Learnt from: kwvg
Repo: dashpay/dash PR: 6761
File: src/chainlock/signing.cpp:247-250
Timestamp: 2025-07-29T14:32:48.369Z
Learning: In PR #6761, kwvg acknowledged a null pointer check issue in ChainLockSigner::Cleanup() method but deferred it to follow-up, consistent with the pattern of avoiding scope creep in refactoring PRs.
Learnt from: kwvg
Repo: dashpay/dash PR: 6543
File: src/wallet/receive.cpp:240-251
Timestamp: 2025-02-06T14:34:30.466Z
Learning: Pull request #6543 is focused on move-only changes and refactoring, specifically backporting from Bitcoin. Behavior changes should be proposed in separate PRs.
Learnt from: UdjinM6
Repo: dashpay/dash PR: 6786
File: ci/test/04_install.sh:99-101
Timestamp: 2025-08-01T07:46:37.840Z
Learning: In backport PRs like #6786, UdjinM6 prefers to defer non-critical fixes (such as shell command expansion issues) to separate commits/PRs to maintain focus on the primary backport objectives, consistent with the project's pattern of avoiding scope creep.
Learnt from: CR
Repo: dashpay/dash PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-11-24T16:41:22.445Z
Learning: Applies to src/llmq/**/*.{cpp,h} : LLMQ quorums must support multiple configurations (50/60, 400/60, 400/85) for different services (ChainLocks, InstantSend, governance voting)
Learnt from: CR
Repo: dashpay/dash PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-11-24T16:41:22.445Z
Learning: Applies to src/llmq/**/*.{cpp,h} : InstantSend implementation must provide distributed key generation for secure transaction locking with quorum consensus
📚 Learning: 2025-11-24T16:41:22.445Z
Learnt from: CR
Repo: dashpay/dash PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-11-24T16:41:22.445Z
Learning: Applies to src/llmq/**/*.{cpp,h} : LLMQ quorums must support multiple configurations (50/60, 400/60, 400/85) for different services (ChainLocks, InstantSend, governance voting)
Applied to files:
src/llmq/signing_shares.cpp, src/llmq/signing_shares.h
📚 Learning: 2025-11-24T16:41:22.445Z
Learnt from: CR
Repo: dashpay/dash PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-11-24T16:41:22.445Z
Learning: Applies to src/llmq/**/*.{cpp,h} : InstantSend implementation must provide distributed key generation for secure transaction locking with quorum consensus
Applied to files:
src/llmq/signing_shares.cpp, src/llmq/signing_shares.h
📚 Learning: 2025-11-24T16:41:22.445Z
Learnt from: CR
Repo: dashpay/dash PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-11-24T16:41:22.445Z
Learning: Applies to src/{masternode,llmq}/**/*.{cpp,h} : BLS integration must be used for cryptographic foundation of advanced masternode features
Applied to files:
src/llmq/signing_shares.h
📚 Learning: 2025-11-24T16:41:22.445Z
Learnt from: CR
Repo: dashpay/dash PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-11-24T16:41:22.445Z
Learning: Applies to src/llmq/**/*.{cpp,h} : ChainLocks implementation must prevent reorganizations and provide block finality through 51% attack prevention
Applied to files:
src/llmq/signing_shares.h
📚 Learning: 2025-09-09T21:36:11.833Z
Learnt from: kwvg
Repo: dashpay/dash PR: 6837
File: src/stats/rawsender.cpp:0-0
Timestamp: 2025-09-09T21:36:11.833Z
Learning: In RawSender class (src/stats/rawsender.cpp), cs_net is the appropriate mutex for protecting socket access (m_sock) and network operations, not additional custom locks. The implementation correctly uses cs_net with GUARDED_BY annotations and EXCLUSIVE_LOCKS_REQUIRED to synchronize socket access between SendDirectly() and ReconnectThread().
Applied to files:
src/llmq/signing_shares.h
📚 Learning: 2025-10-02T18:29:54.756Z
Learnt from: kwvg
Repo: dashpay/dash PR: 6840
File: src/net_processing.cpp:2882-2886
Timestamp: 2025-10-02T18:29:54.756Z
Learning: Across net_processing.cpp, once LLMQContext (m_llmq_ctx) is asserted non-null, its subcomponents (e.g., isman, qdkgsman, quorum_block_processor) are treated as initialized and used without extra null checks.
Applied to files:
src/llmq/signing_shares.h
🧬 Code graph analysis (2)
src/llmq/signing_shares.cpp (2)
- src/llmq/context.cpp (4): Start (50-55), Start (50-50), Stop (57-62), Stop (57-57)
- src/llmq/quorums.cpp (4): Start (237-243), Start (237-237), Stop (245-250), Stop (245-245)

src/llmq/signing_shares.h (2)
- src/llmq/context.h (1): llmq (22-31)
- src/llmq/signing.h (1): CRecoveredSigsListener (158-164)
🔇 Additional comments (9)
src/llmq/signing_shares.h (3)
364-364: LGTM: Explicit namespace qualification improves clarity.
Making the base class namespace explicit is a good practice and aligns with the class definition in src/llmq/signing.h.

431-432: Lifecycle method renaming aligns with codebase patterns.
The renaming from `StartWorkerThread()`/`StopWorkerThread()` to `Start()`/`Stop()` is consistent with `CQuorumManager`'s lifecycle methods and was already addressed per past review comments.

9-9: ✓ ctpl_stl.h is correctly available and integrated in the repository.
The CTPL thread pool library is integrated into Dash Core, and the file src/ctpl_stl.h exists in the repository. The inclusion of `#include <ctpl_stl.h>` is a standard and valid approach for this header-only thread pool library.

src/llmq/signing_shares.cpp (6)

207-209: LGTM: Worker pool sizing matches codebase patterns.
The worker pool initialization using `hardware_concurrency() / 2` clamped to [1, 4] workers is consistent with the `CQuorumManager::Start()` implementation.

1623-1632: LGTM: Housekeeping thread performs periodic maintenance correctly.
The housekeeping thread properly loops with interrupt checking and a reasonable 100ms sleep interval for maintenance tasks.

1634-1646: LGTM: Dispatcher thread enables parallel batch processing.
The 10ms sleep interval is intentionally short (vs. housekeeping's 100ms) to allow multiple workers to process batches in parallel, as explained by the PR author.

1648-1665: LGTM: Efficient lock contention reduction through vector swap.
The swap pattern minimizes lock hold time and the move semantics avoid unnecessary copies when dispatching work to the pool.

1687-1696: LGTM: Work-stealing loop correctly processes batches until completion.
The loop continues processing batches while work remains and exits immediately when none is found, avoiding unnecessary spinning.

1698-1721: LGTM: Sign and async enqueue implementations are correct.
Both `SignAndProcessSingleShare()` and `AsyncSign()` properly handle signature creation, validation, and thread-safe enqueueing with appropriate lock usage.
UdjinM6
left a comment
utACK b8e0762
src/llmq/signing_shares.cpp (Outdated)

```cpp
        [this] { HousekeepingThreadMain(); });

    // Start dispatcher thread
    dispatcherThread = std::thread(&util::TraceThread, "sigsh-dispatch",
```
nit: I am not sure, but thread names are likely limited to 12 bytes, aren't they?
Actionable comments posted: 1
🧹 Nitpick comments (1)
src/llmq/signing_shares.cpp (1)
1698-1721: Remove redundant BLS validity check in `SignAndProcessSingleShare`

Both `CreateSigShare()` (line 1789) and `CreateSigShareForSingleMember()` (line 1742) validate the signature with `Get().IsValid()` and return `std::nullopt` if invalid. When they return a non-empty optional, the signature is already guaranteed valid. Re-checking with `Get().IsValid()` in `SignAndProcessSingleShare()` (lines 1702–1703) is redundant and incurs an unnecessary mutex lock and `Get()` call.

Simplify to:

```diff
 auto opt_sigShare = CreateSigShare(*work.quorum, work.id, work.msgHash);
-if (opt_sigShare.has_value() && opt_sigShare->sigShare.Get().IsValid()) {
+if (opt_sigShare) {
     auto sigShare = std::move(*opt_sigShare);
```

This preserves behavior while eliminating the second validation check.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/llmq/signing_shares.cpp (2 hunks)
🧰 Additional context used
📓 Path-based instructions (5)
src/**/*.{cpp,h,hpp,cc}
📄 CodeRabbit inference engine (CLAUDE.md)
Dash Core implementation must be written in C++20, requiring at least Clang 16 or GCC 11.1
Files:
src/llmq/signing_shares.cpp
src/{masternode,evo,llmq,governance,coinjoin}/**/*.{cpp,h}
📄 CodeRabbit inference engine (CLAUDE.md)
Use Dash-specific database implementations: CFlatDB for persistent storage (MasternodeMetaStore, GovernanceStore, SporkStore, NetFulfilledRequestStore) and CDBWrapper extensions for Evolution/DKG/InstantSend/Quorum/RecoveredSigs data
Files:
src/llmq/signing_shares.cpp
src/{masternode,llmq}/**/*.{cpp,h}
📄 CodeRabbit inference engine (CLAUDE.md)
BLS integration must be used for cryptographic foundation of advanced masternode features
Files:
src/llmq/signing_shares.cpp
src/llmq/**/*.{cpp,h}
📄 CodeRabbit inference engine (CLAUDE.md)
src/llmq/**/*.{cpp,h}: LLMQ quorums must support multiple configurations (50/60, 400/60, 400/85) for different services (ChainLocks, InstantSend, governance voting)
InstantSend implementation must provide distributed key generation for secure transaction locking with quorum consensus
ChainLocks implementation must prevent reorganizations and provide block finality through 51% attack prevention
Files:
src/llmq/signing_shares.cpp
src/{masternode,llmq,evo,coinjoin,governance}/**/*.{cpp,h}
📄 CodeRabbit inference engine (CLAUDE.md)
Use unordered_lru_cache for efficient caching with LRU eviction in Dash-specific data structures
Files:
src/llmq/signing_shares.cpp
🧠 Learnings (14)
📓 Common learnings
Learnt from: kwvg
Repo: dashpay/dash PR: 6761
File: src/chainlock/signing.cpp:247-250
Timestamp: 2025-07-29T14:32:48.369Z
Learning: In PR #6761, kwvg acknowledged a null pointer check issue in ChainLockSigner::Cleanup() method but deferred it to follow-up, consistent with the pattern of avoiding scope creep in refactoring PRs.
Learnt from: kwvg
Repo: dashpay/dash PR: 6543
File: src/wallet/receive.cpp:240-251
Timestamp: 2025-02-06T14:34:30.466Z
Learning: Pull request #6543 is focused on move-only changes and refactoring, specifically backporting from Bitcoin. Behavior changes should be proposed in separate PRs.
Learnt from: UdjinM6
Repo: dashpay/dash PR: 6786
File: ci/test/04_install.sh:99-101
Timestamp: 2025-08-01T07:46:37.840Z
Learning: In backport PRs like #6786, UdjinM6 prefers to defer non-critical fixes (such as shell command expansion issues) to separate commits/PRs to maintain focus on the primary backport objectives, consistent with the project's pattern of avoiding scope creep.
📚 Learning: 2025-11-24T16:41:22.457Z
Learnt from: CR
Repo: dashpay/dash PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-11-24T16:41:22.457Z
Learning: Applies to src/llmq/**/*.{cpp,h} : LLMQ quorums must support multiple configurations (50/60, 400/60, 400/85) for different services (ChainLocks, InstantSend, governance voting)
Applied to files:
src/llmq/signing_shares.cpp
📚 Learning: 2025-07-29T14:32:48.369Z
Learnt from: kwvg
Repo: dashpay/dash PR: 6761
File: src/chainlock/signing.cpp:247-250
Timestamp: 2025-07-29T14:32:48.369Z
Learning: In PR #6761, kwvg acknowledged a null pointer check issue in ChainLockSigner::Cleanup() method but deferred it to follow-up, consistent with the pattern of avoiding scope creep in refactoring PRs.
Applied to files:
src/llmq/signing_shares.cpp
📚 Learning: 2025-11-24T19:52:02.686Z
Learnt from: UdjinM6
Repo: dashpay/dash PR: 7004
File: src/llmq/signing_shares.cpp:201-204
Timestamp: 2025-11-24T19:52:02.686Z
Learning: In Dash Core, assertions cannot be disabled during compilation. The codebase always has assertions enabled, so concerns about release builds continuing past assert statements are not applicable.
Applied to files:
src/llmq/signing_shares.cpp
📚 Learning: 2025-07-17T15:48:29.418Z
Learnt from: kwvg
Repo: dashpay/dash PR: 6752
File: src/wallet/load.cpp:164-164
Timestamp: 2025-07-17T15:48:29.418Z
Learning: In Dash Core, kwvg prefers using assert() statements to document expected behavior in code paths, especially when the called function also has internal asserts. This is used as a defensive programming pattern rather than requiring conditional error handling.
Applied to files:
src/llmq/signing_shares.cpp
📚 Learning: 2025-11-24T16:41:22.457Z
Learnt from: CR
Repo: dashpay/dash PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-11-24T16:41:22.457Z
Learning: Dash extends Bitcoin Core through composition with minimal changes to the Bitcoin Core foundation
Applied to files:
src/llmq/signing_shares.cpp
📚 Learning: 2025-11-24T16:41:22.457Z
Learnt from: CR
Repo: dashpay/dash PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-11-24T16:41:22.457Z
Learning: Applies to src/**/*.{cpp,h,hpp,cc} : Dash Core implementation must be written in C++20, requiring at least Clang 16 or GCC 11.1
Applied to files:
src/llmq/signing_shares.cpp
📚 Learning: 2025-11-04T18:24:27.241Z
Learnt from: UdjinM6
Repo: dashpay/dash PR: 6933
File: src/llmq/utils.cpp:284-298
Timestamp: 2025-11-04T18:24:27.241Z
Learning: In consensus-critical code (such as quorum formation, block validation, or deployment activation logic), do not suggest changes to the logic itself even if the implementation appears theoretically incorrect or off-by-one. Consensus rules, once deployed on the Dash network, must be preserved exactly to avoid network forks. Refactoring PRs should maintain perfect behavioral equivalence. Only suggest logic changes if explicitly accompanied by a DIP (Dash Improvement Proposal) or if the maintainer indicates the consensus rule needs to be changed with appropriate activation logic.
Applied to files:
src/llmq/signing_shares.cpp
📚 Learning: 2025-10-05T20:38:28.457Z
Learnt from: knst
Repo: dashpay/dash PR: 6871
File: contrib/guix/libexec/build.sh:358-360
Timestamp: 2025-10-05T20:38:28.457Z
Learning: In the Dash repository, when backporting code from Bitcoin Core, typos and minor issues in comments should be kept as-is to reduce merge conflicts in future backports, even if they remain unfixed in Bitcoin Core's master branch.
Applied to files:
src/llmq/signing_shares.cpp
📚 Learning: 2025-08-01T07:46:37.840Z
Learnt from: UdjinM6
Repo: dashpay/dash PR: 6786
File: ci/test/04_install.sh:99-101
Timestamp: 2025-08-01T07:46:37.840Z
Learning: In backport PRs like #6786, UdjinM6 prefers to defer non-critical fixes (such as shell command expansion issues) to separate commits/PRs to maintain focus on the primary backport objectives, consistent with the project's pattern of avoiding scope creep.
Applied to files:
src/llmq/signing_shares.cpp
📚 Learning: 2025-07-15T14:53:04.819Z
Learnt from: knst
Repo: dashpay/dash PR: 6691
File: src/test/llmq_params_tests.cpp:148-151
Timestamp: 2025-07-15T14:53:04.819Z
Learning: In the Dash Core LLMQ implementation, signingActiveQuorumCount is never 0 in the actual parameters defined in params.h, making division by zero scenarios unrealistic in the max_cycles() function.
Applied to files:
src/llmq/signing_shares.cpp
📚 Learning: 2025-06-09T16:43:20.996Z
Learnt from: kwvg
Repo: dashpay/dash PR: 6718
File: test/functional/test_framework/test_framework.py:2102-2102
Timestamp: 2025-06-09T16:43:20.996Z
Learning: In the test framework consolidation PR (#6718), user kwvg prefers to limit functional changes to those directly related to MasternodeInfo, avoiding scope creep even for minor improvements like error handling consistency.
Applied to files:
src/llmq/signing_shares.cpp
📚 Learning: 2025-08-19T14:57:31.801Z
Learnt from: knst
Repo: dashpay/dash PR: 6692
File: src/llmq/blockprocessor.cpp:217-224
Timestamp: 2025-08-19T14:57:31.801Z
Learning: In PR #6692, knst acknowledged a null pointer dereference issue in ProcessBlock() method where LookupBlockIndex may return nullptr but is passed to gsl::not_null, and created follow-up PR #6789 to address it, consistent with avoiding scope creep in performance-focused PRs.
Applied to files:
src/llmq/signing_shares.cpp
📚 Learning: 2025-10-13T12:37:12.357Z
Learnt from: knst
Repo: dashpay/dash PR: 6883
File: src/rpc/rawtransaction.cpp:1088-1125
Timestamp: 2025-10-13T12:37:12.357Z
Learning: In backport pull requests (especially from Bitcoin Core), treat "moved" or refactored code as out-of-scope for content-level review. Focus validation on verifying that code is moved correctly: no fields added, no fields removed, no fields reordered, and no unexpected changes beyond whitespace adjustments. Pre-existing issues in the upstream code should be preserved to maintain fidelity to the original implementation.
Applied to files:
src/llmq/signing_shares.cpp
🧬 Code graph analysis (1)
src/llmq/signing_shares.cpp (1)
src/llmq/context.cpp (4)
`Start` (50-55), `Start` (50-50), `Stop` (57-62), `Stop` (57-57)
🔇 Additional comments (1)
src/llmq/signing_shares.cpp (1)
1623-1695: Housekeeping and dispatcher threading model looks sound and race-free

The split into:

- `HousekeepingThreadMain()` doing `RemoveBannedNodeStates()`, `SendMessages()`, and `Cleanup()` with a 100 ms sleep (lines 1623-1631), and
- `WorkDispatcherThreadMain()` looping over `DispatchPendingSigns()` and `DispatchPendingProcessing()` with a 10 ms sleep (lines 1634-1645),

is consistent with the existing locking scheme:

- All accesses to `nodeStates`, `pendingIncomingSigShares`, and related maps remain protected by `cs` inside `DispatchPendingProcessing()` and the `ProcessPendingSigShares*` helpers (lines 1669-1677, 1689-1693).
- The signing queue `pendingSigns` is only touched under `cs_pendingSigns` in `DispatchPendingSigns()` and `AsyncSign()`, avoiding races between producers and the dispatcher (lines 1651-1655, 1719-1721).
- `ProcessPendingSigSharesLoop()` loops on `ProcessPendingSigShares()` until it reports no additional work (lines 1687-1693), which matches the existing batching semantics while allowing multiple workers to process batches in parallel.

Given the existing lock usage and the one-way signaling via `workInterrupt`, the new thread/dispatcher wiring does not introduce obvious data races and should give the intended parallelism for BLS verification and signing.
```diff
 void CSigSharesManager::Start()
 {
-    // can't start new thread if we have one running already
-    if (workThread.joinable()) {
+    // can't start if threads are already running
+    if (housekeepingThread.joinable() || dispatcherThread.joinable()) {
         assert(false);
     }

-    workThread = std::thread(&util::TraceThread, "sigshares", [this] { WorkThreadMain(); });
+    // Initialize worker pool
+    int workerCount = std::clamp(static_cast<int>(std::thread::hardware_concurrency() / 2), 1, 4);
+    workerPool.resize(workerCount);
+    RenameThreadPool(workerPool, "sigsh-work");
+
+    // Start housekeeping thread
+    housekeepingThread = std::thread(&util::TraceThread, "sigsh-maint",
+                                     [this] { HousekeepingThreadMain(); });
+
+    // Start dispatcher thread
+    dispatcherThread = std::thread(&util::TraceThread, "sigsh-dispat",
+                                   [this] { WorkDispatcherThreadMain(); });
 }

-void CSigSharesManager::StopWorkerThread()
+void CSigSharesManager::Stop()
 {
     // make sure to call InterruptWorkerThread() first
     if (!workInterrupt) {
         assert(false);
     }

-    if (workThread.joinable()) {
-        workThread.join();
+    // Stop worker pool
+    workerPool.clear_queue();
+    workerPool.stop(true);
+
+    // Join threads
+    if (housekeepingThread.joinable()) {
+        housekeepingThread.join();
+    }
+
+    if (dispatcherThread.joinable()) {
+        dispatcherThread.join();
     }
 }
```
🧩 Analysis chain
Clarify Start/Stop one‑shot semantics and consider workerPool shutdown ordering
- Using `assert(false)` when threads are already running (line 201) or when `Stop()` is called without a prior `InterruptWorkerThread()` (line 223) is fine in Dash Core, since assertions are always enabled and will terminate execution on misuse. No additional `return` is needed here. Based on learnings, this is an accepted defensive pattern rather than recoverable error handling.
- `workInterrupt` is reset only in the constructor. After a normal `InterruptWorkerThread()` + `Stop()` cycle, its flag remains set. A subsequent `Start()` would launch housekeeping/dispatcher threads whose loops immediately exit because `while (!workInterrupt)` is false. If Start/Stop are intended to be one-shot per lifetime (which matches typical usage via `LLMQContext`), it could help future readers to either:
  - Document this explicitly in a comment, or
  - Add an assertion that `Start()` is not called after `Stop()`, or
  - (If restarts are desired later) reset `workInterrupt` in `Start()` or after joining in `Stop()`.
- `Stop()` currently clears and stops the worker pool before joining `housekeepingThread`/`dispatcherThread`. While this is probably safe with the current `ctpl::thread_pool` implementation, joining the dispatcher/housekeeping threads first and then stopping the pool would make the lifetime relationship explicit (no more `push()` calls can race with `stop(true)`).

This is not a correctness blocker, but tightening/documenting the lifecycle would make the multithreaded setup easier to reason about and safer to evolve.
🌐 Web query:
Is `ctpl::thread_pool` documented as safe to call `push()` on concurrently with `stop(true)` and after `stop(true)` has been invoked?
💡 Result:
Short answer: No — CTPL does not document push() as safe to call concurrently with or after stop(true). The implementation shows no check that rejects pushes after stop/isDone, so pushes made while/after stop(true) can be lost or never executed.
Evidence:
- The push() implementation unconditionally q.push(...) then notifies threads (no isStop/isDone check). [1].
- stop(true) sets isDone, notifies, joins and then clears the queue and destroys threads; after stop(true) returns there are no worker threads to run later pushes. [1].
- The README/README.md contains no API guarantee about thread-safety of push vs stop. [2].
Conclusion: avoid calling push() after or concurrently with stop(true). If you need that behavior, protect calls with your own synchronization or modify CTPL to reject pushes when stopping.
Sources:
[1] ctpl.h (implementation) — vit-vit/CTPL.
[2] CTPL README — vit-vit/CTPL.
I need to verify whether the housekeeping or dispatcher threads actually call push() on the worker pool. This will determine if the current shutdown order in Stop() poses a real race condition.
🏁 Script executed:
```shell
#!/bin/bash
# Search for HousekeepingThreadMain and WorkDispatcherThreadMain definitions,
# and check if they or called functions invoke push() on workerPool
rg -n "HousekeepingThreadMain|WorkDispatcherThreadMain" src/llmq/signing_shares.cpp -A 30
```

Length of output: 2229
🏁 Script executed:
```shell
#!/bin/bash
# Search for all push() calls within DispatchPendingSigns, ProcessPendingSigSharesLoop,
# and related dispatch/housekeeping functions to confirm if workerPool.push is used
rg -nP "workerPool\s*\.\s*push" src/llmq/signing_shares.cpp -B 2 -A 2
```

Length of output: 378
🏁 Script executed:
```shell
#!/bin/bash
# Check what functions are called from within ProcessPendingSigSharesLoop
# to trace if any of them indirectly call push()
rg -n "ProcessPendingSigSharesLoop|ProcessPendingSigShares" src/llmq/signing_shares.cpp -A 20
```

Length of output: 3231
🏁 Script executed:
```shell
#!/bin/bash
# Look for the complete DispatchPendingSigns function and any push() calls within it
rg -n "DispatchPendingSigns" src/llmq/signing_shares.cpp -A 40
```

Length of output: 1557
🏁 Script executed:
```shell
#!/bin/bash
# Check Dispatch* helper functions that may call push()
rg -n "Dispatch" src/llmq/signing_shares.cpp -B 1 -A 15 | head -100
```

Length of output: 2096
🏁 Script executed:
```shell
#!/bin/bash
# Get function signatures and context around workerPool usage
ast-grep --pattern 'workerPool.$_' src/llmq/signing_shares.cpp
```

Length of output: 354
🏁 Script executed:
```shell
#!/bin/bash
# Search CTPL documentation or source for any thread-safety notes
# First, check if CTPL is vendored or included in the repo
find . -name "ctpl*" -o -name "CTPL*" 2>/dev/null | head -20
```

Length of output: 16
Critical: Unsafe thread shutdown order causes lost work tasks
The current Stop() order violates CTPL's documented behavior and creates a race condition where queued tasks are lost:

- `DispatchPendingProcessing()` (line 1681) calls `workerPool.push()` without checking `workInterrupt`, unlike `DispatchPendingSigns()`.
- `Stop()` calls `clear_queue()` + `stop(true)` (lines 228–229) before joining the dispatcher thread (line 237).
- When `InterruptWorkerThread()` is called, the dispatcher thread may still be in `DispatchPendingProcessing()` and call `push()`. Since the pool is already stopped, that work is lost.
Fix: Join the threads before stopping the pool to ensure no push() calls can race with stop(true):
```cpp
void CSigSharesManager::Stop()
{
    if (!workInterrupt) {
        assert(false);
    }

    // Join threads FIRST to stop any pending push() calls
    if (housekeepingThread.joinable()) {
        housekeepingThread.join();
    }
    if (dispatcherThread.joinable()) {
        dispatcherThread.join();
    }

    // Then stop worker pool (now safe, no more push() calls)
    workerPool.clear_queue();
    workerPool.stop(true);
}
```

Additionally, add `if (workInterrupt) return;` to `DispatchPendingProcessing()` before the `push()` call (line 1681) to match `DispatchPendingSigns()` and catch the interrupt earlier.
🤖 Prompt for AI Agents
In src/llmq/signing_shares.cpp around lines 199–239, Stop() currently clears and
stops the workerPool before joining the dispatcher/housekeeping threads which
can race with dispatcher calls to workerPool.push() and drop tasks; change
Stop() to join housekeepingThread and dispatcherThread first (if joinable) and
only after both are joined call workerPool.clear_queue() and
workerPool.stop(true). Also update DispatchPendingProcessing() (around line
~1681) to check if (workInterrupt) return; immediately before calling
workerPool.push() so it mirrors DispatchPendingSigns() and prevents pushing
after interrupt.
I guess it makes sense... Thoughts @PastaPastaPasta ?
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
This pull request has conflicts, please rebase.
Issue being fixed or feature implemented
I don't have good data on whether this results in an improvement. However, if sigShare BLS signing ever takes more than a full thread, we will be bottlenecked.
This PR addresses two concerns:

- It should prevent the network sending or cleanup logic from slowing down the creation of sigShares.
- It should enable scaling sigShare creation past one full thread; it can currently use up to half of the available cores for sigShare signing.
How Has This Been Tested?
This hasn't been tested; please review, but I'll want to deploy this on testnet before merging.
Breaking Changes
Checklist:
Go over all the following points, and put an `x` in all the boxes that apply.