sweep: fix expected spending events being missed #10060

Draft
wants to merge 5 commits into
base: master
from

Conversation

yyforyongyu
Member

Fixes issue #10051. What happened there was:

  • a sweeping tx was created during a restart, which put CommitmentAnchor and CommitmentTimeLock in the same group, even though the anchor had already been spent.
  • This is usually fine, as we would detect that the anchor input is spent and retry sweeping the to_local output.

However, from the logs there, the spending event was not delivered quickly enough here, causing us to think the anchor input had not been spent:

lnd/sweep/fee_bumper.go

Lines 1433 to 1453 in ea32aac

	// Do a non-blocking read to see if the output has been spent.
	select {
	case spend, ok := <-spendEvent.Spend:
		if !ok {
			log.Debugf("Spend ntfn for %v canceled", op)
			continue
		}
		spendingTx := spend.SpendingTx
		log.Debugf("Detected spent of input=%v in tx=%v", op,
			spendingTx.TxHash())
		spentInputs[op] = spendingTx
	// Move to the next input.
	default:
		log.Tracef("Input %v not spent yet", op)
	}
}

We now fix it by calling GetUtxo first to check whether a given input is spent. If it is, we then do a blocking read on the spend notification to receive the spending event.
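
For illustration only, the check-then-block flow can be sketched roughly as below. This is not the code in this PR: `spentInputs`, `isUnspent`, `register` and `demo` are hypothetical stand-ins for the real UTXO lookup and spend-notification registration, and inputs are plain strings rather than outpoints.

```go
package main

import (
	"fmt"
	"time"
)

// spentInputs returns the spending txid of every input that is no longer in
// the UTXO set. isUnspent and register are hypothetical stand-ins for the
// UTXO lookup and the spend-notification registration.
func spentInputs(inputs []string, isUnspent func(string) bool,
	register func(string) <-chan string,
	timeout time.Duration) map[string]string {

	spent := make(map[string]string)
	for _, in := range inputs {
		// Still in the UTXO set: skip the subscription entirely.
		if isUnspent(in) {
			continue
		}

		// Not in the UTXO set, so a spend event is expected. Block
		// until it arrives, bounded by the timeout.
		select {
		case txid, ok := <-register(in):
			if ok {
				spent[in] = txid
			}
		case <-time.After(timeout):
			fmt.Printf("no spend event for %s within %v\n", in, timeout)
		}
	}

	return spent
}

// demo runs the check against a notifier that delivers its event late.
func demo() map[string]string {
	isUnspent := func(in string) bool { return in == "to_local" }
	register := func(in string) <-chan string {
		ch := make(chan string, 1)
		go func() {
			// Deliver asynchronously, as a real notifier might.
			time.Sleep(10 * time.Millisecond)
			ch <- "tx-spending-" + in
		}()
		return ch
	}

	return spentInputs(
		[]string{"anchor", "to_local"}, isUnspent, register, time.Second,
	)
}

func main() {
	fmt.Println(demo())
}
```

In this sketch a non-blocking read would miss the anchor spend because the event arrives 10ms late, while the bounded blocking read still picks it up.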

@yyforyongyu yyforyongyu added this to the v0.20.0 milestone Jul 9, 2025
@yyforyongyu yyforyongyu self-assigned this Jul 9, 2025

Warning

Gemini encountered an error creating the summary. You can try again by commenting /gemini summary.

@yyforyongyu
Member Author

/gemini summary

Summary of Changes

This pull request aims to resolve a critical issue in the sweep module where expected spending events for transaction inputs could be missed, particularly after a restart. The core change involves enhancing the input spending detection mechanism by first querying the blockchain for UTXO status. If an input is found to be already spent, the system now expects an immediate notification, improving the reliability of the sweeping process and preventing erroneous retries.

Highlights

  • Spending Event Reliability: Introduced a proactive check using GetUtxo to determine if a sweep input is already spent before registering for a spend notification. This addresses a race condition where spending events might be missed during restarts, leading to incorrect assumptions about input status.
  • Immediate Notification Expectation: Modified the getSpentInputs logic to expect an immediate spending event notification if GetUtxo indicates an input is already spent. A timeout has been added to log an error if this expectation is not met, highlighting potential issues with the notification system.
  • Dependency Injection: Added ChainIO (an abstraction for blockchain queries) to TxPublisherConfig, allowing the fee_bumper to directly query UTXO status.
  • Test Coverage & Fixes: Updated various test cases in fee_bumper_test.go to mock the new GetUtxo behavior, ensuring the logic for handling spent and unspent inputs is correctly tested. Several minor typos in comments and test function names were also corrected.

Changelog

  • sweep/fee_bumper.go
    • Added time package import and defined spentNotificationTimeout constant.
    • Corrected typo from sentinalEvent to sentinelEvent.
    • Introduced ChainIO lnwallet.BlockChainIO to TxPublisherConfig for blockchain query capabilities.
    • Implemented a GetUtxo check for each input in getSpentInputs to determine if it's already spent before registering a spend notification.
    • Modified the select statement for spend events to perform a blocking read with a timeout, expecting immediate notification if GetUtxo indicates the input is spent.
    • Corrected various minor typos in comments (e.g., differetly to differently, suing to using).
  • sweep/fee_bumper_test.go
    • Imported lnmock package for mocking chain operations.
    • Updated mockers struct and createTestPublisher to include and initialize MockChain for ChainIO.
    • Corrected typo from sentinalEvent to sentinelEvent in TestBumpResultValidate.
    • Renamed test functions TestCreateAnPublishFail to TestCreateAndPublishFail and TestCreateAnPublishSuccess to TestCreateAndPublishSuccess for consistency.
    • Adjusted TestProcessRecordsInitial and TestProcessRecordsFeeBump to mock GetUtxo returning an unspent UTXO, skipping spend notification registration.
    • Adjusted TestProcessRecordsInitialSpent, TestProcessRecordsConfirmed, and TestProcessRecordsSpent to mock GetUtxo returning a spent UTXO, proceeding with spend notification registration.
    • Updated TestHasInputsSpent to include GetUtxo mocks for various input states and changed SignDesc mock expectation from Once() to Twice() for one input due to the new GetUtxo call.
    • Corrected various minor typos in comments (e.g., alaways to always).

Activity

  • The Gemini bot encountered an error while attempting to create a summary (Comment 1).
  • The author, @yyforyongyu, re-triggered the Gemini bot for a summary (Comment 2).
  • There has been no human review activity or comments on this pull request yet.

@yyforyongyu
Member Author

/gemini review

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a fix to prevent missing expected spending events by checking UTXOs before registering spend notifications. The changes include adding a GetUtxo call, handling its results, and updating tests. I've provided feedback on potential panic scenarios and logging improvements.

@saubyk saubyk moved this to In progress in lnd v0.20 Jul 9, 2025
@yyforyongyu
Member Author

/gemini review

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request fixes a race condition where a spent input could be missed by the sweeper. The use of GetUtxo to verify the spent status is a good solution, and the test updates are comprehensive. The timeout handling in getSpentInputs can be improved for better correctness and resilience.

@yyforyongyu
Member Author

/gemini review

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request addresses a bug where spending events for swept inputs could be missed, by first checking the UTXO status with GetUtxo. The logic is sound and the implementation is robust, including a timeout for spend notifications. The test suite has been updated accordingly. My main feedback is to adjust the log severity for certain error conditions from Critical to Warn to better reflect their nature and avoid unnecessary operational alerts.

@yyforyongyu yyforyongyu force-pushed the fix-sweep branch 3 times, most recently from d15e867 to f652051 Compare July 10, 2025 00:49
We now first check whether a given input is found in the confirmed
utxo set. When it is found, we can skip waiting for its spending
notification.
This commit makes the reading of spend events blocking. This is
needed to make sure we won't miss a spend event for a spent input. Given
that a spend event is returned immediately when an input is spent, this
read doesn't actually block, since by this point we know for sure via
the `GetUtxo` check that the input has been spent.
Contributor

@Abdulkbk Abdulkbk left a comment

Nice, had an initial pass and left some questions

@@ -1415,6 +1420,38 @@ func (t *TxPublisher) getSpentInputs(
"%v", op, heightHint)
}

// Check whether the input has been spent or not.
Contributor

So the GetUtxo call is probably just added here to save us time? I ask because I noticed RegisterSpendNtfn also checks this internally.

Member Author

yeah correct, it creates a shortcut here so we don't need to make unnecessary subscriptions. We only attempt to subscribe for spending when we know it's not in the utxo set, which means either the input has been spent or it's an orphan.

@@ -1424,7 +1461,7 @@ func (t *TxPublisher) getSpentInputs(
log.Criticalf("Failed to register spend ntfn for "+
"input=%v: %v", op, err)

-return nil
+return spentInputs
Contributor

So initially we returned nil, and at the two call sites of this method there is a check on the length of the result, if len(spends) == 0 {. That would have caused LND to panic, right?

A follow-up question is: what happens when we have multiple inputs (I guess that's a possibility), and one fails? Does that affect where we call the method since no error will be returned, and the only check I see is for the length of the returned result?

Member Author

Returning nil here returns a nil map, which is the zero value of a map type, so calling len on it returns 0 and won't panic.
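
A minimal standalone sketch of that behavior, for reference (`emptySpends` is a hypothetical stand-in, not the lnd method):

```go
package main

import "fmt"

// emptySpends mimics a function that bails out early by returning nil for a
// map return type.
func emptySpends() map[string]string {
	return nil
}

func main() {
	spends := emptySpends()

	// len and lookups are safe on a nil map; only writes would panic.
	fmt.Println(spends == nil, len(spends)) // prints "true 0"

	_, ok := spends["missing"]
	fmt.Println(ok) // prints "false"
}
```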

what happens when we have multiple inputs (I guess that's a possibility), and one fails?

What do you mean one fails? If there's a failure here, then we'd shut down lnd due to Criticalf.

Contributor

What do you mean one fails? If there's a failure here, then we'd shut down lnd due to Criticalf.

Ah, I now understand that Criticalf sends a shutdown request after logging the error.

Collaborator

@bitromortac bitromortac left a comment

Looks good on initial pass 🙏

m.chainIO.On("GetUtxo",
&op, inp.SignDesc().Output.PkScript, inp.HeightHint(),
mock.Anything,
).Return(&wire.TxOut{}, nil).Once()

// Create a monitor record that's not confirmed. We know it's not
// confirmed because the `SpendEvent` is empty.
Collaborator

is this comment now misleading?

m.chainIO.On("GetUtxo",
&op, inp.SignDesc().Output.PkScript, inp.HeightHint(),
mock.Anything,
).Return(nil, nil).Once()
Collaborator

would it be useful to also return btcwallet.ErrOutputSpent for more realistic testing?

@saubyk saubyk linked an issue Jul 22, 2025 that may be closed by this pull request
@saubyk saubyk removed this from lnd v0.20 Jul 22, 2025
@saubyk saubyk requested a review from Roasbeef July 22, 2025 16:55
// is spent or not. A better approach is to implement a new
// synchronous method to check for spending, which should be
// attempted when implementing SQL into btcwallet.
case <-time.After(spentNotificationTimeout):
Collaborator

I think the assumption here is not quite right. A spend event from RegisterSpendNtfn may arrive only much later, since it may be doing a historical rescan for the output (done from the current height back to the height hint, which can take a long time if the node was offline for some time and a force close happened in between). The same holds for the call in monitorSpend; I'm not sure whether a long delay between publish and spend notification is problematic for the sweeper.

Why do we need the spending transactions here? It looks like they are only used for logging/sanity checks, right? The docstring on r.spentInputs also seems misleading, because all the spends may have come from the sweep transaction, I think.

@@ -1415,6 +1420,38 @@ func (t *TxPublisher) getSpentInputs(
"%v", op, heightHint)
}

// Check whether the input has been spent or not.
utxo, err := t.cfg.ChainIO.GetUtxo(
Member

Hmm, will this also populate the spend cache for neutrino backends? Otherwise, this can be a very expensive filter rescan depending on how far back they are.

In other words, this'll block for neutrino backends. We would need to check the behavior with backends that have the txindex off.

Member

What about just moving back to the spend channel/goroutine? That way it's always active, always watching, and we can handle the notification async when needed.

It would allow us to remove all these other default select cases for spend ntfns. I recall I pointed out a possibility of missed events when this change was originally added.

Member Author

Hmm, will this also populate the spend cache for neutrino backends? Otherwise, this can be a very expensive filter rescan depending on how far back they are.

FWIW this is already used in RegisterSpendNtfn implemented in neutrino,

spendReport, err := n.p2pNode.GetUtxo(

What about just moving back to the spend channel/goroutine? That way it's always active, always watching, and we can handle the notification async when needed.

Can try that route, meanwhile there's #10117 that fixes this issue using an alternative approach. I will see if it's possible to make a new sync method when implementing SQL into btcwallet.

@saubyk saubyk modified the milestones: v0.20.0, v0.19.3 Jul 29, 2025
@morehouse
Collaborator

Oof. So basically this comment is wrong:

lnd/sweep/fee_bumper.go

Lines 1418 to 1422 in ea32aac

// If the input has already been spent after the height hint, a
// spend event is sent back immediately.
spendEvent, err := t.cfg.Notifier.RegisterSpendNtfn(
&op, inp.SignDesc().Output.PkScript, heightHint,
)

And the sending of the spend event is actually racy. This probably has broader implications than just this one piece of code -- IIRC this pattern is used in other places too.

Can we change RegisterSpendNtfn to have the desired behavior? For the sweeper, we really need a way to query for spent inputs synchronously.

@ziggie1984 ziggie1984 added the P0 very high priority issue/PR, blocker on all others label Jul 30, 2025
@yyforyongyu
Member Author

And the sending of the spend event is actually racy. This probably has broader implications than just this one piece of code -- IIRC this pattern is used in other places too.

Yeah it's also manifested in the itest, for instance here,

func flakeTxNotifierNeutrino(ht *lntest.HarnessTest) {

and here,

func flakeRaceInBitcoinClientNotifications(ht *lntest.HarnessTest) {

Basically the block event and spend event are async. Previously there was an attempt to make them sync in blockbeat. The idea is that when a block height is received, we can directly fetch more info about the block, such as the inputs spent, making the whole flow linear. Yet there were some challenges when implementing it for neutrino, since that would mean we need to fetch every block. I think we can dig deeper to see how to make it work. Meanwhile, as I'm working on SQLizing btcwallet, I will also see if there's an efficient way to implement a synchronous method that fetches the spending txns.

Will put this PR in draft now, as #10117 should fix this issue.

@yyforyongyu yyforyongyu marked this pull request as draft July 31, 2025 06:58
@saubyk saubyk modified the milestones: v0.19.3, v0.20.0 Jul 31, 2025
@saubyk saubyk added this to lnd v0.20 Jul 31, 2025
@saubyk saubyk moved this to Backlog in lnd v0.20 Jul 31, 2025
@Roasbeef
Member

Roasbeef commented Aug 5, 2025

IMO we should just go back to the dedicated spend detection goroutine, with a goroutine per input that sends the spend event into the main channel: #10060 (comment).

It is true that the recv there will be instant, and not fall through to the default, but only if the channel has already been sent on before we enter that case.

Going back to dedicated goroutines to make sure all the spends are acted upon layers on the least amount of assumptions.
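
As a rough illustration of the goroutine-per-input idea (a sketch under assumptions, not lnd code: `spendEvent`, `watchSpends` and `collectSpends` are hypothetical names, and inputs and txids are plain strings), each watcher blocks on its own notification channel and forwards into one shared channel:

```go
package main

import (
	"fmt"
	"sync"
)

// spendEvent pairs an input with the transaction that spent it.
type spendEvent struct {
	input string
	txid  string
}

// watchSpends starts one goroutine per input. Each goroutine blocks on its
// own notification channel and forwards the spend into a shared channel,
// which is closed once every watcher has exited.
func watchSpends(ntfns map[string]<-chan string,
	quit <-chan struct{}) <-chan spendEvent {

	out := make(chan spendEvent)

	var wg sync.WaitGroup
	for input, ch := range ntfns {
		wg.Add(1)
		go func(input string, ch <-chan string) {
			defer wg.Done()

			select {
			case txid, ok := <-ch:
				if !ok {
					// Notification canceled.
					return
				}
				select {
				case out <- spendEvent{input: input, txid: txid}:
				case <-quit:
				}
			case <-quit:
			}
		}(input, ch)
	}

	go func() {
		wg.Wait()
		close(out)
	}()

	return out
}

// collectSpends wires up two inputs: one already spent, one canceled.
func collectSpends() map[string]string {
	anchor := make(chan string, 1)
	anchor <- "tx1"

	toLocal := make(chan string)
	close(toLocal)

	quit := make(chan struct{})
	defer close(quit)

	events := watchSpends(map[string]<-chan string{
		"anchor":   anchor,
		"to_local": toLocal,
	}, quit)

	spent := make(map[string]string)
	for ev := range events {
		spent[ev.input] = ev.txid
	}

	return spent
}

func main() {
	fmt.Println(collectSpends())
}
```

The consumer never polls with a default case here, so a late notification is delayed rather than missed.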

@Roasbeef
Member

Roasbeef commented Aug 5, 2025

I took a look at #10117, it doesn't appear to resolve this overarching issue of potentially missed spends with a default select case.
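
A small standalone sketch of why a select with a default case can miss an event (illustrative only; `tryRecv` and `beforeAndAfter` are hypothetical helpers): the receive only succeeds if the value is already in the channel when the select runs.

```go
package main

import "fmt"

// tryRecv performs a non-blocking receive, mirroring a select with a default
// case.
func tryRecv(ch <-chan string) (string, bool) {
	select {
	case v := <-ch:
		return v, true
	default:
		return "", false
	}
}

// beforeAndAfter reports whether a non-blocking read sees the event before
// and after it has been sent.
func beforeAndAfter() (bool, bool) {
	ch := make(chan string, 1)

	// Nothing has been sent yet, so the default case is taken.
	_, before := tryRecv(ch)

	ch <- "spend"

	// The value is now buffered, so the receive case is taken.
	_, after := tryRecv(ch)

	return before, after
}

func main() {
	before, after := beforeAndAfter()
	fmt.Println(before, after) // prints "false true"
}
```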

@yyforyongyu
Member Author

I took a look at #10117, it doesn't appear to resolve this overarching issue of potentially missed spends with a default select case.

This case is primarily built for detecting a 3rd party anchor spend when the anchor is grouped with other inputs. Given that the anchor is not grouped, we should not hit this case here.

IMO we should just go back to the dedicated spend detection goroutine

The issue is that it doesn't fit the current TxPublisher, so a refactor is needed to make it happen. I think instead of making any kind of assumptions, we can just extend blockbeat to return the block info, or provide a callback to fetch block info. Given that we are already receiving the block height for every block, I find it redundant to subscribe and then wait for a spending event.
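
To make the idea concrete, here is a hedged sketch of what a synchronous check against block contents could look like. It does not reflect the existing blockbeat interface: `tx`, `spendsInBlock` and `demoBlock` are hypothetical, and inputs are identified by plain strings.

```go
package main

import "fmt"

// tx is a simplified transaction: an id plus the outpoints it spends.
type tx struct {
	id     string
	inputs []string
}

// spendsInBlock scans the transactions of a single block and returns, for
// each watched outpoint spent in that block, the id of the spending tx.
func spendsInBlock(txns []tx, watched map[string]bool) map[string]string {
	spent := make(map[string]string)
	for _, t := range txns {
		for _, in := range t.inputs {
			if watched[in] {
				spent[in] = t.id
			}
		}
	}

	return spent
}

// demoBlock checks two watched inputs against a block that spends only one
// of them.
func demoBlock() map[string]string {
	block := []tx{
		{id: "tx1", inputs: []string{"anchor", "other"}},
		{id: "tx2", inputs: []string{"unrelated"}},
	}
	watched := map[string]bool{"anchor": true, "to_local": true}

	return spendsInBlock(block, watched)
}

func main() {
	fmt.Println(demoBlock())
}
```

With the block contents in hand, the result is known as soon as the block is processed, with no subscription or waiting involved. As noted above, the cost is fetching every block, which is the hard part for neutrino.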

Labels
bug fix P0 very high priority issue/PR, blocker on all others utxo sweeping
Projects
Status: Backlog
Development

Successfully merging this pull request may close these issues.

7 participants