-
Notifications
You must be signed in to change notification settings - Fork 2.2k
sweep: fix expected spending events being missed #10060
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Warning Gemini encountered an error creating the summary. You can try again by commenting |
/gemini summary |
Summary of ChangesThis pull request aims to resolve a critical issue in the sweep module where expected spending events for transaction inputs could be missed, particularly after a restart. The core change involves enhancing the input spending detection mechanism by first querying the blockchain for UTXO status. If an input is found to be already spent, the system now expects an immediate notification, improving the reliability of the sweeping process and preventing erroneous retries. Highlights
Changelog
Activity
|
/gemini review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a fix to prevent missing expected spending events by checking UTXOs before registering spend notifications. The changes include adding a GetUtxo
call, handling its results, and updating tests. I've provided feedback on potential panic scenarios and logging improvements.
/gemini review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request fixes a race condition where a spent input could be missed by the sweeper. The use of GetUtxo
to verify the spent status is a good solution, and the test updates are comprehensive. The timeout handling in getSpentInputs
can be improved for better correctness and resilience.
/gemini review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request addresses a bug where spending events for swept inputs could be missed, by first checking the UTXO status with GetUtxo
. The logic is sound and the implementation is robust, including a timeout for spend notifications. The test suite has been updated accordingly. My main feedback is to adjust the log severity for certain error conditions from Critical
to Warn
to better reflect their nature and avoid unnecessary operational alerts.
d15e867
to
f652051
Compare
We now first check whether a given input has been found in the confirmed utxo set or not. When it's found, we can skip waiting for its spending notification.
This commit now makes the reading of spent events blocking. This is needed to make sure we won't miss a spent event for a spent input. Given when an input is spent, a spent event is returned immediately, this reading actually doesn't block, as by this point, we know for sure the input has been spent via `GetUtxo` check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, had an initial pass and left some questions
@@ -1415,6 +1420,38 @@ func (t *TxPublisher) getSpentInputs( | |||
"%v", op, heightHint) | |||
} | |||
|
|||
// Check whether the input has been spent or not. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, the GetUtxo
call is probably just added here to save us time? because I noticed RegisterSpendNtfn
also checks this internally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah correct, it creates a shortcut here so we don't need to make unnecessary subscriptions. We only attempt to subscribe for spending when we know it's not in the utxo set, which means either the input has been spent or it's an orphan.
@@ -1424,7 +1461,7 @@ func (t *TxPublisher) getSpentInputs( | |||
log.Criticalf("Failed to register spend ntfn for "+ | |||
"input=%v: %v", op, err) | |||
|
|||
return nil | |||
return spentInputs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So initially we return nil
, and looking at the 2 instances this method is used, there is a check for the length of what was returned if len(spends) == 0 {
. That would have caused LND to panic, right?.
A follow-up question is: what happens when we have multiple inputs (I guess that's a possibility), and one fails? Does that affect where we call the method since no error will be returned, and the only check I see is for the length of the returned result?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Returning nil
here actually returns an empty map, so the nil
is actually a zero-value map, thus calling len
won't panic.
what happens when we have multiple inputs (I guess that's a possibility), and one fails?
What do you mean one fails? If there's a failure here, then we'd shut down lnd
due to Criticalf
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean one fails? If there's a failure here, then we'd shut down lnd due to Criticalf.
Ah, I now understand that Criticalf
sends a shutdown request after logging the error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good on initial pass 🙏
m.chainIO.On("GetUtxo", | ||
&op, inp.SignDesc().Output.PkScript, inp.HeightHint(), | ||
mock.Anything, | ||
).Return(&wire.TxOut{}, nil).Once() | ||
|
||
// Create a monitor record that's not confirmed. We know it's not | ||
// confirmed because the `SpendEvent` is empty. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this comment now misleading?
m.chainIO.On("GetUtxo", | ||
&op, inp.SignDesc().Output.PkScript, inp.HeightHint(), | ||
mock.Anything, | ||
).Return(nil, nil).Once() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would it be useful to also return btcwallet.ErrOutputSpent
for more realistic testing?
// is spent or not. A better approach is to implement a new | ||
// synchronous method to check for spending, which should be | ||
// attempted when implementing SQL into btcwallet. | ||
case <-time.After(spentNotificationTimeout): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the assumption here is not quite right, a spend event from RegisterSpendNtfn
may arrive only very much later, since it may be doing a historical rescan for the output (and that is done from the current height back to the height hint, which can take a long time if the node was offline for some time and a force close happened in between). The same holds for the call in monitorSpend
, not sure if that is problematic for the sweeper if there's a long delay between publish and spend notification.
Why do we need the spending transactions here, it looks like this is only used for logging/sanity checks, right? The docstring on r.spentInputs
seems to also be misleading because all the spends may have been from the sweep transaction, I think.
@@ -1415,6 +1420,38 @@ func (t *TxPublisher) getSpentInputs( | |||
"%v", op, heightHint) | |||
} | |||
|
|||
// Check whether the input has been spent or not. | |||
utxo, err := t.cfg.ChainIO.GetUtxo( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, will this also populate the spend cache for neutrino backends? Otherwise, this can be a very expensive filter rescan depending on how far back they are.
In other words, this'll block for neutrno backends. Would need to check for behavior with backends that have the txindex off.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about just moving back to the spend channel/goroutine? That way it's always active, always watching, and we can handle the notification async when needed.
It would allow us to remove all these other default select cases for spend ntfns. I recall I pointed out a possibility of missed events when this change was originally added.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, will this also populate the spend cache for neutrino backends? Otherwise, this can be a very expensive filter rescan depending on how far back they are.
FWIW this is already used in RegisterSpendNtfn
implemented in neutrino,
lnd/chainntnfs/neutrinonotify/neutrino.go
Line 864 in 4389067
spendReport, err := n.p2pNode.GetUtxo( |
What about just moving back to the spend channel/goroutine? That way it's always active, always watching, and we can handle the notification async when needed.
Can try that route, meanwhile there's #10117 that fixes this issue using an alternative approach. I will see if it's possible to make a new sync method when implementing SQL into btcwallet.
Oof. So basically this comment is wrong: Lines 1418 to 1422 in ea32aac
And the sending of the spend event is actually racy. This probably has broader implications than just this one piece of code -- IIRC this pattern is used in other places too. Can we change |
Yeah it's also manifested in the itest, for instance here, Line 41 in 23dd01c
and here, Line 137 in 23dd01c
Basically the block event and spend event are async. Previously there was an attempt to make them sync in Will put this PR in draft now, as #10117 should fix this issue. |
IMO we should just go back to the dedicated spend detection goroutine, with a goroutine per input that sends the spend event into the main channel: #10060 (comment). It is true that the recv there will be instant, and not fall through to the default, but only if the channel has already been sent on before we enter that case. Going back to dedicated goroutines to make sure all the spends are acted upon layers on the least amount of assumptions. |
I took a look at #10117, it doesn't appear to resolve this overarching issue of potentially missed spends with a default select case. |
This case is primarily built for detecting 3rd party anchor spend when it's grouped with other inputs, given that anchor is not grouped, we should not hit this case here.
The issue is that it doesn't fit the current |
Fix the issue #10051. What happened there was,
CommitmentAnchor
andCommitmentTimeLock
in the same group, while the anchor has already been spent.However, from the logs there, the spending event was not notified quickly enough here, causing us to think there's no spent of the anchor input,
lnd/sweep/fee_bumper.go
Lines 1433 to 1453 in ea32aac
We now fix it by calling
GetUtxo
first to check whether a given input is spent or not, and if it is, we will then do a block reading on the spending notification to receive a spending event.