Conversation

Collaborator

@ziggie1984 ziggie1984 commented Oct 8, 2025

Fixes #10259

The problem:

Timeline: Channel Disconnection and Reconnection Race Condition

Timeline: 0ms ────── 100ms ────── 800ms ────── 1100ms ────── 1300ms
          │          │           │            │             │
Channel:  [INACTIVE] → [PENDING] → [ACTIVE]   → [DISABLED]  → [ENABLED]
          │          │           │            │             │
Events:   Disconnect  Mark       Reconnect    Disable       Enable
          │          Pending     │            │             │
          │          │           │            │             │
          └──────────┼───────────┼────────────┼─────────────┘
                     │           │            │
Disable Timer:       └───────────┼────────────┘
                     (1000ms)    │
                                 │
Enable Timer:                    └────────────┼─────────────┘
                                 (500ms)     │
                                              │
                                              ❌ PROBLEM!


T=0ms:    Channel disconnects
T=100ms:  Marked as pending disabled → Disable timer starts (1000ms)
T=800ms:  Channel reconnects → Enable timer starts (500ms)
T=1100ms: Disable timer fires → Channel gets DISABLED ❌ although it is active
T=1300ms: Enable timer fires → Channel gets ENABLED

The timings here are examples and are configurable in LND; the defaults are:

20 minutes for the disableTimer
19 minutes for the enableTimer
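
For illustration, a minimal self-contained Go sketch of the race with the toy timings above (this is not the actual ChanStatusManager code, just a demonstration of the two competing timers):

package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var (
		mu     sync.Mutex
		active bool // whether the channel link is currently active
	)

	// T=100ms: the channel is marked pending-disabled; the disable timer
	// will fire 1000ms later, i.e. at T=1100ms.
	time.AfterFunc(1100*time.Millisecond, func() {
		mu.Lock()
		defer mu.Unlock()
		// Without re-checking activity here, the channel is disabled
		// even though it reconnected at T=800ms.
		fmt.Println("T=1100ms: disable timer fired, channel DISABLED, active =", active)
	})

	// T=800ms: the channel reconnects and the enable timer (500ms) starts.
	time.AfterFunc(800*time.Millisecond, func() {
		mu.Lock()
		active = true
		mu.Unlock()

		time.AfterFunc(500*time.Millisecond, func() {
			fmt.Println("T=1300ms: enable timer fired, channel ENABLED")
		})
	})

	time.Sleep(2 * time.Second)
}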

@ziggie1984 ziggie1984 force-pushed the fix-status-manager-disable-issue branch from 4e1de15 to 72322e2 Compare October 8, 2025 08:02
@ziggie1984 ziggie1984 self-assigned this Oct 8, 2025
@ziggie1984 ziggie1984 marked this pull request as ready for review October 8, 2025 08:15
@ziggie1984 ziggie1984 added this to the v0.20.0 milestone Oct 8, 2025
We make sure we do not disable a channel if the channel is active
in the switch and the enable timer hasn't fired yet during
reconnection.
@ziggie1984 ziggie1984 force-pushed the fix-status-manager-disable-issue branch from 72322e2 to 4c7572b Compare October 8, 2025 09:29
log.Debugf("Channel(%v) became active, canceling "+
"scheduled disable", outpoint)

m.chanStates.markEnabled(outpoint)
Member

Still getting through this PR, but want to make sure that we don't end up with inconsistent state here.

Member

E.g. what if, right after this check, the channel actually goes inactive? This is why we have the fallow period, where we wait for the state to be consistent before taking an action.

Member

An alternative here would be to just continue, i.e. not modify the state, and wait for the channel to be active long enough to flip in the other direction.

Collaborator Author

@ziggie1984 ziggie1984 Oct 8, 2025

Yes, we can do that as well, so the actual enable timer flips the channel correctly.

Collaborator Author

@gemini-code-assist what do you think about not changing the state here but rather letting the enableRequest flip it? That, however, would create an additional channel_update.

gemini-code-assist bot

Thanks for the question, @ziggie1984. This is a good point to consider regarding the state transitions and network traffic.

Let's weigh the two approaches:

  1. Current PR's approach (calling m.chanStates.markEnabled(outpoint)):

    • Pros: This immediately corrects the internal state of the channel within chanStates from pending disabled to enabled as soon as IsChannelActive returns true. This directly "cancels" the pending disable from this function's perspective, ensuring internal consistency. Given the disableInactiveChannels function's role, it's robust for it to explicitly manage the state when it detects a channel that should not be disabled.
    • Cons: As you noted, if markEnabled always triggers a channel_update announcement, and the enableRequest mechanism also sends one, there could be two updates for the same state change.
  2. Alternative approach (just continue and let enableRequest handle it):

    • Pros: This would avoid a potentially redundant channel_update if the enableRequest is guaranteed to fire shortly after and handle the state transition and announcement.
    • Cons: The primary drawback here is that the channel's internal state within chanStates would remain pending disabled for a potentially significant period (up to 19 minutes based on the default enableTimer mentioned in the PR description), even though m.cfg.IsChannelActive(chanID) indicates it's active. This temporary inconsistency could lead to other parts of the system making decisions based on an outdated internal state. Relying solely on the enableRequest to flip the state introduces a delay that might not be acceptable for internal state accuracy.

My Recommendation:

I lean towards the current PR's approach of explicitly calling m.chanStates.markEnabled(outpoint). Ensuring immediate internal state consistency is generally more critical for system correctness and predictability than avoiding a potentially redundant network announcement. An extra channel_update is typically less harmful than prolonged internal state inconsistency, especially when dealing with long timers like the 19-minute enableTimer.

Ideally, the markEnabled function itself should be idempotent and intelligent enough to only issue a channel_update announcement if the publicly announced state of the channel genuinely changes. If the channel was already considered enabled by the enableRequest (which might have fired just before this check), then markEnabled should be a no-op in terms of external announcements. If markEnabled is not already designed this way, it might be worth reviewing its implementation to ensure it minimizes redundant announcements while maintaining internal state integrity.
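
For illustration, a minimal sketch of such an idempotent transition (hypothetical types and field names, not the actual chanStates implementation), where a caller emits a channel_update only when the state actually flips:

package netannsketch

import "github.com/btcsuite/btcd/wire"

// channelState tracks the locally known status of a single channel.
type channelState struct {
	enabled bool
}

// chanStatesMap is a toy stand-in for the manager's channel-state store.
type chanStatesMap map[wire.OutPoint]*channelState

// markEnabled flips the channel to enabled and reports whether the state
// actually changed. A caller would only emit a channel_update when the
// returned value is true, so repeated calls stay a no-op externally.
func (c chanStatesMap) markEnabled(op wire.OutPoint) bool {
	state, ok := c[op]
	if !ok {
		c[op] = &channelState{enabled: true}
		return true
	}
	if state.enabled {
		return false
	}
	state.enabled = true
	return true
}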

Collaborator Author

@ziggie1984 ziggie1984 Oct 8, 2025

@gemini-code-assist I need to correct you though: there will be no channel_update message created in case we update the state of this channel in the disable method.

gemini-code-assist bot

Thank you for the clarification, @ziggie1984. That's a crucial detail that strengthens the argument for the current approach.

If m.chanStates.markEnabled(outpoint) does not generate a channel_update message when called within disableInactiveChannels, then the primary 'con' I identified for this approach is removed. This means we get the benefit of immediate internal state consistency (marking the channel as enabled in chanStates) without the overhead or potential redundancy of an extra network announcement.

This makes the current implementation even more robust, as it ensures the internal state accurately reflects the channel's active status without unnecessary external side effects. My recommendation to keep the explicit m.chanStates.markEnabled(outpoint) stands firm, as it prioritizes internal data integrity, which is paramount for preventing subtle bugs and ensuring predictable system behavior.

@ziggie1984
Collaborator Author

/gemini review

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request effectively addresses the race condition where a channel could be disabled despite being active. The fix in chan_status_manager.go is direct and correct: re-verifying the channel's active status before disabling it is the right approach. The new test case in chan_status_manager_test.go accurately simulates the race condition and confirms the fix works as expected. The release note is also clear and helpful for users. I've only found a minor typo in the release notes.

@ziggie1984
Collaborator Author

worked for the regtest node:

.703 [ERR] SRVR: Unable to retrieve initial bootstrap peers: no addresses found
2025-10-08 13:56:13.703 [DBG] SRVR: Waiting 1m0s before trying to locate bootstrap peers (attempt #140)
2025-10-08 13:56:51.521 [DBG] HLCK: Health check: chain backend invoking success callback
2025-10-08 13:57:13.701 [INF] DISC: Attempting to bootstrap with: Authenticated Channel Graph
2025-10-08 13:57:13.735 [ERR] SRVR: Unable to retrieve initial bootstrap peers: no addresses found
2025-10-08 13:57:13.735 [DBG] SRVR: Waiting 1m0s before trying to locate bootstrap peers (attempt #141)
2025-10-08 13:57:51.513 [DBG] HLCK: Health check: chain backend invoking success callback
2025-10-08 13:58:13.736 [INF] DISC: Attempting to bootstrap with: Authenticated Channel Graph
2025-10-08 13:58:13.756 [ERR] SRVR: Unable to retrieve initial bootstrap peers: no addresses found
2025-10-08 13:58:13.756 [DBG] SRVR: Waiting 1m0s before trying to locate bootstrap peers (attempt #142)
2025-10-08 13:58:51.511 [DBG] HLCK: Health check: chain backend invoking success callback
2025-10-08 13:58:57.832 [DBG] PEER: Peer(037969936f3e674c7686ce5b569c81262dbea85026e3d546f156c0b59daa76f9ad): Sending Ping(ping_bytes=00000030645c826a260c118ee37bf58fbe51cf98a33769c6130e0c6e8d36a4f6159c6e167a4dc1a6a9eeb03352169cde49ace3ce1fa9043dbd530f49a9a6e5aa3765d73a5943e668ffff7f2004000000) to 037969936f3e674c7686ce5b569c81262dbea85026e3d546f156c0b59daa76f9ad@127.0.0.1:9799
2025-10-08 13:58:57.839 [DBG] PEER: Peer(037969936f3e674c7686ce5b569c81262dbea85026e3d546f156c0b59daa76f9ad): Received Pong(len(pong_bytes)=981) from 037969936f3e674c7686ce5b569c81262dbea85026e3d546f156c0b59daa76f9ad@127.0.0.1:9799
2025-10-08 13:58:57.937 [DBG] PEER: Peer(037969936f3e674c7686ce5b569c81262dbea85026e3d546f156c0b59daa76f9ad): Received Ping(ping_bytes=00000030645c826a260c118ee37bf58fbe51cf98a33769c6130e0c6e8d36a4f6159c6e167a4dc1a6a9eeb03352169cde49ace3ce1fa9043dbd530f49a9a6e5aa3765d73a5943e668ffff7f2004000000) from 037969936f3e674c7686ce5b569c81262dbea85026e3d546f156c0b59daa76f9ad@127.0.0.1:9799
2025-10-08 13:58:57.937 [DBG] PEER: Peer(037969936f3e674c7686ce5b569c81262dbea85026e3d546f156c0b59daa76f9ad): Sending Pong(len(pong_bytes)=1161) to 037969936f3e674c7686ce5b569c81262dbea85026e3d546f156c0b59daa76f9ad@127.0.0.1:9799
2025-10-08 13:59:13.758 [INF] DISC: Attempting to bootstrap with: Authenticated Channel Graph
2025-10-08 13:59:13.838 [ERR] SRVR: Unable to retrieve initial bootstrap peers: no addresses found
2025-10-08 13:59:13.838 [DBG] SRVR: Waiting 1m0s before trying to locate bootstrap peers (attempt #143)
2025-10-08 13:59:51.522 [DBG] HLCK: Health check: chain backend invoking success callback
2025-10-08 13:59:57.829 [DBG] NANN: Channel(23a137da04ed053dad8aa397e2c2e1b127d8c781bab3fc4c1bb6052348110a98:0) already enabled, skipped announcement

@ziggie1984 ziggie1984 force-pushed the fix-status-manager-disable-issue branch from 33c7f05 to 380f79f Compare October 8, 2025 17:06
@saubyk saubyk removed this from lnd v0.20 Oct 8, 2025
Member

@yyforyongyu yyforyongyu left a comment

Looks like it's this case caught by the itest?

// testUpdateChanStatus checks that calls to the UpdateChanStatus RPC update
// the channel graph as expected, and that channel state is properly updated
// in the presence of interleaved node disconnects / reconnects.
//
// NOTE: this test can be flaky as we are testing the chan-enable-timeout and
// chan-disable-timeout flags here.
//
// For chan-enable-timeout, setting this value too small will cause an enabled
// channel never be considered active by our channel status manager. Upon
// reconnection, our Brontide will send a request to enable this channel after
// the "chan-enable-timeout" has passed. The request is handled by the channel
// status manager, which will check the channel's eligibility to forward links
// by asking htlcswitch/link. Meanwhile, the htlcswitch/link won't mark the
// link as eligible unless it has finished its initialization, which takes some
// time. Thus, if the Brontide sends a request too early it will get a false
// report saying the channel link is not eligible because that link hasn't
// finished its initialization.
//
// For chan-disable-timeout, setting this value too small will cause an already
// enabled channel being marked as disabled. For instance, if some operations
// take more than 5 seconds to finish, the channel will be marked as disabled,
// thus a following operation will fail if it relies on the channel being
// enabled.

Based on the description, I don't think you will have this issue if you have reasonable disable/enable timeout values, plus the channel will be re-enabled after chan-enable-timeout. Re this case,

T=0ms:    Channel disconnects
T=100ms:  Marked as pending disabled → Disable timer starts (1000ms)
T=800ms:  Channel reconnects → Enable timer starts (500ms) ❗ ---> we should cancel the disable timer before firing the enable timer
T=1100ms: Disable timer fires → Channel gets DISABLED ❌ although it is active
T=1300ms: Enable timer fires → Channel gets ENABLED

I think the fix should be that we cancel the disable timer before we start the enable timer.

@ziggie1984
Collaborator Author

@gemini-code-assist can you elaborate on this proposal: I think the fix should be we cancel the disable timer before we start the enable timer.

gemini-code-assist bot

That's a great question, @ziggie1984! Let's break down the proposal: "I think the fix should be we cancel the disable timer before we start the enable timer."

The core problem, as you've clearly outlined in the PR description, is a race condition where a channel might become active again after a disable timer has started, but before that timer fires. This could lead to the channel being unnecessarily disabled.

Current PR's Approach (Reactive Check):
The current changes in netann/chan_status_manager.go address this by adding a re-verification step when the disable timer actually fires. Specifically, at lines 571-586:

// Re-verify the channel is still inactive before disabling.
// This prevents the race condition where a channel becomes
// active again after being marked as pending disabled but
// before the disable timeout expires. Otherwise there is a
// race condition where the channel is disabled even though it
// is active.
chanID := lnwire.NewChanIDFromOutPoint(outpoint)
if m.cfg.IsChannelActive(chanID) {
	// Channel became active again, cancel the pending
	// disable.
	log.Debugf("Channel(%v) became active, canceling "+
		"scheduled disable", outpoint)

	m.chanStates.markEnabled(outpoint)

	continue
}

This code ensures that if a channel has become active again by the time its disable timer fires, it won't be disabled. It effectively "cancels" the disable by marking the channel as enabled and skipping the disable logic.

Your Proposal (Proactive Cancellation):
Your suggestion, "cancel the disable timer before we start the enable timer," implies a more proactive approach. This would mean that as soon as the system detects a channel has reconnected and is about to become active (and thus an enable timer would start), it should explicitly cancel any pending disable timer for that channel.

Benefits of Proactive Cancellation:

  1. Efficiency: By canceling the disable timer as soon as the channel is known to be active, you prevent the timer from even firing unnecessarily. The current approach lets the timer fire, then performs a check, which is slightly less efficient in terms of CPU cycles and event handling, although the overhead is likely minimal.
  2. Cleaner State Management: Explicitly canceling the timer might lead to a clearer mental model of the channel's state transitions. When a channel becomes active, its state immediately transitions away from "pending disabled," and any associated disable actions are preemptively stopped.

Comparison:
Both approaches effectively solve the race condition. The current PR's solution is a robust reactive check, ensuring correctness. Your proposed proactive cancellation is also correct and could offer a slight improvement in efficiency and state clarity by preventing the disable timer from ever reaching its callback if the channel has already reconnected.

It largely depends on how the timers are managed internally. If there's a mechanism to easily cancel a scheduled timer by its ID or associated channel, then the proactive approach could be a very clean way to handle it. If not, the current reactive check is a perfectly valid and safe solution.

@ziggie1984
Collaborator Author

@gemini-code-assist how would a proof of concept look if we start canceling the timer? Because currently the timer also checks all other channels, is it easily possible?

gemini-code-assist bot

That's an excellent follow-up question, @ziggie1984! Let's think about how a proof-of-concept for proactively canceling the disable timer might look, especially considering the current implementation's structure.

Current Timer Mechanism

As you noted, the disableInactiveChannels function iterates through all channels that are marked as pending disabled and then performs the re-verification check. The timer itself (which triggers disableInactiveChannels) is a single, periodic timer that processes a batch of channels.

Proof-of-Concept for Proactive Cancellation

To implement your suggestion of "canceling the disable timer before we start the enable timer" in a proactive manner, we would need a more granular control over individual channel disable timers. Here's a conceptual outline:

  1. Individual Disable Timers: Instead of a single periodic timer that checks all channels, each time a channel is marked pending disabled, a specific timer would be started for that channel's disableTimeout.

    • This timer would be associated directly with the outpoint (channel ID).
    • When this individual timer fires, it would then perform the disable logic for only that specific channel.
  2. Tracking Active Timers: The ChanStatusManager would need to maintain a data structure (e.g., a map[wire.OutPoint]*time.Timer) to keep track of these active disable timers.

  3. Proactive Cancellation Logic: When a channel reconnects and becomes active (and an enable timer would start):

    • The system would check if there's an active disable timer for that specific outpoint in our new tracking map.
    • If found, it would explicitly call timer.Stop() on that specific timer instance.
    • After stopping, the entry for that channel would be removed from the tracking map.

Example Flow (Conceptual):

Let's revisit your timeline with this proactive approach:

  • T=0ms: Channel disconnects.
  • T=100ms: Channel marked pending disabled. A new timer is created and started for this specific channel, set to fire at T=1100ms. This timer's reference is stored in ChanStatusManager's map.
  • T=800ms: Channel reconnects. The system immediately checks if a disable timer exists for this channel. It finds the timer started at T=100ms. It calls timer.Stop() on that specific timer. The channel is then marked active, and an enable timer might start (or it's immediately enabled if no timer is needed).
  • T=1100ms: The previously scheduled disable timer does not fire because it was proactively stopped.

Challenges and Considerations:

  • Complexity of Timer Management: Go's time.Timer objects can be stopped, but managing a large number of individual timers (one per pending disabled channel) adds overhead. You'd need to ensure proper cleanup and avoid memory leaks if timers aren't stopped correctly.
  • Synchronization: Access to the map of active timers would need to be carefully synchronized to prevent race conditions.
  • Existing disableInactiveChannels: The current disableInactiveChannels function processes all channels in a batch. This would need to be refactored significantly to accommodate individual timers. The current approach is simpler in terms of a single periodic task.

While conceptually cleaner for the specific channel's lifecycle, this proactive approach introduces more complex state management within the ChanStatusManager itself. The current reactive check, which re-verifies the channel's status just before disabling, is simpler to implement and still guarantees correctness, even if it's slightly less efficient.
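
For illustration, a minimal sketch of the per-channel timer tracking described above (hypothetical struct and method names; the real ChanStatusManager instead uses a single periodic sampler over all pending channels):

package netannsketch

import (
	"sync"
	"time"

	"github.com/btcsuite/btcd/wire"
)

// timerManager tracks one pending-disable timer per channel so that a
// reconnect can proactively cancel the disable before it fires.
type timerManager struct {
	mu     sync.Mutex
	timers map[wire.OutPoint]*time.Timer
}

func newTimerManager() *timerManager {
	return &timerManager{timers: make(map[wire.OutPoint]*time.Timer)}
}

// scheduleDisable starts (or restarts) the disable timer for a channel.
// When the timer fires, disableFn runs for that channel only.
func (t *timerManager) scheduleDisable(op wire.OutPoint, timeout time.Duration,
	disableFn func(wire.OutPoint)) {

	t.mu.Lock()
	defer t.mu.Unlock()

	// Stop any previously scheduled disable for this channel first.
	if old, ok := t.timers[op]; ok {
		old.Stop()
	}

	t.timers[op] = time.AfterFunc(timeout, func() {
		t.mu.Lock()
		delete(t.timers, op)
		t.mu.Unlock()

		disableFn(op)
	})
}

// cancelDisable stops the pending disable timer, if any, e.g. when the
// channel reconnects and the enable timer is about to start.
func (t *timerManager) cancelDisable(op wire.OutPoint) {
	t.mu.Lock()
	defer t.mu.Unlock()

	if timer, ok := t.timers[op]; ok {
		timer.Stop()
		delete(t.timers, op)
	}
}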

Successfully merging this pull request may close these issues.

[bug]: Most peers disabled channels ~15 minutes after restart