easwars
Contributor

@easwars easwars commented Oct 3, 2025

Fixes #8125

The original race in the xDS client:

  • Resource watch is cancelled by the user of the xdsClient (e.g. xdsResolver)
  • xdsClient removes the resource from its cache and queues an unsubscribe request to the ADS stream.
  • A watch for the same resource is registered immediately, and the xdsClient instructs the ADS stream to subscribe (as it's not in cache).
  • The ADS stream sends a redundant request (same resources, version, nonce) which the management server ignores.
  • The new resource watch sees a "resource-not-found" error once the watch timer fires.

The original fix:

Delay the resource's removal from the cache until the unsubscribe request was transmitted over the wire, a change implemented in #8369. However, this solution introduced new complications:

  • The resource's removal from the xdsClient's cache became an asynchronous operation, occurring while the unsubscribe request was being sent.
  • This asynchronous behavior meant the state maintained within the ADS stream could still diverge from the cache's state.
  • A critical section was absent between the ADS stream's message transmission logic and the xdsClient's cache access, which is performed during subscription/unsubscription by its users.

The root cause of the previously seen races comes down to two things:

  • Batching of writes for subscribe and unsubscribe calls
    • After batching, it may appear that nothing has changed in the list of subscribed resources, even though a resource was removed and added again, and therefore the management server would not send any response. It is important that the management server see the exact sequence of subscribe and unsubscribe calls.
  • State maintained in the ADS stream going out of sync with the state maintained in the resource cache
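To make the batching problem concrete, here is a minimal sketch. It is not the xdsClient code: `op`, `names`, and `replay` are hypothetical names, and a request is modelled only as a sorted list of resource names. It replays an unsubscribe of A followed by a re-subscribe of A, with and without batching.

```go
package main

import (
	"fmt"
	"sort"
)

// op is a hypothetical subscribe (add == true) or unsubscribe (add == false)
// call for a single resource name.
type op struct {
	name string
	add  bool
}

// names returns the sorted resource names in the subscription set. It stands
// in for the resource names carried by a discovery request.
func names(subs map[string]bool) []string {
	out := make([]string, 0, len(subs))
	for n := range subs {
		out = append(out, n)
	}
	sort.Strings(out)
	return out
}

// replay applies ops to the initial subscription set and returns the requests
// that would be written. With batch set, only the final state is written;
// otherwise one request is written per op.
func replay(initial []string, ops []op, batch bool) [][]string {
	subs := make(map[string]bool)
	for _, n := range initial {
		subs[n] = true
	}
	var reqs [][]string
	for _, o := range ops {
		if o.add {
			subs[o.name] = true
		} else {
			delete(subs, o.name)
		}
		if !batch {
			reqs = append(reqs, names(subs))
		}
	}
	if batch {
		reqs = append(reqs, names(subs))
	}
	return reqs
}

func main() {
	ops := []op{{name: "A", add: false}, {name: "A", add: true}}
	fmt.Println("batched:  ", replay([]string{"A", "B"}, ops, true))
	fmt.Println("unbatched:", replay([]string{"A", "B"}, ops, false))
}
```

In the batched case the only request written lists A and B, which is identical to what the server already has, so it has no reason to respond. In the unbatched case the server first sees a request without A and then one with A again.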

How does this PR address the above issue?

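This PR simplifies the implementation of the ADS stream by removing two pieces of functionality (write batching and write buffering during flow control) and by queueing the exact request to be sent: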

  • Stop batching of writes on the ADS stream
    • If the user registers multiple watches, e.g. resource A, B, and C, the stream would now send three requests: [A], [A B], [A B C].
  • Queue the exact request to be sent out based on the current state
    • As part of handling a subscribe/unsubscribe request, the ADS stream implementation will queue the exact request to be sent out. When asynchronously sending the request out, it will not use the current state, but instead just write the queued request on the wire.
  • Don't buffer writes when waiting for flow control
    • Flow control is already blocking reads from the stream. Blocking writes as well during this period might provide some additional flow control, but not much, and removing this logic simplifies the stream implementation quite a bit.
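The "queue the exact request" idea above can be sketched roughly as follows. This is an illustration under assumptions, not the code in `ads_stream.go`: the `stream` type, its fields, and `drain` are hypothetical, and a request is reduced to a list of names. The point is that the snapshot is taken under the lock at subscribe/unsubscribe time, and the sender never consults the current state.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// stream is a hypothetical, heavily simplified stand-in for the ADS stream.
type stream struct {
	mu         sync.Mutex
	subscribed map[string]bool
	queue      [][]string // exact requests awaiting transmission, in order
}

func newStream() *stream {
	return &stream{subscribed: make(map[string]bool)}
}

// enqueueLocked snapshots the current subscription set and queues it.
// The caller must hold s.mu.
func (s *stream) enqueueLocked() {
	names := make([]string, 0, len(s.subscribed))
	for n := range s.subscribed {
		names = append(names, n)
	}
	sort.Strings(names)
	s.queue = append(s.queue, names)
}

func (s *stream) subscribe(name string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.subscribed[name] = true
	s.enqueueLocked()
}

func (s *stream) unsubscribe(name string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.subscribed, name)
	s.enqueueLocked()
}

// drain passes every queued request to send, in order. It writes what was
// queued and does not look at s.subscribed.
func (s *stream) drain(send func([]string)) {
	s.mu.Lock()
	pending := s.queue
	s.queue = nil
	s.mu.Unlock()
	for _, req := range pending {
		send(req)
	}
}

func main() {
	s := newStream()
	for _, n := range []string{"A", "B", "C"} {
		s.subscribe(n)
	}
	s.drain(func(req []string) { fmt.Println(req) })
}
```

Registering watches for A, B, and C produces three queued requests, matching the [A], [A B], [A B C] sequence described in the list above.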

RELEASE NOTES:

  • xdsclient: fix a race in the xdsClient that could lead to resource-not-found errors

@easwars easwars added the Type: Bug and Area: xDS labels Oct 3, 2025
@easwars easwars added this to the 1.77 Release milestone Oct 3, 2025

codecov bot commented Oct 3, 2025

Codecov Report

❌ Patch coverage is 65.00000% with 14 lines in your changes missing coverage. Please review.
✅ Project coverage is 81.51%. Comparing base (8389ddb) to head (8695d83).
⚠️ Report is 15 commits behind head on master.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| internal/xds/clients/xdsclient/ads_stream.go | 59.25% | 6 Missing and 5 partials ⚠️ |
| internal/xds/clients/internal/buffer/unbounded.go | 75.00% | 2 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8627      +/-   ##
==========================================
- Coverage   82.13%   81.51%   -0.62%     
==========================================
  Files         415      416       +1     
  Lines       40711    40894     +183     
==========================================
- Hits        33437    33335     -102     
- Misses       5897     6131     +234     
- Partials     1377     1428      +51     
| Files with missing lines | Coverage Δ |
| --- | --- |
| internal/xds/clients/xdsclient/channel.go | 77.12% <100.00%> (-1.73%) ⬇️ |
| internal/xds/clients/internal/buffer/unbounded.go | 83.33% <75.00%> (-16.67%) ⬇️ |
| internal/xds/clients/xdsclient/ads_stream.go | 51.39% <59.25%> (-32.77%) ⬇️ |

... and 50 files with indirect coverage changes

@easwars easwars requested review from arjan-bal and dfawley October 7, 2025 23:32
@easwars
Contributor Author

easwars commented Oct 7, 2025

@danielzhaotongliu

@arjan-bal
Contributor

Could you explain how the new logic prevents the following race condition that can cause a channel to briefly enter TRANSIENT_FAILURE?
I'm thinking of this specific scenario:

  1. Channel 0, which is watching LDS resource L0, is closed. The xDS client queues an unsubscribe request for L0 to be sent to the management server.
  2. Immediately after, a new channel (Channel 1) starts and its xDS resolver registers a watch for the exact same resource, L0. The client now queues a new subscription request for L0.
  3. Before the management server processes the new subscription for L0, it sends a DiscoveryResponse based on the previous state (where the client had unsubscribed). This response therefore excludes L0.
  4. The xDS client receives this response and notifies the active watcher for L0 (belonging to Channel 1) that the resource is missing. This causes Channel 1 to enter TRANSIENT_FAILURE until the next update from the server includes L0.

How does this PR ensure that the watcher for Channel 1 isn't incorrectly notified of a missing resource in this situation?

@arjan-bal arjan-bal assigned easwars and unassigned arjan-bal Oct 13, 2025
@easwars
Contributor Author

easwars commented Oct 13, 2025

The above condition that you mention will not lead to Channel 1 entering TRANSIENT_FAILURE. This case is taken care of explicitly here:

if state.cache == nil {

In fact, this is a case that can happen even without any of the race conditions involving the xDS client, and it is also explicitly mentioned in the envoy xDS documentation here: https://www.envoyproxy.io/docs/envoy/latest/api-docs/xds_protocol#knowing-when-a-requested-resource-does-not-exist

Hope this helps.

@easwars easwars assigned arjan-bal and unassigned easwars Oct 13, 2025
Contributor

@arjan-bal arjan-bal left a comment

LGTM

return
}

b.backlog = nil
Contributor

nit: Will setting b.backlog = b.backlog[:0] be better since it avoids re-allocation?

Contributor Author

Done. Thanks.
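For readers unfamiliar with the difference raised in this nit, a small sketch follows. The helper names `resetNil` and `resetTruncate` are made up for illustration and `[]int` is an arbitrary element type. Assigning `nil` drops the backing array, while re-slicing to `[:0]` keeps it for reuse.

```go
package main

import "fmt"

// resetNil discards the backing array; the next append must allocate.
func resetNil(b []int) []int {
	return nil
}

// resetTruncate keeps the backing array and only sets the length to zero,
// so later appends can reuse the existing capacity.
func resetTruncate(b []int) []int {
	return b[:0]
}

func main() {
	backlog := make([]int, 0, 8)
	backlog = append(backlog, 1, 2, 3)

	a := resetNil(backlog)
	b := resetTruncate(backlog)

	fmt.Println(len(a), cap(a)) // 0 0
	fmt.Println(len(b), cap(b)) // 0 8
}
```

One trade-off to keep in mind: with `[:0]`, the old elements stay in the backing array until they are overwritten, so if they hold pointers the referenced values remain reachable until then.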

// resource is stopped if one is active. A discovery request is sent out on the
// stream for the resource type when there is sufficient flow control quota.
// stream for the resource type with the updated set of resource names.
func (s *adsStreamImpl) Unsubscribe(typ ResourceType, name string) {
Contributor

nit: I noticed that the subscribe method is unexported, while Unsubscribe is exported. Maybe we should unexport Unsubscribe too?

Contributor Author

Done.

@arjan-bal arjan-bal removed their assignment Oct 16, 2025
@arjan-bal
Contributor

The above condition that you mention will not lead to Channel 1 entering TRANSIENT_FAILURE. This case is taken care of explicitly here:

if state.cache == nil {

In fact, this is a case that can happen even without any of the race conditions involving the xDS client, and it is also explicitly mentioned in the envoy xDS documentation here: https://www.envoyproxy.io/docs/envoy/latest/api-docs/xds_protocol#knowing-when-a-requested-resource-does-not-exist

Hope this helps.

I see. The difference between the scenario I asked about and the problem caused by the initial fix is whether the LDS resource is present in the cache. If L0 is present in the cache but not in the latest update from the management server, the channel is put in TF. However, if L0 is not present in the cache, we wait for the expiry timer to expire.

@arjan-bal
Contributor

Though the PR description doesn't explicitly call out how the change addresses the race condition, my understanding is that the race is caused because the writes for subscribe and unsubscribe calls are batched. After batching, it may appear that nothing has changed in the list of subscribed resources, even though a resource was removed and added again. By removing batching of writes, the management server will see the unsubscription and re-subscription of the resource and send the necessary update.

@easwars
Contributor Author

easwars commented Oct 16, 2025

If L0 is present in the cache but not in the latest update from the management server, the channel is put in TF. However, if L0 is not present in the cache, we wait for the expiry timer to expire.

That is true.

One caveat on the first part: a resource that is in the cache but missing from a management server response moves the channel to TF only for LDS and CDS. For RDS and EDS, if a similar scenario happens, we would simply continue using the resource in the cache.
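The behaviour discussed in this thread can be summarised as a small decision function. This is only a sketch of the rules as stated in the comments above, not the xdsClient implementation; `resourceType`, `action`, and `onAbsentFromResponse` are hypothetical names.

```go
package main

import "fmt"

type resourceType string

const (
	LDS resourceType = "LDS"
	CDS resourceType = "CDS"
	RDS resourceType = "RDS"
	EDS resourceType = "EDS"
)

type action string

const (
	waitForTimer  action = "wait for the watch expiry timer"
	reportMissing action = "report the resource as missing"
	keepCached    action = "keep using the cached resource"
)

// onAbsentFromResponse returns what happens to a watched resource that is
// absent from a response, following the rules described in the discussion.
func onAbsentFromResponse(typ resourceType, inCache bool) action {
	if !inCache {
		// Nothing was ever received for this resource, so absence alone is
		// not treated as "does not exist"; only the timer can decide that.
		return waitForTimer
	}
	switch typ {
	case LDS, CDS:
		return reportMissing
	default:
		return keepCached
	}
}

func main() {
	fmt.Println("LDS, not cached:", onAbsentFromResponse(LDS, false))
	fmt.Println("LDS, cached:    ", onAbsentFromResponse(LDS, true))
	fmt.Println("EDS, cached:    ", onAbsentFromResponse(EDS, true))
}
```

In the Channel 0 / Channel 1 scenario, L0 has already been removed from the cache when the stale response arrives, so the first branch applies and the new watcher is not notified.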

@easwars
Contributor Author

easwars commented Oct 16, 2025

Though the PR description doesn't explicitly call out how the change addresses the race condition, my understanding is that the race is caused because the writes for subscribe and unsubscribe calls are batched. After batching, it may appear that nothing has changed in the list of subscribed resources, even though a resource was removed and added again. By removing batching of writes, the management server will see the unsubscription and re-subscription of the resource and send the necessary update.

Your understanding is correct. I've updated the PR description to clarify the underlying issues causing the race and how the change addresses them. Thanks.

//
// It's expected to be used in scenarios where the buffered data is no longer
// relevant, and needs to be cleared.
func (b *Unbounded) Reset() {
Member

This is inherently racy with other calls to Put() (and Load(), of course). I hope we are using it correctly! :) And if we have an external lock that is used for all things that call Put and Reset, I wonder if a different data structure might be more efficient?

Contributor

This is inherently racy with other calls to Put() (and Load(), of course). I hope we are using it correctly!

I had the same question. My understanding is that Reset acts as an optimization to avoid redundant requests, but it isn't required for correctness.

Even so, you're right about the potential for races. I think adding a godoc comment to warn about this is a good idea.

state, ok := s.resourceTypeState[typ]
if !ok {
// State is created when the first subscription for this type is made.
panic(fmt.Sprintf("no state exists for resource type %v", typ))
Member

Do we want a production panic? Or should this be log.Errorf() (which fails all tests) and a return <some error> (or nil since the stream is still usable, depending on what it means to return an error?)? The latter feels safer.
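A rough sketch of the non-panicking alternative suggested here might look like the following. Everything in it is an assumption for illustration: `stream`, `typeState`, and `stateFor` are hypothetical, the resource type is a plain string, and the standard `log` package stands in for whatever logger the real code would use.

```go
package main

import (
	"fmt"
	"log"
)

// typeState is a hypothetical per-resource-type state.
type typeState struct {
	subscribed map[string]bool
}

// stream is a hypothetical, simplified stand-in for the ADS stream.
type stream struct {
	states map[string]*typeState
}

// stateFor returns the state for typ, or an error instead of panicking when
// no subscription was ever made for that type.
func (s *stream) stateFor(typ string) (*typeState, error) {
	state, ok := s.states[typ]
	if !ok {
		return nil, fmt.Errorf("no state exists for resource type %q", typ)
	}
	return state, nil
}

// unsubscribe logs and returns on a missing state, leaving the stream usable.
func (s *stream) unsubscribe(typ, name string) {
	state, err := s.stateFor(typ)
	if err != nil {
		log.Printf("unsubscribe(%q, %q) ignored: %v", typ, name, err)
		return
	}
	delete(state.subscribed, name)
}

func main() {
	s := &stream{states: map[string]*typeState{
		"LDS": {subscribed: map[string]bool{"L0": true}},
	}}
	s.unsubscribe("CDS", "C0") // logged and ignored; no panic
	s.unsubscribe("LDS", "L0")
	fmt.Println(len(s.states["LDS"].subscribed))
}
```

Whether the error should also be surfaced to the caller depends on what an error return would mean for the stream, which is the open question in the comment above.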


// Clear any queued requests. Previously subscribed to resources will be
// resent below.
s.requestCh.Reset()
Member

Do we need to mandate that s.mu is held when calling Put or Reset?

Maybe we should wrap accesses to it somehow to guarantee that?

Development

Successfully merging this pull request may close these issues.

xdsclient: race around resource subscriptions and unsubscsriptions

3 participants