xdsclient: stop batching writes on the ADS stream #8627

easwars · 2025-10-03T21:36:42Z

The original race in the xDS client:

1. Resource watch is cancelled by the user of the xdsClient (e.g. xdsResolver).
2. xdsClient removes the resource from its cache and queues an unsubscribe request to the ADS stream.
3. A watch for the same resource is registered immediately, and the xdsClient instructs the ADS stream to subscribe (as it's not in cache).
4. The ADS stream sends a redundant request (same resources, version, nonce) which the management server ignores.
5. The new resource watch sees a "resource-not-found" error once the watch timer fires.

The original fix:

Delay the resource's removal from the cache until the unsubscribe request was transmitted over the wire, a change implemented in #8369. However, this solution introduced new complications:

The resource's removal from the xdsClient's cache became an asynchronous operation, occurring while the unsubscribe request was being sent.
This asynchronous behavior meant the state maintained within the ADS stream could still diverge from the cache's state.
A critical section was absent between the ADS stream's message transmission logic and the xdsClient's cache access, which is performed during subscription/unsubscription by its users.

The root cause of the previously seen races can be put down to two things:

Batching of writes for subscribe and unsubscribe calls
- After batching, it may appear that nothing has changed in the list of subscribed resources, even though a resource was removed and added again, and therefore the management server would not send any response. It is important that the management server see the exact sequence of subscribe and unsubscribe calls.
State maintained in the ADS stream going out of sync with the state maintained in the resource cache
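
To make the batching problem concrete, here is a minimal sketch (not the gRPC-Go implementation). The types `subscriptions` and `op` and the functions `batched` and `unbatched` are hypothetical names introduced only for illustration, and a request is reduced to a comma-joined list of resource names. It compares the requests produced when an unsubscribe and a re-subscribe for `L0` are collapsed into one write against the requests produced when each call gets its own write.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// subscriptions is a hypothetical stand-in for the set of resource names
// tracked for a single resource type.
type subscriptions map[string]bool

// op is a single subscribe (add == true) or unsubscribe (add == false) call.
type op struct {
	add  bool
	name string
}

func newSubscriptions(initial []string) subscriptions {
	s := subscriptions{}
	for _, n := range initial {
		s[n] = true
	}
	return s
}

func (s subscriptions) apply(o op) {
	if o.add {
		s[o.name] = true
	} else {
		delete(s, o.name)
	}
}

// names returns the sorted, comma-joined resource names, standing in for
// the resource names carried by a discovery request.
func (s subscriptions) names() string {
	out := make([]string, 0, len(s))
	for n := range s {
		out = append(out, n)
	}
	sort.Strings(out)
	return strings.Join(out, ",")
}

// batched applies every op first and emits one request from the final state.
func batched(initial []string, ops []op) []string {
	s := newSubscriptions(initial)
	for _, o := range ops {
		s.apply(o)
	}
	return []string{s.names()}
}

// unbatched emits one request after every op.
func unbatched(initial []string, ops []op) []string {
	s := newSubscriptions(initial)
	var reqs []string
	for _, o := range ops {
		s.apply(o)
		reqs = append(reqs, s.names())
	}
	return reqs
}

func main() {
	ops := []op{{false, "L0"}, {true, "L0"}} // unsubscribe L0, then re-subscribe
	fmt.Printf("%q\n", batched([]string{"L0", "L1"}, ops))   // ["L0,L1"]
	fmt.Printf("%q\n", unbatched([]string{"L0", "L1"}, ops)) // ["L1" "L0,L1"]
}
```

In the batched case the single request is identical to what was subscribed before, so a server has no reason to respond. In the unbatched case the intermediate request without `L0` is visible on the wire.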

How does this PR address the above issue?

This PR simplifies the implementation of the ADS stream by making the following changes:

Stop batching of writes on the ADS stream
- If the user registers multiple watches, e.g. resource A, B, and C, the stream would now send three requests: [A], [A B], [A B C].
Queue the exact request to be sent out based on the current state
- As part of handling a subscribe/unsubscribe request, the ADS stream implementation will queue the exact request to be sent out. When asynchronously sending the request out, it will not use the current state, but instead just write the queued request on the wire.
Don't buffer writes when waiting for flow control
- Flow control is already blocking reads from the stream. Blocking writes as well during this period might provide some additional flow control, but not much, and removing this logic simplifies the stream implementation quite a bit.
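
The "queue the exact request" idea can be sketched as follows. This is an illustration under simplifying assumptions, not the code in `ads_stream.go`: the `stream` and `request` types and the `subscribe`, `unsubscribe`, and `drain` methods are hypothetical, and a plain slice stands in for the real request queue. The point it shows is that each call snapshots the resource names while holding the lock, and the sender later writes that snapshot without consulting current state.

```go
package main

import (
	"fmt"
	"sync"
)

// request is a hypothetical, simplified discovery request.
type request struct {
	names []string
}

// stream is a hypothetical stand-in for the ADS stream's per-type state.
type stream struct {
	mu         sync.Mutex
	subscribed []string  // current subscriptions, in subscription order
	queue      []request // exact requests waiting to be written
}

// enqueueLocked copies the current subscriptions into a new request.
// The caller must hold s.mu.
func (s *stream) enqueueLocked() {
	snapshot := append([]string(nil), s.subscribed...)
	s.queue = append(s.queue, request{names: snapshot})
}

func (s *stream) subscribe(name string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.subscribed = append(s.subscribed, name)
	s.enqueueLocked()
}

func (s *stream) unsubscribe(name string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	kept := make([]string, 0, len(s.subscribed))
	for _, n := range s.subscribed {
		if n != name {
			kept = append(kept, n)
		}
	}
	s.subscribed = kept
	s.enqueueLocked()
}

// drain hands back the queued requests in order, as a sender would write
// them. It never looks at s.subscribed.
func (s *stream) drain() []request {
	s.mu.Lock()
	defer s.mu.Unlock()
	out := s.queue
	s.queue = nil
	return out
}

func main() {
	s := &stream{}
	for _, n := range []string{"A", "B", "C"} {
		s.subscribe(n)
	}
	for _, r := range s.drain() {
		fmt.Println(r.names) // [A], then [A B], then [A B C]
	}
}
```

Because every queued request owns its own copy of the names, a later subscribe or unsubscribe cannot alter a request that is already waiting to be sent.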

RELEASE NOTES:

xdsclient: fix a race in the xdsClient that could lead to resource-not-found errors

codecov · 2025-10-03T21:40:18Z

Codecov Report

❌ Patch coverage is 65.00000% with 14 lines in your changes missing coverage. Please review.
✅ Project coverage is 81.51%. Comparing base (8389ddb) to head (8695d83).
⚠️ Report is 15 commits behind head on master.

Files with missing lines	Patch %	Lines
internal/xds/clients/xdsclient/ads_stream.go	59.25%	6 Missing and 5 partials ⚠️
internal/xds/clients/internal/buffer/unbounded.go	75.00%	2 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #8627      +/-   ##
==========================================
- Coverage   82.13%   81.51%   -0.62%     
==========================================
  Files         415      416       +1     
  Lines       40711    40894     +183     
==========================================
- Hits        33437    33335     -102     
- Misses       5897     6131     +234     
- Partials     1377     1428      +51

Files with missing lines	Coverage Δ
internal/xds/clients/xdsclient/channel.go	`77.12% <100.00%> (-1.73%)`	⬇️
internal/xds/clients/internal/buffer/unbounded.go	`83.33% <75.00%> (-16.67%)`	⬇️
internal/xds/clients/xdsclient/ads_stream.go	`51.39% <59.25%> (-32.77%)`	⬇️

... and 50 files with indirect coverage changes

easwars · 2025-10-07T23:35:24Z

@danielzhaotongliu

arjan-bal · 2025-10-13T12:12:45Z

Could you explain how the new logic prevents the following race condition that can cause a channel to briefly enter TRANSIENT_FAILURE?
I'm thinking of this specific scenario:

Channel 0, which is watching LDS resource L0, is closed. The xDS client queues an unsubscribe request for L0 to be sent to the management server.
Immediately after, a new channel (Channel 1) starts and its xDS resolver registers a watch for the exact same resource, L0. The client now queues a new subscription request for L0.
Before the management server processes the new subscription for L0, it sends a DiscoveryResponse based on the previous state (where the client had unsubscribed). This response therefore excludes L0.
The xDS client receives this response and notifies the active watcher for L0 (belonging to Channel 1) that the resource is missing. This causes Channel 1 to enter TRANSIENT_FAILURE until the next update from the server includes L0.

How does this PR ensure that the watcher for Channel 1 isn't incorrectly notified of a missing resource in this situation?

easwars · 2025-10-13T19:15:40Z

The above condition that you mention will not lead to Channel 1 entering TRANSIENT_FAILURE. This case is taken care of explicitly here:

grpc-go/internal/xds/clients/xdsclient/authority.go

Line 444 in 8110884

if state.cache == nil {

In fact, this is a case that can happen even without any of the race conditions involving the xDS client, and it is also explicitly mentioned in the Envoy xDS documentation here: https://www.envoyproxy.io/docs/envoy/latest/api-docs/xds_protocol#knowing-when-a-requested-resource-does-not-exist

Hope this helps.

arjan-bal

LGTM

arjan-bal · 2025-10-14T15:53:14Z

internal/xds/clients/internal/buffer/unbounded.go

+		return
+	}
+
+	b.backlog = nil


nit: Would setting `b.backlog = b.backlog[:0]` be better, since it avoids re-allocation?

Done. Thanks.
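
For readers unfamiliar with the difference behind this nit, the following standalone sketch contrasts the two ways of emptying a slice. `resetNil` and `resetTruncate` are hypothetical helpers written for this example and are not part of the `buffer` package.

```go
package main

import "fmt"

// resetNil drops the slice entirely; the next append must allocate.
func resetNil(backlog []int) []int {
	return nil
}

// resetTruncate keeps the backing array, so later appends can reuse it.
func resetTruncate(backlog []int) []int {
	return backlog[:0]
}

func main() {
	a := resetNil(make([]int, 4, 8))
	b := resetTruncate(make([]int, 4, 8))
	fmt.Println(len(a), cap(a)) // 0 0
	fmt.Println(len(b), cap(b)) // 0 8
}
```

One trade-off to keep in mind: truncating retains the backing array, so elements that hold pointers stay reachable until they are overwritten or explicitly zeroed.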

arjan-bal · 2025-10-16T08:51:43Z

internal/xds/clients/xdsclient/ads_stream.go

 // resource is stopped if one is active. A discovery request is sent out on the
-// stream for the resource type when there is sufficient flow control quota.
+// stream for the resource type with the updated set of resource names.
 func (s *adsStreamImpl) Unsubscribe(typ ResourceType, name string) {


nit: I noticed that the subscribe method is unexported, while Unsubscribe is exported. Maybe we should unexport Unsubscribe too?

arjan-bal · 2025-10-16T09:02:52Z

> The above condition that you mention will not lead to Channel 1 entering TRANSIENT_FAILURE. This case is taken care of explicitly here:
>
> grpc-go/internal/xds/clients/xdsclient/authority.go
>
> Line 444 in 8110884
>
> if state.cache == nil {
>
> In fact, this is a case that can happen even without any of the race conditions involving the xDS client, and it is also explicitly mentioned in the Envoy xDS documentation here: https://www.envoyproxy.io/docs/envoy/latest/api-docs/xds_protocol#knowing-when-a-requested-resource-does-not-exist
>
> Hope this helps.

I see. The difference between the scenario I asked about and the problem caused by the initial fix is the presence of the LDS resource in the cache. If L0 is present in the cache but not in the latest update from the management server, the channel is put in TF. However, if L0 is not present in the cache, we wait for the watch expiry timer to fire.

arjan-bal · 2025-10-16T09:08:25Z

Though the PR description doesn't explicitly call out how the change addresses the race condition, my understanding is that the race is caused because the writes for subscribe and unsubscribe calls are batched. After batching, it may appear that nothing has changed in the list of subscribed resources, even though a resource was removed and added again. By removing batching of writes, the management server will see the unsubscription and re-subscription of the resource and send the necessary update.

easwars · 2025-10-16T19:13:08Z

> If L0 is present in the cache but not in the latest update from the management server, the channel is put in TF. However, if L0 is not present in the cache, we wait for the expiry timer to expire.

That is true.

And the first part about the resource being in the cache, but not in a response from the management server resulting in the channel moving to TF applies only to LDS and CDS. For RDS and EDS, if a similar scenario happens, we would simply continue using the resource in the cache.
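
The rules discussed in this thread can be summarized in a small decision function. This is only a sketch of the behavior as described in the comments above, not the logic in `authority.go`; the `action` type, its constants, and `onAbsentFromResponse` are hypothetical names, and resource types are represented as plain strings.

```go
package main

import "fmt"

// action describes what happens when a watched resource is absent from a
// response sent by the management server.
type action string

const (
	waitForTimer  action = "wait for the watch expiry timer"
	reportMissing action = "notify watchers that the resource does not exist"
	keepCached    action = "keep using the cached resource"
)

// onAbsentFromResponse encodes the rules from the discussion: an uncached
// resource waits for the timer; a cached LDS or CDS resource is reported
// as missing; a cached RDS or EDS resource continues to be used.
func onAbsentFromResponse(resourceType string, cached bool) action {
	if !cached {
		return waitForTimer
	}
	switch resourceType {
	case "LDS", "CDS":
		return reportMissing
	default:
		return keepCached
	}
}

func main() {
	fmt.Println(onAbsentFromResponse("LDS", false)) // wait for the watch expiry timer
	fmt.Println(onAbsentFromResponse("LDS", true))  // notify watchers that the resource does not exist
	fmt.Println(onAbsentFromResponse("EDS", true))  // keep using the cached resource
}
```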

easwars · 2025-10-16T19:28:37Z

> Though the PR description doesn't explicitly call out how the change addresses the race condition, my understanding is that the race is caused because the writes for subscribe and unsubscribe calls are batched. After batching, it may appear that nothing has changed in the list of subscribed resources, even though a resource was removed and added again. By removing batching of writes, the management server will see the unsubscription and re-subscription of the resource and send the necessary update.

Your understanding is correct. I've clarified in the PR description the exact underlying issue causing the race and how the change addresses it. Thanks.

dfawley · 2025-10-17T18:14:03Z

internal/xds/clients/internal/buffer/unbounded.go

+//
+// It's expected to be used in scenarios where the buffered data is no longer
+// relevant, and needs to be cleared.
+func (b *Unbounded) Reset() {


This is inherently racy with other calls to Put() (and Load(), of course). I hope we are using it correctly! :) And if we have an external lock that is used for all things that call Put and Reset, I wonder if a different data structure might be more efficient?

> This is inherently racy with other calls to Put() (and Load(), of course). I hope we are using it correctly!

I had the same question. My understanding is that Reset acts as an optimization to avoid redundant requests, but it isn't required for correctness.

Even so, you're right about the potential for races. I think adding a godoc comment to warn about this is a good idea.

dfawley · 2025-10-17T18:19:25Z

internal/xds/clients/xdsclient/ads_stream.go

+	state, ok := s.resourceTypeState[typ]
+	if !ok {
+		// State is created when the first subscription for this type is made.
+		panic(fmt.Sprintf("no state exists for resource type %v", typ))


Do we want a production panic? Or should this be log.Errorf() (which fails all tests) and a return <some error> (or nil since the stream is still usable, depending on what it means to return an error?)? The latter feels safer.
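
A sketch of the non-panicking alternative suggested here might look like the following. It is an illustration only: `typeState`, `errNoState`, and `namesFor` are hypothetical, resource types are plain strings, and the logging call mentioned in the comment is left out because its exact form depends on the logger in use.

```go
package main

import (
	"errors"
	"fmt"
)

// typeState is a hypothetical stand-in for per-resource-type stream state.
type typeState struct {
	subscribed []string
}

// errNoState is returned when no state exists for a resource type.
var errNoState = errors.New("no state exists for resource type")

// namesFor returns a copy of the subscribed names for typ. Instead of
// panicking when the state is missing, it returns an error so that the
// caller can log it and keep the stream usable.
func namesFor(states map[string]*typeState, typ string) ([]string, error) {
	state, ok := states[typ]
	if !ok {
		return nil, fmt.Errorf("%w: %q", errNoState, typ)
	}
	return append([]string(nil), state.subscribed...), nil
}

func main() {
	states := map[string]*typeState{"LDS": {subscribed: []string{"L0"}}}
	if _, err := namesFor(states, "CDS"); err != nil {
		fmt.Println("error:", err)
	}
	names, _ := namesFor(states, "LDS")
	fmt.Println(names)
}
```

Whether the caller should propagate the error or swallow it after logging is the open question raised in the comment; the sketch leaves that choice to the caller.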

dfawley · 2025-10-17T18:23:11Z

internal/xds/clients/xdsclient/ads_stream.go


+	// Clear any queued requests. Previously subscribed to resources will be
+	// resent below.
+	s.requestCh.Reset()


Do we need to mandate that s.mu is held when calling Put or Reset?

Maybe we should wrap accesses to it somehow to guarantee that?
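
One way to picture the wrapping idea is a small queue type that owns its lock, so callers cannot forget to hold it. This is a hypothetical sketch, not a proposal for the actual `Unbounded` buffer or `adsStreamImpl`: `requestQueue` and its methods `put`, `replace`, and `pop` are invented for this example, and requests are plain strings.

```go
package main

import (
	"fmt"
	"sync"
)

// requestQueue is a hypothetical wrapper that serializes every access to
// the queued requests behind its own mutex.
type requestQueue struct {
	mu    sync.Mutex
	items []string
}

// put appends a request to the queue.
func (q *requestQueue) put(r string) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.items = append(q.items, r)
}

// replace atomically discards everything queued and enqueues reqs, so a
// concurrent put cannot land between the clear and the re-queue.
func (q *requestQueue) replace(reqs ...string) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.items = append(q.items[:0], reqs...)
}

// pop removes and returns the oldest request, if any.
func (q *requestQueue) pop() (string, bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.items) == 0 {
		return "", false
	}
	r := q.items[0]
	q.items = q.items[1:]
	return r, true
}

func main() {
	q := &requestQueue{}
	q.put("stale request")
	// On stream restart: drop stale requests and re-send subscriptions.
	q.replace("LDS:[L0]", "CDS:[C0]")
	for r, ok := q.pop(); ok; r, ok = q.pop() {
		fmt.Println(r)
	}
}
```

Folding the clear and the re-queue into one method is the design choice worth noting: it removes the window in which another goroutine could interleave a write between the two steps.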

xdsclient: stop batching writes on the ADS stream

54c1664

easwars added the Type: Bug and Area: xDS labels Oct 3, 2025

easwars added this to the 1.77 Release milestone Oct 3, 2025

easwars requested review from arjan-bal and dfawley October 7, 2025 23:32

easwars assigned arjan-bal and dfawley Oct 9, 2025

arjan-bal assigned easwars and unassigned arjan-bal Oct 13, 2025

easwars assigned arjan-bal and unassigned easwars Oct 13, 2025

arjan-bal approved these changes Oct 16, 2025

arjan-bal removed their assignment Oct 16, 2025

initial review comments from Arjan

8695d83

dfawley reviewed Oct 17, 2025

3 participants