-
Notifications
You must be signed in to change notification settings - Fork 4.6k
xdsclient: stop batching writes on the ADS stream #8627
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
xdsclient: stop batching writes on the ADS stream #8627
Conversation
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #8627 +/- ##
==========================================
- Coverage 82.13% 81.51% -0.62%
==========================================
Files 415 416 +1
Lines 40711 40894 +183
==========================================
- Hits 33437 33335 -102
- Misses 5897 6131 +234
- Partials 1377 1428 +51
🚀 New features to boost your workflow:
|
Could you explain how the new logic prevents the following race condition that can cause a channel to briefly enter
How does this PR ensure that the watcher for Channel 1 isn't incorrectly notified of a missing resource in this situation? |
The above condition that you mention will not lead to Channel 1 entering
In fact this is case that can happen even without any of the race conditions involving the xDS client and this is also explicitly mentioned in the envoy xDS documentation here: https://www.envoyproxy.io/docs/envoy/latest/api-docs/xds_protocol#knowing-when-a-requested-resource-does-not-exist Hope this helps. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
return | ||
} | ||
|
||
b.backlog = nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Will setting b.backlog = b.backlog[:0]
be better since it avoids re-allocation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Thanks.
// resource is stopped if one is active. A discovery request is sent out on the | ||
// stream for the resource type when there is sufficient flow control quota. | ||
// stream for the resource type with the updated set of resource names. | ||
func (s *adsStreamImpl) Unsubscribe(typ ResourceType, name string) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I noticed that the subscribe
method is unexported, while Unsubscribe
is exported. Maybe we should unexport Unsubscribe
too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
I see. The difference in the scenario I asked about and the problem caused by the initial fix is the presence of the LDS resource in the cache. If L0 is present in the cache but not in the latest update from the management server, the channel is put in TF. However, if L0 is not present in the cache, we wait for the expiry timer to expire. |
Though the PR description doesn't explicitly call out how the change addresses the race condition, my understanding is that the race is caused because the writes for subscribe and unsubscribe calls are batched. After batching, it may appear that nothing has changed in the list of subscribed resources, even though a resource was removed and added again. By removing batching of writes, the management server will see the unsubscription and re-subscription of the resource and send the necessary update. |
That is true. And the first part about the resource being in the cache, but not in a response from the management server resulting in the channel moving to TF applies only to LDS and CDS. For RDS and EDS, if a similar scenario happens, we would simply continue using the resource in the cache. |
Your understanding is correct and I've clarified the exact underlying issue causing the race and how the change addresses them in the PR description. Thanks. |
// | ||
// It's expected to be used in scenarios where the buffered data is no longer | ||
// relevant, and needs to be cleared. | ||
func (b *Unbounded) Reset() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is inherently racy with other calls to Put()
(and Load()
, of course). I hope we are using it correctly! :) And if we have an external lock that is used for all things that call Put and Reset, I wonder if a different data structure might be more efficient?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is inherently racy with other calls to Put() (and Load(), of course). I hope we are using it correctly!
I had the same question. My understanding is that Reset acts as an optimization to avoid redundant requests, but it isn't required for correctness.
Even so, you're right about the potential for races. I think adding a godoc comment to warn about this is a good idea.
state, ok := s.resourceTypeState[typ] | ||
if !ok { | ||
// State is created when the first subscription for this type is made. | ||
panic(fmt.Sprintf("no state exists for resource type %v", typ)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want a production panic? Or should this be log.Errorf()
(which fails all tests) and a return <some error>
(or nil
since the stream is still usable, depending on what it means to return an error?)? The latter feels safer.
|
||
// Clear any queued requests. Previously subscribed to resources will be | ||
// resent below. | ||
s.requestCh.Reset() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to mandate that s.mu
is held when calling Put
or Reset
?
Maybe we should wrap accesses to it somehow to guarantee that?
Fixes #8125
The original race in the xDS client:
The original fix:
Delay the resource's removal from the cache until the unsubscribe request was transmitted over the wire, a change implemented in #8369. However, this solution introduced new complications:
The root cause of the previous seen races can be put down two things:
How does this PR address the above issue?
This PR simplifies the implementation of the ADS stream by removing two pieces of functionality
A
,B
, andC
, the stream would now send three requests:[A]
,[A B]
,[A B C]
.RELEASE NOTES: