-
Notifications
You must be signed in to change notification settings - Fork 4.6k
xdsclient: Fix race in SetWatchExpiryTimeoutForTesting #8526
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Test will be added in #8483 along with the fix for LRS Client because the test fails for both race conditions. |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #8526 +/- ##
==========================================
- Coverage 82.02% 81.98% -0.05%
==========================================
Files 412 413 +1
Lines 40465 40523 +58
==========================================
+ Hits 33191 33222 +31
- Misses 5887 5910 +23
- Partials 1387 1391 +4
🚀 New features to boost your workflow:
|
The expiry timeout seems like something that should only be set at the beginning of the test and not something that needs to be configured per xDS client. If so, we can have grpc-go/internal/xds/clients/xdsclient/internal/internal.go Lines 24 to 26 in fa0d658
Removing the setter would give the test control over when the variable is written, making it easier to avoid races. It also avoids surprises when the timeout being passed is not used because a cached client is returned. |
internal/xds/balancer/clusterresolver/e2e_test/aggregate_cluster_test.go
Outdated
Show resolved
Hide resolved
internal/xds/balancer/clusterresolver/e2e_test/aggregate_cluster_test.go
Outdated
Show resolved
Hide resolved
func (c *XDSClient) SetWatchExpiryTimeoutForTesting(watchExpiryTimeout time.Duration) { | ||
c.watchExpiryTimeout = watchExpiryTimeout | ||
// with provided timeout value and returns a function to reset the timeout to the | ||
// original value. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should also mention that this function should not be called concurrently. It should be called before making any RPCs to avoid races.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
// | ||
// Note: This function should not be called concurrently and must be called | ||
// before any RPCs to avoid race conditions. | ||
func SetWatchExpiryTimeoutForTesting(watchExpiryTimeout time.Duration) func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I had in mind was to get rid of this function completely. Instead do the following:
- In
Pool.NewClientForTesting
, set the value ofWatchExpiryTimeout
defined ininternal/xds/clients/xdsclient/internal/internal.go
, and revert the value as part of the returned cancel function.- Now, if you cannot do that because the value of defined in an internal package and you don't have access to it from the package where
Pool.NewClientForTesting
is defined, please add aSetWatchExpiryTimeoutForTesting
in the external xdsclient so that it is accessible from here. And in the docstring of that function, please specify that this method should be called before creating the xDS client.
- Now, if you cannot do that because the value of defined in an internal package and you don't have access to it from the package where
- If you end up adding the above specified function, you can even unexport the
WatchExpiryTimeout
value as you now have a way to write to it through a function.
With the above approach:
- Existing tests would just have to set the
WatchExpiryTimeout
field inside ofOptionsForTesting
when callingNewClientForTesting
and will not have to deal with resetting the value, as calling the returned cancel func would end up doing that.
Please let me know if this sounds OK to you. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We actually did try that approach earlier, i.e. this function SetWatchExpiryTimeoutForTesting
will do what it does now but will be called from Pool.NewClientForTesting
and get reset in the cancel function but it is causing a race between reseting the variable and setting it in a new client impl.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see what you are saying. How about the following approach then:
- The generic xDS client currently takes a
Config
struct as part of itsNew
function. See: https://github.com/grpc/grpc-go/blob/master/internal/xds/clients/xdsclient/xdsconfig.go#L28- We should add a couple of fields in there:
- Watch expiry timeout. Clearly document that most users will not need to set a value for this and that we will use a default value of 15 seconds. And also maybe link to this: https://www.envoyproxy.io/docs/envoy/latest/api-docs/xds_protocol#knowing-when-a-requested-resource-does-not-exist
- Stream backoff function that returns the duration to backoff when an ADS stream fails without recieving a single response from the server. Again clearly document that most users will not need to set a value for this and that we will use a default exponential backoff as specified here: https://github.com/grpc/grpc/blob/master/doc/connection-backoff.md. Mentioning this in the docs might also be an OK thing to do, but I'm not certain of it: https://github.com/grpc/proposal/blob/master/A57-xds-client-failure-mode-behavior.md#handling-ads-stream-termination
- We should add a couple of fields in there:
- From
pool.NewClient
andpool.NewClientForTesting
, we need to pass the above two pieces of information topool.newRefCounted
and from there tonewClientImpl
- In
newClientImpl
, we need to set the two new fields in the xds config passed to thexdsclient.New
With this approach, the watch expiry timeout will be truly a per-client setting and not something that will be shared with other clients within the same process.
Also, with this approach, we can get rid of the XDSClient.SetWatchExpiryTimeoutForTesting
, and also the two variables defined in https://github.com/grpc/grpc-go/blob/master/internal/xds/clients/xdsclient/internal/internal.go.
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This works , it wont cause any race since its specific for each client. I have made the change for the watchExpiryTimeout. Do we also make the change for Stream backoff , because it is being override only testwise and only in one test file and not causing any races?
// WatchExpiryTimeout is the timeout for xDS resource watch expiry. If | ||
// unspecified, uses the default value used in non-test code. | ||
WatchExpiryTimeout time.Duration |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd rather that we not delete this field and instead use this to set the value in the generic xDS client (either directly or through a setter provided by the latter).
db0719e
to
fd6c6f8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. modulo minor nits
Please update PR description as it becomes part of our git commit history. |
Fixes: #8525
There is a race in SetWatchExpiryTimeoutForTesting which is used to override the watch expiry timeout of XDSClient for testing. Currently it just sets the watchExpiryTimeout of the XDSClient to the provided value without a mutex each time we call NewClientForTesting which might of might not create a new XDSClient if one is already there.
Fix : Add a new field
WatchExpiryTimeout
to the xdsclient config which will now be used instead ofinternal.WatchExpiryTImeout
RELEASE NOTES: None