eshitachandwani
Member

@eshitachandwani eshitachandwani commented Aug 20, 2025

Fixes: #8525

There is a race in SetWatchExpiryTimeoutForTesting, which is used to override the watch expiry timeout of the XDSClient for testing. Currently, it sets the watchExpiryTimeout of the XDSClient to the provided value, without holding a mutex, each time NewClientForTesting is called. That call might or might not create a new XDSClient, depending on whether one already exists.

Fix: Add a new field, WatchExpiryTimeout, to the xdsclient config, which is now used instead of internal.WatchExpiryTimeout.

RELEASE NOTES: None

@eshitachandwani eshitachandwani added this to the 1.76 Release milestone Aug 20, 2025
@eshitachandwani eshitachandwani added the Type: Bug and Area: xDS (Includes everything xDS related, including LB policies used with xDS.) labels Aug 20, 2025
@eshitachandwani
Member Author

A test will be added in #8483, along with the fix for the LRS client, because the test fails on both race conditions.

codecov bot commented Aug 20, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 81.98%. Comparing base (3074bcd) to head (99f6d0d).
⚠️ Report is 3 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8526      +/-   ##
==========================================
- Coverage   82.02%   81.98%   -0.05%     
==========================================
  Files         412      413       +1     
  Lines       40465    40523      +58     
==========================================
+ Hits        33191    33222      +31     
- Misses       5887     5910      +23     
- Partials     1387     1391       +4     
Files with missing lines Coverage Δ
internal/xds/clients/xdsclient/xdsclient.go 80.34% <100.00%> (-1.32%) ⬇️
internal/xds/clients/xdsclient/xdsconfig.go 100.00% <ø> (ø)
internal/xds/xdsclient/clientimpl.go 82.20% <100.00%> (+0.15%) ⬆️
internal/xds/xdsclient/pool.go 75.37% <100.00%> (-0.19%) ⬇️

... and 23 files with indirect coverage changes

@arjan-bal
Contributor

arjan-bal commented Aug 20, 2025

The expiry timeout seems like something that should only be set at the beginning of the test and not something that needs to be configured per xDS client. If so, we can have SetWatchExpiryTimeoutForTesting be a function instead of a method on XDSClient and have it change the global timeout variable.

// WatchExpiryTimeout is the watch expiry timeout for xDS client. It can be
// overridden by tests to change the default watch expiry timeout.
WatchExpiryTimeout time.Duration

Removing the setter would give the test control over when the variable is written, making it easier to avoid races. It also avoids surprises when the timeout being passed is not used because a cached client is returned.

func (c *XDSClient) SetWatchExpiryTimeoutForTesting(watchExpiryTimeout time.Duration) {
c.watchExpiryTimeout = watchExpiryTimeout
// with provided timeout value and returns a function to reset the timeout to the
// original value.
Contributor

We should also mention that this function should not be called concurrently. It should be called before making any RPCs to avoid races.

Member Author

Done.

Contributor

@arjan-bal arjan-bal left a comment

LGTM

@arjan-bal arjan-bal removed their assignment Aug 22, 2025
@easwars easwars changed the title from "xds/xdsclient: Fix race in SetWatchExpiryTimeoutForTesting" to "xdsclient: Fix race in SetWatchExpiryTimeoutForTesting" Aug 25, 2025
//
// Note: This function should not be called concurrently and must be called
// before any RPCs to avoid race conditions.
func SetWatchExpiryTimeoutForTesting(watchExpiryTimeout time.Duration) func() {
Contributor

What I had in mind was to get rid of this function completely. Instead do the following:

  • In Pool.NewClientForTesting, set the value of WatchExpiryTimeout defined in internal/xds/clients/xdsclient/internal/internal.go, and revert the value as part of the returned cancel function.
    • Now, if you cannot do that because the value is defined in an internal package and you don't have access to it from the package where Pool.NewClientForTesting is defined, please add a SetWatchExpiryTimeoutForTesting in the external xdsclient so that it is accessible from here. And in the docstring of that function, please specify that it should be called before creating the xDS client.
  • If you end up adding the above specified function, you can even unexport the WatchExpiryTimeout value as you now have a way to write to it through a function.

With the above approach:

  • Existing tests would just have to set the WatchExpiryTimeout field inside of OptionsForTesting when calling NewClientForTesting and will not have to deal with resetting the value, as calling the returned cancel func would end up doing that.

Please let me know if this sounds OK to you. Thanks.

Member Author

We actually tried that approach earlier: SetWatchExpiryTimeoutForTesting did what it does now, but was called from Pool.NewClientForTesting and reset in the cancel function. However, that causes a race between resetting the variable and setting it in a new client impl.

Contributor

I see what you are saying. How about the following approach then:

With this approach, the watch expiry timeout will be truly a per-client setting and not something that will be shared with other clients within the same process.

Also, with this approach, we can get rid of the XDSClient.SetWatchExpiryTimeoutForTesting, and also the two variables defined in https://github.com/grpc/grpc-go/blob/master/internal/xds/clients/xdsclient/internal/internal.go.

What do you think?

Member Author

This works; it won't cause any race since it's specific to each client. I have made the change for the watchExpiryTimeout. Should we also make the change for the stream backoff? It is overridden only in tests, in a single test file, and is not causing any races.

Comment on lines 64 to 66
// WatchExpiryTimeout is the timeout for xDS resource watch expiry. If
// unspecified, uses the default value used in non-test code.
WatchExpiryTimeout time.Duration
Contributor

I'd rather that we not delete this field and instead use this to set the value in the generic xDS client (either directly or through a setter provided by the latter).

@easwars easwars assigned eshitachandwani and unassigned easwars Aug 25, 2025
Contributor

@easwars easwars left a comment

LGTM, modulo minor nits.

@easwars
Contributor

easwars commented Aug 29, 2025

Please update the PR description, as it becomes part of our git commit history.

@easwars easwars assigned eshitachandwani and unassigned easwars Aug 29, 2025
@eshitachandwani eshitachandwani merged commit 29ba001 into grpc:master Aug 30, 2025
17 checks passed
Labels
Area: xDS Includes everything xDS related, including LB policies used with xDS. Type: Bug
Development

Successfully merging this pull request may close these issues.

xds/xdsclient: race in SetWatchExpiryTimeoutForTesting
3 participants