Skip to content

Conversation

@arjan-bal
Copy link
Contributor

When a resolver fails to be created, the current behavior is inconsistent and makes debugging difficult (see discussion on #8602 (comment)):

  • Dial: Immediately returns an error.
  • NewClient: Logs an INFO message only when an RPC is attempted, while the channel remains in an IDLE state. This can hide the underlying configuration issue.

This change unifies the behavior by transitioning the channel to TRANSIENT_FAILURE upon a resolver creation failure. As a result, RPCs will fail immediately with the specific error produced by the resolver builder, making the problem easier to diagnose.

RELEASE NOTES:

  • client: When a resolver fails to be created, the channel now transitions to TRANSIENT_FAILURE instead of staying IDLE.
  • client: The deprecated Dial function no longer returns an error on resolver creation failure, instead the channel is created and it enters TRANSIENT_FAILURE state.

@arjan-bal arjan-bal added Type: Bug Area: Resolvers/Balancers Includes LB policy & NR APIs, resolver/balancer/picker wrappers, LB policy impls and utilities. Type: Behavior Change Behavior changes not categorized as bugs and removed Type: Bug Type: Behavior Change Behavior changes not categorized as bugs labels Oct 13, 2025
@arjan-bal arjan-bal added this to the 1.77 Release milestone Oct 13, 2025
@codecov
Copy link

codecov bot commented Oct 13, 2025

Codecov Report

❌ Patch coverage is 31.81818% with 15 lines in your changes missing coverage. Please review.
✅ Project coverage is 78.50%. Comparing base (8110884) to head (a14e7c5).
⚠️ Report is 1 commits behind head on master.

Files with missing lines Patch % Lines
resolver_wrapper.go 28.57% 12 Missing and 3 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8643      +/-   ##
==========================================
- Coverage   79.45%   78.50%   -0.95%     
==========================================
  Files         415      415              
  Lines       41339    41556     +217     
==========================================
- Hits        32844    32623     -221     
- Misses       6621     6843     +222     
- Partials     1874     2090     +216     
Files with missing lines Coverage Δ
clientconn.go 60.62% <100.00%> (-29.19%) ⬇️
resolver_wrapper.go 59.14% <28.57%> (-32.33%) ⬇️

... and 20 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@arjan-bal arjan-bal requested review from dfawley and easwars October 13, 2025 06:46
@arjan-bal arjan-bal added Type: Behavior Change Behavior changes not categorized as bugs and removed Type: Bug labels Oct 13, 2025
@dfawley
Copy link
Member

dfawley commented Oct 13, 2025

Is it possible to maintain the existing behavior for Dial while still getting the other improvements for existing clients exiting idle mode? Generally, resolver.Build isn't supposed to fail, so maybe it's not critical, but I'd rather avoid the behavior change if possible.

And it seems like we need a regression test for the case where Build fails and an RPC is attempted, to ensure the error message is propagated as we intend.

Also, will the channel still go back into idle mode correctly upon the idle timeout being reached? We might want to have a test for this, too.

@dfawley dfawley assigned arjan-bal and unassigned dfawley Oct 13, 2025
@easwars
Copy link
Contributor

easwars commented Oct 13, 2025

I remember us discussing (with Eric) the option of Build never returning an error, but returning an ErroringResolver when things fail? With this option, the ErroringResolver would send the error to the channel and have RPCs fail with that error.

And if we have a utility for the ErroringResolver that individual resolver implementations can use, it will make their lives easier. The channel could also use it for resolver implementations that are not yet updated and instead return a non-nil error from Build.

Also, will the channel still go back into idle mode correctly upon the idle timeout being reached? We might want to have a test for this, too.

What is the benefit of transitioning back to idle in this case?

@dfawley
Copy link
Member

dfawley commented Oct 13, 2025

I remember us discussing (with Eric) the option of Build never returning an error, but returning an ErroringResolver when things fail? With this option, the ErroringResolver would send the error to the channel and have RPCs fail with that error.

That would be a large API change that I'd rather not mess with unless it's part of #6472. This effectively does exactly that when a resolver returns an error, so I think it's completely equivalent in functionality.

What is the benefit of transitioning back to idle in this case?

Consistency, I guess? The resolver shouldn't be doing non-static things while building, so it should presumably continue to fail to build no matter what. But we also don't document this, really, so it's also possible users are doing weird things in custom resolvers, and having a later opportunity to succeed might be important to them.

@arjan-bal arjan-bal force-pushed the enter-tf-resolver-build-failure branch from ddf60e3 to a14e7c5 Compare October 14, 2025 07:10
@arjan-bal
Copy link
Contributor Author

After investigating, I confirmed that RPCs already fail with the correct error message. The connection is attempted when the stream creation function calls exitIdle, as seen here:

grpc-go/stream.go

Lines 180 to 186 in 2d92271

func newClientStream(ctx context.Context, desc *StreamDesc, cc *ClientConn, method string, opts ...CallOption) (_ ClientStream, err error) {
// Start tracking the RPC for idleness purposes. This is where a stream is
// created for both streaming and unary RPCs, and hence is a good place to
// track active RPC count.
if err := cc.idlenessMgr.OnCallBegin(); err != nil {
return nil, err
}

Therefore, no changes are needed to propagate the error. While cc.Connect itself cannot return an error due to its signature, any resolver build failure is correctly surfaced to the user during the next RPC.

@arjan-bal arjan-bal closed this Oct 14, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Area: Resolvers/Balancers Includes LB policy & NR APIs, resolver/balancer/picker wrappers, LB policy impls and utilities. Type: Behavior Change Behavior changes not categorized as bugs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants