least_request LB strategy causes full TPS drop when upstream endpoint hangs #12237

### Is your feature request related to a problem?
**Yes.** The _least_request_ load balancing strategy can cause **a complete TPS drop** when a single upstream endpoint hangs. This occurs due to two primary factors:

1. Long request timeouts (for example, 30 seconds or more) make this much worse, because requests sent to the hung endpoint stay outstanding until the timeout expires.
2. As shown in [LeastRequestLoadBalancer$ReadyPicker.nextChildToUse()](https://github.com/grpc/grpc-java/blob/master/xds/src/main/java/io/grpc/xds/LeastRequestLoadBalancer.java), the N_CHOICES selection method randomly picks two endpoints. It may select the same unhealthy endpoint twice (instead of two distinct endpoints).

When this occurs, all traffic is routed to the hung endpoint, causing a full service degradation, which is unacceptable.
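
To make the duplicate-sample case concrete, here is a minimal sketch of N-choices sampling with replacement. It is not the grpc-java code: the class name, the method names, and the `IntUnaryOperator` used in place of a random number generator are assumptions made so the behavior can be reproduced deterministically.

```java
import java.util.function.IntUnaryOperator;

// Hypothetical sketch, not the grpc-java implementation.
final class WithReplacementSketch {
  /**
   * Draws choiceCount indices independently (so duplicates are possible) and
   * returns the drawn index with the fewest in-flight requests.
   * nextInt maps an exclusive upper bound to an index in [0, bound).
   */
  static int pick(int[] inFlight, int choiceCount, IntUnaryOperator nextInt) {
    int candidate = nextInt.applyAsInt(inFlight.length);
    for (int i = 0; i < choiceCount - 1; i++) {
      int sampled = nextInt.applyAsInt(inFlight.length);
      if (inFlight[sampled] < inFlight[candidate]) {
        candidate = sampled;
      }
    }
    return candidate;
  }

  /**
   * Probability that all choiceCount draws land on one given endpoint,
   * assuming each draw is uniform and independent.
   */
  static double probabilityAllDrawsHitEndpoint(int endpoints, int choiceCount) {
    return Math.pow(1.0 / endpoints, choiceCount);
  }
}
```

With in-flight counts `{0, 0, 500}` and a source that returns index 2 on both draws, `pick` returns 2, even though that endpoint has by far the most outstanding requests. The comparison never sees a healthy endpoint, so the in-flight count cannot protect against the hung one.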

### Describe the solution you'd like
1. Support the FULL_SCAN mode of the xDS LEAST_REQUEST load balancer policy, which would check all endpoints before picking one.
2. Adjust the N_CHOICES algorithm so that it cannot pick the same endpoint twice, for example by recording which endpoints have already been sampled, or by some other means.
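
Both proposals could look roughly like the sketch below. It is only an illustration under stated assumptions: the class and method names are hypothetical, ties in the full scan go to the lowest index (a real policy may break ties differently), and distinct sampling is done with a partial Fisher-Yates shuffle, which is one option among several.

```java
import java.util.function.IntUnaryOperator;

// Hypothetical sketch of the two proposals, not an existing grpc-java API.
final class LeastRequestProposalsSketch {
  /** FULL_SCAN style: examine every endpoint; lowest index wins ties (assumption). */
  static int pickFullScan(int[] inFlight) {
    int best = 0;
    for (int i = 1; i < inFlight.length; i++) {
      if (inFlight[i] < inFlight[best]) {
        best = i;
      }
    }
    return best;
  }

  /**
   * N-choices without replacement: a partial Fisher-Yates shuffle guarantees
   * that the sampled indices are distinct.
   * nextInt maps an exclusive upper bound to an index in [0, bound).
   */
  static int pickDistinct(int[] inFlight, int choiceCount, IntUnaryOperator nextInt) {
    int n = inFlight.length;
    int[] order = new int[n];
    for (int i = 0; i < n; i++) {
      order[i] = i;
    }
    int draws = Math.min(choiceCount, n);
    int best = -1;
    for (int i = 0; i < draws; i++) {
      // Swap a not-yet-sampled index into position i.
      int j = i + nextInt.applyAsInt(n - i);
      int tmp = order[i];
      order[i] = order[j];
      order[j] = tmp;
      if (best < 0 || inFlight[order[i]] < inFlight[best]) {
        best = order[i];
      }
    }
    return best;
  }
}
```

With in-flight counts `{0, 0, 500}`, `pickFullScan` returns 0. `pickDistinct` with two choices always compares the hung endpoint against a healthy one whenever it samples it, so it cannot return index 2. The trade-off is that this version allocates an array per pick, so a real implementation would likely want something cheaper, such as redrawing on a duplicate.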
