least_request LB strategy causes full TPS drop when an upstream endpoint hangs #12237

@jiangzhchyeah

Description

Is your feature request related to a problem?

Yes. The least_request load balancing strategy can cause a complete TPS drop when a single upstream endpoint hangs. This occurs due to two primary factors:

  1. Long request timeouts (for example, 30 seconds or more) make the impact much worse, because requests sent to the hung endpoint stay blocked until they time out.
  2. As shown in LeastRequestLoadBalancer$ReadyPicker.nextChildToUse(), the N_CHOICES selection method randomly picks two endpoints. Nothing prevents it from selecting the same unhealthy endpoint twice (instead of two distinct endpoints).

When this occurs, all traffic is routed to the hung endpoint, causing full service degradation, which is unacceptable.
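
To make the second factor concrete, here is a minimal, self-contained model of N-choices sampling with replacement. It is a simplified sketch and not the actual grpc-java code: the class name WithReplacementPicker, the plain int[] of in-flight counts, and the helper countPicksOf are all hypothetical. Assuming each draw is uniform and independent, with n endpoints and two choices both draws land on the same endpoint with probability 1/n², so an endpoint with a huge in-flight count is still picked for roughly that fraction of requests.

```java
import java.util.Random;

/** Simplified model of N-choices sampling WITH replacement (illustrative only). */
final class WithReplacementPicker {

    /** Returns the index of the sampled endpoint with the fewest in-flight requests. */
    static int pick(int[] inFlight, int choiceCount, Random random) {
        int candidate = random.nextInt(inFlight.length);
        for (int i = 0; i < choiceCount - 1; i++) {
            // Independent draw: nothing stops it from repeating an earlier draw.
            int sampled = random.nextInt(inFlight.length);
            if (inFlight[sampled] < inFlight[candidate]) {
                candidate = sampled;
            }
        }
        return candidate;
    }

    /** Counts how many of the given number of picks land on endpoint hungIndex. */
    static int countPicksOf(int hungIndex, int[] inFlight, int choiceCount,
                            int trials, long seed) {
        Random random = new Random(seed);
        int hits = 0;
        for (int t = 0; t < trials; t++) {
            if (pick(inFlight, choiceCount, random) == hungIndex) {
                hits++;
            }
        }
        return hits;
    }
}
```

For example, with in-flight counts {1000, 0, 0, 0} and two choices, countPicksOf(0, ...) should come out near trials / 16. Each of those requests then waits for the full timeout, which is why long timeouts amplify the effect.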

Describe the solution you'd like

  1. Support for the FULL_SCAN mode of the xDS LEAST_REQUEST load balancer policy, which would check all endpoints before picking one.
  2. Adjust the N_CHOICES algorithm so that it cannot pick the same endpoint twice, for example by recording which endpoints have already been chosen.
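
Both proposals can be sketched in a few lines. This is only an illustration under simplifying assumptions, not a patch against grpc-java: the class DistinctChoicePicker and its methods are hypothetical, endpoints are modeled as a non-empty int[] of in-flight counts, and distinct sampling is done with a partial Fisher-Yates shuffle (one possible way to "record which endpoints were already chosen").

```java
import java.util.Random;

/** Hypothetical sketch of the two proposed selection strategies. */
final class DistinctChoicePicker {

    /**
     * N_CHOICES without replacement: samples min(choiceCount, n) DISTINCT
     * endpoints and returns the index of the one with the fewest in-flight requests.
     */
    static int pickDistinct(int[] inFlight, int choiceCount, Random random) {
        int n = inFlight.length;
        int k = Math.min(choiceCount, n);
        int[] order = new int[n];
        for (int i = 0; i < n; i++) {
            order[i] = i;
        }
        int best = -1;
        for (int i = 0; i < k; i++) {
            // Partial Fisher-Yates: swap a random not-yet-chosen index into slot i.
            int j = i + random.nextInt(n - i);
            int tmp = order[i];
            order[i] = order[j];
            order[j] = tmp;
            if (best < 0 || inFlight[order[i]] < inFlight[best]) {
                best = order[i];
            }
        }
        return best;
    }

    /** FULL_SCAN-style selection: checks every endpoint, returns the first minimum. */
    static int pickFullScan(int[] inFlight) {
        int best = 0;
        for (int i = 1; i < inFlight.length; i++) {
            if (inFlight[i] < inFlight[best]) {
                best = i;
            }
        }
        return best;
    }
}
```

With at least two endpoints and two choices, pickDistinct always compares the hung endpoint against a different one, so it is never returned while some other endpoint has fewer in-flight requests. The sketch allocates an index array per pick, which costs O(n); a real implementation would likely avoid that, and a real full scan would probably break ties randomly rather than always returning the first minimum.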
