
[Feat] SLO-Aware Routing with Latency Prediction #1323

@BenjaminBraunDev

Description


SLO-Aware Routing is a strategy to satisfy per-request TTFT (time to first token) and TPOT (time per output token) latency SLOs, leveraging latency prediction to optimize pod selection.

We propose several experimental changes to the request flow, including a new scheduling profile, training and prediction sidecars, and live request tracking for each pod in the datastore.

EPP Deployment Requirements

  1. Include the latency prediction flag in the EPP runtime flags
  2. Add the latency prediction and training sidecars as containers to the deployment YAML
  3. Include two scheduling profiles: one for latency prediction only, and another for combined prediction and routing

Example scheduling profiles:

    apiVersion: inference.networking.x-k8s.io/v1alpha1
    kind: EndpointPickerConfig
    plugins:
    - type: queue-scorer
    - type: kv-cache-utilization-scorer
    - type: prefix-cache-scorer
    - type: slo-request-tracker
    - type: slo-scorer
    - type: slo-aware-profile-handler
    - type: weighted-random-picker
    schedulingProfiles:
    - name: default
      plugins:
      - pluginRef: slo-request-tracker
      - pluginRef: queue-scorer
      - pluginRef: kv-cache-utilization-scorer
      - pluginRef: prefix-cache-scorer
    - name: slo
      plugins:
      - pluginRef: prefix-cache-scorer
        weight: 0
      - pluginRef: slo-request-tracker
      - pluginRef: slo-scorer
      - pluginRef: weighted-random-picker

New Plugins:

  1. slo-aware-profile-handler
    Chooses the scheduling profile depending on the PredictionBasedRouting boolean header value.

  2. slo-request-tracker
    Handles the request-tracking logic for latency prediction: it adds requests to a pod's list of running requests as they are scheduled, and it interacts with the sidecars, making prediction requests and sending the required training data after requests complete.

  3. slo-scorer
    Performs the scoring for prediction-based routing: it scores each candidate pod based on the latency prediction for the request, so the best pod can be chosen to schedule the request to.

  4. weighted-random-picker
    A picker that chooses randomly among candidate pods, weighted by the scores assigned to them by the scorers.
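
As a rough sketch of the picker step only (the type and function names below are assumptions, not the actual EPP plugin interfaces), a weighted random draw over scored pods could look like this in Go:

    package picker

    import "math/rand"

    // scoredPod is a hypothetical pairing of a pod with the total weight
    // assigned to it by the scorers.
    type scoredPod struct {
        Name  string
        Score float64
    }

    // pickWeightedRandom returns one pod, drawn with probability proportional
    // to its score. It assumes scores are non-negative and at least one is
    // positive.
    func pickWeightedRandom(pods []scoredPod, rng *rand.Rand) scoredPod {
        total := 0.0
        for _, p := range pods {
            total += p.Score
        }
        draw := rng.Float64() * total
        for _, p := range pods {
            draw -= p.Score
            if draw <= 0 {
                return p
            }
        }
        return pods[len(pods)-1] // guard against floating-point rounding
    }

The scorers decide what the weights mean (for example, headroom-based scores from slo-scorer); the picker only turns them into a probability distribution.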

Request Flow

The request flow will be as follows:

  1. Request is received by the gateway and may carry TTFT and TPOT SLOs and a PredictionBasedRouting flag in the header.
  2. slo-aware-profile-handler checks the PredictionBasedRouting header: if true, proceed with an SLO-aware scheduling profile instead of the default:
    a. For each potential pod, run latency prediction and store it in memory along the request path.
    b. Identify "valid" pods: pods predicted to serve the request within its SLO and within the SLOs of all of their running requests (checked against each pod's list of running requests), or pods with no running requests.

Prediction-based routing (done in the slo-scorer plugin; a sketch follows this flow):

  • If len(valid_pods) > 0: return a weighted random draw (done by weighted-random-picker) favoring pods with the lowest OR highest positive headroom, based on a "scheduling strategy" environment variable:
    - Lowest: assign to pods that have just enough resources to meet the SLO, preserving pods with high headroom for large critical requests.
    - Highest: assign to pods that have substantial resources to meet the SLO, so as to evenly distribute load.
    (With either option, there may be a very small chance of choosing an invalid pod, for exploration for training purposes.)
  • Else, if len(valid_pods) == 0:
    - If the request is critical: return a weighted random draw favoring pods with the least negative headroom (the least "overwhelmed" pods among those not predicted to meet the SLO).
    - Else, if the request is NOT critical: shed the request [Saturation Detection]
  3. Once a pod is decided, store the request with its predicted TTFT/TPOT in the datastore under that pod's running requests.
  4. Forward the request to the selected pod endpoint; slo-request-tracker adds it to that pod's running-request queue in the datastore.
  5. Continuously add the history of actual and predicted latencies to the running requests on the pod in the datastore.
  6. Once a request completes, slo-request-tracker removes it from the queue.
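
To make the branching in the prediction-based routing step above concrete, here is a minimal sketch of the slo-scorer decision, assuming a hypothetical podPrediction type and a strategy string read from the "scheduling strategy" environment variable; the real plugin API and names will differ:

    package scorer

    import "sort"

    // podPrediction is a hypothetical view of one candidate pod: its name and
    // its headroom, i.e. the tightest SLO among its running requests minus the
    // predicted latency of the new request.
    type podPrediction struct {
        Name     string
        Headroom float64 // positive: predicted to fit within every SLO on the pod
    }

    // choosePods mirrors the branches above: order valid pods by the configured
    // strategy ("lowest" or "highest" headroom), fall back to the least-negative
    // headroom for critical requests, or signal that a non-critical request
    // should be shed.
    func choosePods(candidates []podPrediction, critical bool, strategy string) (favored []podPrediction, shed bool) {
        var valid, invalid []podPrediction
        for _, p := range candidates {
            if p.Headroom > 0 {
                valid = append(valid, p)
            } else {
                invalid = append(invalid, p)
            }
        }
        if len(valid) > 0 {
            sort.Slice(valid, func(i, j int) bool {
                if strategy == "highest" {
                    return valid[i].Headroom > valid[j].Headroom // spread load onto roomy pods
                }
                return valid[i].Headroom < valid[j].Headroom // just-enough headroom first
            })
            return valid, false
        }
        if critical {
            // No pod is predicted to meet the SLO: prefer the least overwhelmed one.
            sort.Slice(invalid, func(i, j int) bool { return invalid[i].Headroom > invalid[j].Headroom })
            return invalid, false
        }
        return nil, true // non-critical and no valid pod: shed the request
    }

In the real flow, the resulting ordering would be converted into weights and handed to weighted-random-picker, including the small exploration chance mentioned above.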

Diagram

[Request flow diagram]

Key Considerations

Headroom

We define headroom not as the difference between a specific request's SLO and its predicted latency, but rather as the difference between the lowest SLO among all requests running on the pod and the request's predicted latency.

Example:
Say we want to serve a request with an 80ms TPOT SLO.

pod-1 has a running request with a 50ms TPOT SLO (likely currently being served at <50ms TPOT). The predictor says that scheduling the new request on pod-1 would give it a predicted latency of 60ms TPOT, so the scorer considers this pod to have -10ms of headroom, not 20ms.

If there are no pods with positive headroom and this request is non-critical, then we shed it, as it is predicted to violate its SLO (i.e. no pod has positive headroom for it); but if it is critical, we schedule it to the pod with the least negative headroom.
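
As a worked version of this example, a hypothetical helper could compute TPOT headroom from the TPOT SLOs of the pod's running requests plus the new request's own SLO and predicted TPOT (including the new request's SLO in the minimum is an assumption here, consistent with the validity definition above):

    package scorer

    // tpotHeadroomMs computes TPOT headroom for a candidate pod: the tightest
    // TPOT SLO among the requests already running on the pod (and the new
    // request itself), minus the predicted TPOT of the new request.
    func tpotHeadroomMs(runningSLOs []float64, newSLO, predictedTPOT float64) float64 {
        lowest := newSLO
        for _, slo := range runningSLOs {
            if slo < lowest {
                lowest = slo
            }
        }
        return lowest - predictedTPOT
    }

    // The example above: one running request with a 50ms TPOT SLO, a new
    // request with an 80ms SLO predicted at 60ms TPOT:
    //   tpotHeadroomMs([]float64{50}, 80, 60) == -10   // not 80 - 60 = 20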

Pods in Datastore

We add predictions and the SLO to a new priority queue of running requests on each pod, with each entry storing:

  • Request ID
  • SLO
  • Predicted TTFT
  • Predicted TPOT
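
For illustration only, one way this per-pod queue could be represented in Go (field and type names are assumptions, not the final datastore schema), ordered so that the entry with the tightest SLO sits at the front:

    package datastore

    import (
        "container/heap"
        "time"
    )

    // runningRequest is an illustrative entry in a pod's queue of in-flight
    // requests.
    type runningRequest struct {
        RequestID     string
        SLO           time.Duration // latency target attached to the request
        PredictedTTFT time.Duration
        PredictedTPOT time.Duration
    }

    // requestQueue is a min-heap keyed on SLO, so the request with the tightest
    // SLO (the one that bounds the pod's headroom) is always at the front.
    type requestQueue []*runningRequest

    func (q requestQueue) Len() int           { return len(q) }
    func (q requestQueue) Less(i, j int) bool { return q[i].SLO < q[j].SLO }
    func (q requestQueue) Swap(i, j int)      { q[i], q[j] = q[j], q[i] }
    func (q *requestQueue) Push(x any)        { *q = append(*q, x.(*runningRequest)) }
    func (q *requestQueue) Pop() any {
        old := *q
        n := len(old)
        item := old[n-1]
        *q = old[:n-1]
        return item
    }

    // The queue can then be driven by the standard library's container/heap.
    var _ heap.Interface = (*requestQueue)(nil)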

Criticality

Criticality will be handled by the layer above scheduling, allowing the scheduler to focus on efficient bin-packing. The saturation detector will be responsible for shedding non-critical requests if SLOs cannot be met.

Prefix Score Reuse

The new SLO-aware profile can reuse the existing prefix score logic.

No SLO Provided

If the latency prediction flag is enabled in EPP, we require all requests to provide an SLO, and return an error otherwise. We decided on this behavior because a mixture of SLO and non-SLO requests ends up serving the non-SLO requests better, which is counterintuitive: by specifying a latency target you expect the algorithm to try to meet it. It is best to set this expectation as a requirement.
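
A minimal sketch of that admission check, using placeholder header names (the actual header contract is not defined by this issue):

    package requestcontrol

    import (
        "errors"
        "net/http"
    )

    // errMissingSLO is returned when latency prediction is enabled but a request
    // arrives without SLO headers. The header names below are placeholders.
    var errMissingSLO = errors.New("latency prediction is enabled: requests must set TTFT and TPOT SLO headers")

    func validateSLOHeaders(h http.Header, predictionEnabled bool) error {
        if !predictionEnabled {
            return nil // without prediction, SLO headers stay optional
        }
        if h.Get("x-slo-ttft-ms") == "" || h.Get("x-slo-tpot-ms") == "" {
            return errMissingSLO
        }
        return nil
    }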

v0 Implementation

For our initial implementation, the only feature we don't intend to incorporate is Saturation Detection inside the SLO profile, because upcoming flow-control changes will alter how criticality behaves and we want to maintain compatibility with that new system.
