Description
SLO Aware Routing is a strategy for satisfying per-request TTFT and TPOT latency SLOs by leveraging latency prediction to optimize pod selection.
We propose several experimental changes to the request flow, including a new scheduling profile, training and prediction sidecars, and live request tracking for each pod in the datastore.
EPP Deployment Requirements
- Include the latency prediction flag in EPP runtime flags
- Add latency and training sidecars as containers to deployment yaml
- Include two scheduling profiles: one for latency prediction only, and another for combined prediction and routing
Example scheduling profiles:
```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: EndpointPickerConfig
plugins:
- type: queue-scorer
- type: kv-cache-utilization-scorer
- type: prefix-cache-scorer
- type: slo-request-tracker
- type: slo-scorer
- type: slo-aware-profile-handler
- type: weighted-random-picker
schedulingProfiles:
- name: default
  plugins:
  - pluginRef: slo-request-tracker
  - pluginRef: queue-scorer
  - pluginRef: kv-cache-utilization-scorer
  - pluginRef: prefix-cache-scorer
- name: slo
  plugins:
  - pluginRef: prefix-cache-scorer
    weight: 0
  - pluginRef: slo-request-tracker
  - pluginRef: slo-scorer
  - pluginRef: weighted-random-picker
```
New Plugins:
- `slo-aware-profile-handler`: Chooses the scheduling profile depending on the PredictionBasedRouting boolean header value.
- `slo-request-tracker`: Handles the request tracking logic for latency prediction, adding requests to each pod's list of running requests as they are scheduled, and interacts with the sidecars, making prediction requests and sending the required training data after requests complete.
- `slo-scorer`: Performs the scoring for Prediction Based Routing; based on the latency prediction of a request for each pod, it chooses the best pod to schedule the request to.
- `weighted-random-picker`: A picker that chooses randomly among candidate pods based on the weights assigned to them by the scorers.
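As a rough illustration of the picker behavior, a weighted random draw over scored pods could look like the sketch below. The `ScoredPod` type and its fields are assumptions for illustration, not the actual plugin API.

```go
package picker

import "math/rand"

// ScoredPod is an illustrative placeholder, not the real EPP plugin type.
type ScoredPod struct {
	Name   string
	Weight float64 // non-negative weight assigned by the scorers
}

// pickWeightedRandom returns one pod, chosen with probability proportional
// to its weight; it falls back to a uniform choice if all weights are zero.
func pickWeightedRandom(pods []ScoredPod, rng *rand.Rand) ScoredPod {
	total := 0.0
	for _, p := range pods {
		total += p.Weight
	}
	if total == 0 {
		return pods[rng.Intn(len(pods))]
	}
	draw := rng.Float64() * total
	for _, p := range pods {
		draw -= p.Weight
		if draw <= 0 {
			return p
		}
	}
	return pods[len(pods)-1] // guard against floating-point rounding
}
```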
Request Flow
The request flow will be as follows:
- Request is received by the gateway; it may carry TTFT and TPOT SLOs and PredictionBasedRouting in the header.
- `slo-aware-profile-handler` checks the PredictionBasedRouting header: if true, proceed with the SLO-aware scheduling profile instead of the default:
  a. For each potential pod, run latency prediction and store it in memory along the request path.
  b. Identify "valid" pods, defined as pods predicted to serve the request within its SLO and within the SLOs of all their running requests, or that have no running requests (checked against each pod's list of running requests).
- Prediction Based Routing (done in the `slo-scorer` plugin; see the scoring sketch after this list):
  - if len(valid_pods) > 0: Return a weighted random draw (done by `weighted-random-picker`) favoring pods with the lowest OR highest positive headroom, based on a "scheduling strategy" environment variable:
    - Lowest: Assign to pods that have just enough resources to meet the SLO, preserving pods with high headroom for large critical requests.
    - Highest: Assign to pods that have substantial resources to meet the SLO, so as to evenly distribute load.
    - (Both options may include a very small chance of choosing an invalid pod, for exploration for training purposes.)
  - else if len(valid_pods) == 0:
    - if the request is critical: Return a weighted random draw favoring pods with the lowest negative headroom (the least "overwhelmed" pods among those not meeting the SLO).
    - else if the request is NOT critical: shed the request [Saturation Detection].
- Once a pod is decided, store the request with its predicted TTFT/TPOT in the datastore under that pod's running requests.
- Forward the request to the selected pod endpoint; `slo-request-tracker` adds it to the datastore's running request queue for that pod (see the tracking sketch after this list).
- Continuously add the history of actual and predicted latencies to the running requests on the pod in the datastore.
- Once a request completes, `slo-request-tracker` removes it from the queue.
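The scoring branch above could be sketched roughly as follows. The types, field names, strategy values, and weighting formulas are assumptions for illustration; in the actual design, scoring and picking are separate plugins (`slo-scorer` assigns weights, `weighted-random-picker` draws).

```go
package scorer

// Illustrative sketch of the Prediction Based Routing branch described in
// the request flow. Not the actual slo-scorer implementation.

type podHeadroom struct {
	name     string
	headroom float64 // predicted slack against the tightest SLO on the pod, in ms
}

// scorePods returns per-pod weights for the weighted-random-picker, plus a
// flag telling the caller to shed the request instead of scheduling it.
func scorePods(pods []podHeadroom, critical bool, strategy string) (map[string]float64, bool) {
	var valid, invalid []podHeadroom
	for _, p := range pods {
		if p.headroom > 0 {
			valid = append(valid, p)
		} else {
			invalid = append(invalid, p)
		}
	}

	weights := make(map[string]float64)
	switch {
	case len(valid) > 0:
		for _, p := range valid {
			if strategy == "lowest" {
				// Just enough headroom: keep roomy pods free for large
				// critical requests.
				weights[p.name] = 1.0 / (1.0 + p.headroom)
			} else {
				// Most headroom: spread load evenly across pods.
				weights[p.name] = p.headroom
			}
		}
		return weights, false
	case critical:
		// No valid pods: favor the least negative headroom, i.e. the
		// least overwhelmed pods among those not meeting the SLO.
		for _, p := range invalid {
			weights[p.name] = 1.0 / (1.0 - p.headroom)
		}
		return weights, false
	default:
		// Non-critical request with no valid pods: shed it (saturation
		// detection).
		return nil, true
	}
}
```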
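Similarly, the tracking steps could be sketched as below; the datastore and sidecar client interfaces shown here are assumptions, not the real EPP APIs.

```go
package tracker

// Illustrative sketch of the slo-request-tracker lifecycle: predict on
// scheduling, record on completion.

type predictionClient interface {
	// Predict returns predicted TTFT/TPOT (in ms) for a request on a pod.
	Predict(pod, requestID string) (ttft, tpot float64, err error)
}

type trainingClient interface {
	// Record sends actual vs. predicted latencies to the training sidecar.
	Record(requestID string, actualTTFT, actualTPOT, predictedTTFT, predictedTPOT float64) error
}

type datastore interface {
	AddRunningRequest(pod, requestID string, predictedTTFT, predictedTPOT float64)
	RemoveRunningRequest(pod, requestID string)
}

// onScheduled runs once a pod has been selected: ask the prediction sidecar
// for TTFT/TPOT and park the request under that pod's running requests.
func onScheduled(ds datastore, pc predictionClient, pod, requestID string) error {
	ttft, tpot, err := pc.Predict(pod, requestID)
	if err != nil {
		return err
	}
	ds.AddRunningRequest(pod, requestID, ttft, tpot)
	return nil
}

// onComplete runs when the request finishes: forward the training data to
// the training sidecar, then drop the request from the pod's queue.
func onComplete(ds datastore, tc trainingClient, pod, requestID string,
	actualTTFT, actualTPOT, predictedTTFT, predictedTPOT float64) error {
	if err := tc.Record(requestID, actualTTFT, actualTPOT, predictedTTFT, predictedTPOT); err != nil {
		return err
	}
	ds.RemoveRunningRequest(pod, requestID)
	return nil
}
```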
Diagram
(flow diagram image)
Key Considerations
Headroom
We define headroom not as the difference between a specific request's SLO and its predicted latency, but rather as the difference between the lowest SLO among all running requests on the pod and the request's predicted latency.
Example:
Say we want to serve a request with an 80ms TPOT SLO.
pod-1 has a running request with a 50ms TPOT SLO (likely currently being served at <50ms TPOT). The predictor says that scheduling the new request would result in a predicted latency of 60ms TPOT, so the scorer considers this pod to have -10ms of headroom, not 20ms.
If no other pod has positive headroom and the request is non-critical, we shed it, as it is predicted to violate its SLO (i.e. no pod has positive headroom for it); if it is critical, we schedule it to the pod with the least negative headroom.
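A minimal sketch of this calculation, assuming TPOT headroom is measured against the tightest TPOT SLO on the pod (falling back to the incoming request's own SLO for an idle pod, which is an assumption here):

```go
package slo

// Toy TPOT headroom calculation matching the example above: headroom is
// measured against the lowest TPOT SLO among the pod's running requests,
// not against the new request's SLO alone. All values are in ms.
func tpotHeadroom(newRequestSLO float64, runningSLOs []float64, predictedTPOT float64) float64 {
	tightest := newRequestSLO // idle pod: only the new request's SLO applies
	for _, slo := range runningSLOs {
		if slo < tightest {
			tightest = slo
		}
	}
	return tightest - predictedTPOT
}

// tpotHeadroom(80, []float64{50}, 60) == -10, not 80-60 == 20.
```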
Pods in Datastore
We add predictions and SLOs to a new priority queue of running requests on each pod; each entry is stored as:
- Request ID
- SLO
- Predicted TTFT
- Predicted TPOT
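A possible shape for this per-pod priority queue, ordered so the tightest SLO sits at the top, might be the following; field names and the choice of ordering key are assumptions for illustration.

```go
package datastore

import "container/heap"

// Sketch of the per-pod priority queue of running requests described above.
type runningRequest struct {
	RequestID     string
	SLO           float64 // ms; used as the ordering key here
	PredictedTTFT float64 // ms
	PredictedTPOT float64 // ms
}

// requestQueue implements heap.Interface as a min-heap on SLO.
type requestQueue []runningRequest

func (q requestQueue) Len() int           { return len(q) }
func (q requestQueue) Less(i, j int) bool { return q[i].SLO < q[j].SLO }
func (q requestQueue) Swap(i, j int)      { q[i], q[j] = q[j], q[i] }

func (q *requestQueue) Push(x interface{}) { *q = append(*q, x.(runningRequest)) }
func (q *requestQueue) Pop() interface{} {
	old := *q
	n := len(old)
	item := old[n-1]
	*q = old[:n-1]
	return item
}

// tightestSLO adds a request to the pod's queue and returns the lowest SLO
// currently running on that pod, which is what headroom is measured against.
func tightestSLO(q *requestQueue, r runningRequest) float64 {
	heap.Push(q, r)
	return (*q)[0].SLO
}
```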
Criticality
Criticality will be handled by the layer above scheduling, allowing the scheduler to focus on efficient bin-packing. The saturation detector will be responsible for shedding non-critical requests if SLOs cannot be met.
Prefix Score Reuse
The new SLO-aware profile can reuse the existing prefix score logic.
No SLO Provided
If the latency prediction flag is enabled in EPP, we require all requests to provide an SLO and return an error otherwise. We decided on this behavior because a mixture of SLO and non-SLO requests ends up serving the non-SLO requests better, which is counterintuitive: by specifying a latency target you expect the algorithm to try to meet it. It is best to set this expectation as a requirement.
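A minimal sketch of this check, assuming the SLOs arrive as request headers (the header names below are hypothetical):

```go
package epp

import "errors"

// Minimal sketch of the "SLO required" check when latency prediction is
// enabled. The header names are hypothetical, not the actual request schema.
func validateSLOHeaders(headers map[string]string, predictionEnabled bool) error {
	if !predictionEnabled {
		return nil
	}
	if headers["ttft-slo"] == "" || headers["tpot-slo"] == "" {
		return errors.New("latency prediction is enabled: requests must provide TTFT and TPOT SLOs")
	}
	return nil
}
```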
v0 Implementation
For our initial implementation, the only feature we don't intend to incorporate is Saturation Detection inside the SLO profile. This is because upcoming flow control changes will alter how criticality behaves, and we want to maintain compatibility with that new system.