docs/source/guides/autoscaling.mdx
Scaling to zero means that your Inference Endpoint will go idle after a given duration of inactivity. This can be very useful when you want to optimize for low costs or when your workloads are intermittent.
Scaling to zero replicas helps optimize cost savings by minimizing resource usage during periods of inactivity. However, it's important to
be aware that scaling to zero implies a cold start period when the endpoint receives a new request. Additionally, the proxy will
respond with the status code `503` while the new replica is initializing. To potentially avoid this, you can also add the
`X-Scale-Up-Timeout` header to your requests. This means that when the endpoint is scaling, the proxy will hold the request until a replica
is ready, or time out after the specified number of seconds. For example, `X-Scale-Up-Timeout: 600` would wait for up to 600 seconds.
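As a minimal sketch, the header can be attached to a request like any other HTTP header. This example assumes a Python client using the `requests` library; the endpoint URL and token below are placeholders, not real values.

```python
import requests

# Placeholder endpoint URL and token -- substitute your own Inference Endpoint values.
ENDPOINT_URL = "https://my-endpoint.endpoints.huggingface.cloud"

headers = {
    "Authorization": "Bearer hf_xxx",
    "Content-Type": "application/json",
    # Ask the proxy to hold the request for up to 600 seconds while a new
    # replica starts, instead of returning a `503` right away.
    "X-Scale-Up-Timeout": "600",
}

# Prepare the request without sending it, so the headers can be inspected;
# in practice you would call `requests.post(ENDPOINT_URL, headers=headers, json=...)`.
request = requests.Request(
    "POST", ENDPOINT_URL, headers=headers, json={"inputs": "Hello!"}
).prepare()
print(request.headers["X-Scale-Up-Timeout"])  # → 600
```

Choose the timeout to comfortably cover your model's cold start time; if the replica is not ready within the given number of seconds, the request still fails.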
<Tip>
Note that scaling up can take a few minutes depending on the model, which means that scaling from 0 to 1 based on a request is typically not recommended if your