Commit 7a03397

add header and correct status code
1 parent 9f41d7a commit 7a03397

1 file changed

+4
-3
lines changed

docs/source/guides/autoscaling.mdx

Lines changed: 4 additions & 3 deletions
@@ -18,9 +18,10 @@ Scaling to zero means that your Inference Endpoint will go idle after a given du
 very useful when you want to optimize for low costs or when your workloads are intermittent.
 
 Scaling to zero replicas helps optimize cost savings by minimizing resource usage during periods of inactivity. However, it's important to
-be aware that scaling to 0 implies a cold start period when the endpoint receives a new request. Additionally, the HTTP server will
-respond with a status code `502 Bad Gateway` while the new replica is initializing. Currently there is no queueing system in place for
-incoming requests.
+be aware that scaling to zero implies a cold start period when the endpoint receives a new request. Additionally, the proxy will
+respond with the status code `503` while the new replica is initializing. To potentially avoid this, you can also add the
+'X-Scale-Up-Timeout' header to your requests. This means that when the endpoint is scaling the proxy will hold the request until a replica
+is ready, or timeout after the specified amount of seconds. For example 'X-Scale-Up-Timeout: 600' would wait for 600 seconds.
 
 <Tip>
 Note that scaling up can take a few minutes depending on the model, which means that scaling from 0 to 1 based on a request is typically not recommended if your
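
For reviewers who want to see what the new wording implies for a client, here is a minimal sketch of sending the `X-Scale-Up-Timeout` header described in the added lines. Only the header name, the example value `600`, and the `503` status come from the diff. The URL, the `{"inputs": ...}` payload shape, the bearer-token `Authorization` header, and the helper names `build_request` and `send` are assumptions made for illustration.

```python
import json
import urllib.error
import urllib.request


def build_request(url, payload, token=None, scale_up_timeout_s=600):
    """Build a POST request that asks the proxy to hold it during scale-up.

    The header name and the meaning of its value (seconds) come from the
    docs change; the payload shape and auth scheme are assumptions.
    """
    headers = {
        "Content-Type": "application/json",
        # Per the diff: the proxy holds the request until a replica is
        # ready, or times out after this many seconds.
        "X-Scale-Up-Timeout": str(scale_up_timeout_s),
    }
    if token:
        # Assumption: the endpoint expects a bearer token.
        headers["Authorization"] = f"Bearer {token}"
    data = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(url, data=data, headers=headers, method="POST")


def send(request, client_timeout_s):
    """Send the request and return (status_code, body_bytes).

    HTTP error statuses such as 503 are returned rather than raised, so the
    caller can decide whether to retry.
    """
    try:
        with urllib.request.urlopen(request, timeout=client_timeout_s) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as err:
        return err.code, err.read()
```

A caller might use `build_request("<endpoint URL>", {"inputs": "hello"}, scale_up_timeout_s=600)` and then `send(req, client_timeout_s=630)`. The client-side timeout is deliberately a little longer than the header value, so the local socket does not give up before the proxy does; a `503` result would still need to be handled, since the diff only says the header can "potentially" avoid it.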
