- **Max Running Request**: The maximum number of concurrent requests.
- **Max Prefill Tokens** (per batch): The maximum number of tokens that can be processed in a single prefill operation. This controls the batch size for the prefill phase and helps manage memory usage during prompt processing.
…and "fp8_e4m3". Using lower precision types can reduce memory usage but may slightly reduce accuracy.
For more advanced configuration, you can pass any of the [Server Arguments that SGLang supports](https://docs.sglang.ai/backend/server_arguments.html) as container arguments. For example, changing the `schedule-policy` to `lpm` would look like this:
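The example the page refers to was lost in extraction; a minimal sketch, assuming the arguments are supplied as plain CLI flags in the endpoint's container-arguments field:

```shell
# Extra container arguments passed through to the SGLang server
--schedule-policy lpm
```

The `lpm` (longest-prefix-match) policy reorders waiting requests to maximize KV-cache prefix reuse, which can improve throughput for workloads with shared prompt prefixes.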
- **Quantization**: Which quantization method, if any, to use for the model.
- **Max Number of Tokens (per query)**: The maximum number of tokens a single request can contain.
You can find the models that are supported by TGI:
- A selection of popular models in the [Inference Endpoints Catalog](https://endpoints.huggingface.co/huggingface/catalog)
If a model is supported by TGI, the Inference Endpoints UI will indicate this by enabling or disabling the corresponding selection under the `Container Type` configuration.
- **Max Number of Sequences**: The maximum number of sequences (requests) that can be processed together in a single batch. This controls the batch size by sequence count, affecting throughput and memory usage. For example, if `max_num_seqs=8`, up to 8 different prompts can be processed in parallel.
…and "fp8_e4m3". Using lower precision types can reduce memory usage but may slightly reduce accuracy.
For more advanced configuration, you can pass any of the [Engine Arguments that vLLM supports](https://docs.vllm.ai/en/stable/api/vllm/engine/arg_utils.html#vllm.engine.arg_utils.EngineArgs) as container arguments. For example, changing `enable_lora` to `true` would look like this:
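The example itself did not survive extraction; a minimal sketch, assuming the flags are supplied in the endpoint's container-arguments field (vLLM exposes `enable_lora` on the CLI as `--enable-lora`):

```shell
# Extra container arguments passed through to the vLLM server
--enable-lora
```

With this flag set, the server can serve LoRA adapters alongside the base model rather than only the base weights.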