Commit e022cb4: fix links
1 parent 598a96b

File tree

17 files changed, +56 -56 lines

Two binary image assets changed (147 KB and 277 KB).
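Every hunk in this commit makes the same substitution: asset URLs that pointed at the `update/large-rewrite` working branch are rewritten to point at `main`. A change like this is easy to script; the snippet below is a minimal sketch of the substitution (the helper function and constants are illustrative, not the tooling actually used for this commit):

```python
# Sketch of the link fix applied across the 17 files in this commit:
# point raw.githubusercontent.com asset URLs at the main branch instead
# of the update/large-rewrite working branch.
OLD = "hf-endpoints-documentation/update/large-rewrite/"
NEW = "hf-endpoints-documentation/main/"

def fix_link(url: str) -> str:
    """Rewrite a single asset URL from the working branch to main."""
    return url.replace(OLD, NEW)

before = ("https://raw.githubusercontent.com/huggingface/"
          "hf-endpoints-documentation/update/large-rewrite/assets/about.png")
after = fix_link(before)
print(after)
# -> https://raw.githubusercontent.com/huggingface/hf-endpoints-documentation/main/assets/about.png
```

URLs that do not contain the working-branch segment pass through unchanged, so the same function can be mapped over every file safely.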

docs/source/about.mdx

Lines changed: 1 addition & 1 deletion

@@ -18,7 +18,7 @@ Inference Endpoints brings all these pieces together into a single managed servi
 inference engine, and Inference Endpoints takes care of the rest—provisioning infrastructure, deploying your model, and making it
 accessible via a simple API. This lets you focus on building your application, while we handle the complexity of production AI deployment.

-![about](https://raw.githubusercontent.com/huggingface/hf-endpoints-documentation/update/large-rewrite/assets/about.png)
+![about](https://raw.githubusercontent.com/huggingface/hf-endpoints-documentation/main/assets/about.png)

 ## Inference Enginges

docs/source/engines/custom_container.mdx

Lines changed: 1 addition & 1 deletion

@@ -5,7 +5,7 @@ it on Inference Endpoints. You can use public images like `tensorflow/serving:2.
 [Docker Hub](https://hub.docker.com/), [AWS ECR](https://aws.amazon.com/ecr/?nc1=h_ls),
 [Azure ACR](https://azure.microsoft.com/de-de/services/container-registry/), or [Google GCR](https://cloud.google.com/container-registry?hl=de).

-![custom container](https://raw.githubusercontent.com/huggingface/hf-endpoints-documentation/update/large-rewrite/assets/custom_container/custom-container.png)
+![custom container](https://raw.githubusercontent.com/huggingface/hf-endpoints-documentation/main/assets/custom_container/custom-container.png)

 The [creation flow](/docs/inference-endpoints/guides/create_endpoint) of your image artifacts from a custom image is the same as the
 base image. This means Inference Endpoints will create a unique image artifact derived from your provided image, including all model

docs/source/engines/sglang.mdx

Lines changed: 2 additions & 2 deletions

@@ -15,7 +15,7 @@ embedding models (e5-mistral, gte, mcdse) and reward models (Skywork), with easy

 ## Configuration

-![sglang](https://raw.githubusercontent.com/huggingface/hf-endpoints-documentation/update/large-rewrite/assets/sglang/sglang.png)
+![sglang](https://raw.githubusercontent.com/huggingface/hf-endpoints-documentation/main/assets/sglang/sglang.png)

 - **Max Running Request**: the max number of concurrent requests
 - **Max Prefill Tokens** (per batch): the maximum number of tokens that can be processed in a single prefill operation. This controls the batch size for the prefill phase and helps manage memory usage during prompt processing.

@@ -31,7 +31,7 @@ and "fp8_e4m3". Using lower precision types can reduce memory usage but may slig
 For more advanced configuration you can pass any of the [Server Arguments that SGlang supports](https://docs.sglang.ai/backend/server_arguments.html)
 as container arguments. For example changing the `schedule-policy` to `lpm` would look like this:

-![sglang-advanced](https://raw.githubusercontent.com/huggingface/hf-endpoints-documentation/update/large-rewrite/assets/sglang/sglang-advanced.png)
+![sglang-advanced](https://raw.githubusercontent.com/huggingface/hf-endpoints-documentation/main/assets/sglang/sglang-advanced.png)

 ## Supported models

docs/source/engines/tei.mdx

Lines changed: 1 addition & 1 deletion

@@ -13,7 +13,7 @@ Key Features:

 ## Configuration

-![config](https://raw.githubusercontent.com/huggingface/hf-endpoints-documentation/update/large-rewrite/assets/tei/tei.png)
+![config](https://raw.githubusercontent.com/huggingface/hf-endpoints-documentation/main/assets/tei/tei.png)

 - **Max Tokens (per batch)**: Number of tokens that can be added to a batch before forcing queries to wait in the internal queue.
 - **Max Concurrent Requests**: The maximum number of requests that the server can handle at once.

docs/source/engines/tgi.mdx

Lines changed: 2 additions & 2 deletions

@@ -16,7 +16,7 @@ the container URL](https://raw.githubusercontent.com/not-here)
 When selecting a model to deploy, the Inference Endpoints UI automatically checks whether a model is supported by TGI. If it is, you'll see
 the option presented under `Container Configuration` where you can change the following settings:

-![config](https://raw.githubusercontent.com/huggingface/hf-endpoints-documentation/update/large-rewrite/assets/tgi/tgi_config.png)
+![config](https://raw.githubusercontent.com/huggingface/hf-endpoints-documentation/main/assets/tgi/tgi_config.png)

 - **Quantization**: Which quantization method, if any, to use for the model.
 - **Max Number of Tokens (per query)**: Changes the maximum amount of tokens a request can contain.

@@ -54,7 +54,7 @@ You can find the models that are supported by TGI:
 - A selection of popular models in the [Inference Endpoints Catalog](https://endpoints.huggingface.co/huggingface/catalog)

 If a model is supported by TGI, the Inference Endpoints UI will indicate this by disabling/enabling the selection under `Container Type` configuration.
-![selection](https://raw.githubusercontent.com/huggingface/hf-endpoints-documentation/update/large-rewrite/assets/tgi/tgi_selection.png)
+![selection](https://raw.githubusercontent.com/huggingface/hf-endpoints-documentation/main/assets/tgi/tgi_selection.png)

 ## References

docs/source/engines/vllm.mdx

Lines changed: 2 additions & 2 deletions

@@ -12,7 +12,7 @@ batching, and decoding—all wrapped in a production-ready server. For most use

 ## Configuration

-![config](https://raw.githubusercontent.com/huggingface/hf-endpoints-documentation/update/large-rewrite/assets/vllm/vllm_config.png)
+![config](https://raw.githubusercontent.com/huggingface/hf-endpoints-documentation/main/assets/vllm/vllm_config.png)

 - **Max Number of Sequences**: The maximum number of sequences (requests) that can be processed together in a single batch. Controls
 the batch size by sequence count, affecting throughput and memory usage. For example, if max_num_seqs=8, up to 8 different prompts can

@@ -27,7 +27,7 @@ and "fp8_e4m3". Using lower precision types can reduce memory usage but may slig
 For more advanced configuration you can pass any of the [Engine Arguments that vLLM supports](https://docs.vllm.ai/en/stable/api/vllm/engine/arg_utils.html#vllm.engine.arg_utils.EngineArgs)
 as container arguments. For example changing the `enable_lora` to `true` would look like this:

-![vllm-advanced](https://raw.githubusercontent.com/huggingface/hf-endpoints-documentation/update/large-rewrite/assets/vllm/vllm-advanced.png)
+![vllm-advanced](https://raw.githubusercontent.com/huggingface/hf-endpoints-documentation/main/assets/vllm/vllm-advanced.png)

 ## Supported models

docs/source/guides/analytics.mdx

Lines changed: 8 additions & 8 deletions

@@ -4,13 +4,13 @@ The Analytics page is like the control center for your deployed models. It tells
 calling your models, about hardware usage, latencies, and much more. In this documentation we'll dive into what each metric means and
 how to analyze the graphs.

-![intro](https://raw.githubusercontent.com/huggingface/hf-endpoints-documentation/update/large-rewrite/assets/analytics/1-intro.png)
+![intro](https://raw.githubusercontent.com/huggingface/hf-endpoints-documentation/main/assets/analytics/1-intro.png)

 In the top bar, you can configure for which time frame you'll inspect the metrics, this setting affects all graphs on the page. You can choose
 between any of the existing settings from the drop down, or click-and-drag over any graph for a custom timeframe. You can also enable/disable
 auto refresh or view the metrics per replica or for all.

-![config](https://raw.githubusercontent.com/huggingface/hf-endpoints-documentation/update/large-rewrite/assets/analytics/2-config.png)
+![config](https://raw.githubusercontent.com/huggingface/hf-endpoints-documentation/main/assets/analytics/2-config.png)

 ## Understanding the graphs

@@ -28,7 +28,7 @@ classes, but by switching the toggle you can view them by individual status. As
 We recommend checking the [MDN web docs](https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Status) for more information on individual
 status codes.

-![http](https://raw.githubusercontent.com/huggingface/hf-endpoints-documentation/update/large-rewrite/assets/analytics/3-http-reqs.png)
+![http](https://raw.githubusercontent.com/huggingface/hf-endpoints-documentation/main/assets/analytics/3-http-reqs.png)

 ### Pending Requests

@@ -37,7 +37,7 @@ being processed. If this metric increases too much, it means that your requests
 to finish. In this case you should consider increasing your number of replicas or alternatively use autoscaling, you can read more about
 it in the [autoscaling guide](./autoscaling#scalingbasedonpendingrequests(betafeature))

-![pending](https://raw.githubusercontent.com/huggingface/hf-endpoints-documentation/update/large-rewrite/assets/analytics/4-pending-reqs.png)
+![pending](https://raw.githubusercontent.com/huggingface/hf-endpoints-documentation/main/assets/analytics/4-pending-reqs.png)

 ### Latency Distribution

@@ -52,19 +52,19 @@ Usually a good metric is also to look at how big the difference is between the m
 uniform the latency is, whereas if the difference is large, it means that the users of your Inference Endpoint have in general a fast response but
 the worst case latencies can be long.

-![latency](https://raw.githubusercontent.com/huggingface/hf-endpoints-documentation/update/large-rewrite/assets/analytics/5-latency.png)
+![latency](https://raw.githubusercontent.com/huggingface/hf-endpoints-documentation/main/assets/analytics/5-latency.png)

 ### Replica Status

 In the replica status graph, you'll see in the basic view how many running replicas you have during a point in time. The red line shows
 what your current maximum replicas setting is.

-![status](https://raw.githubusercontent.com/huggingface/hf-endpoints-documentation/update/large-rewrite/assets/analytics/6-status.png)
+![status](https://raw.githubusercontent.com/huggingface/hf-endpoints-documentation/main/assets/analytics/6-status.png)

 If you toggle the advanced setting, you'll instead see the different statuses of the individual replicas, going from *pending* all the way
 to *running*. This is very useful to get a sense of how long it takes for an endpoint to actually become ready to serve requests.

-![advanced](https://raw.githubusercontent.com/huggingface/hf-endpoints-documentation/update/large-rewrite/assets/analytics/7-status-advanced.png)
+![advanced](https://raw.githubusercontent.com/huggingface/hf-endpoints-documentation/main/assets/analytics/7-status-advanced.png)

 ### Hardware Usage

@@ -75,7 +75,7 @@ The last four graphs are dedicated for the hardware usage. You'll find:
 - GPU usage: How much of the GPU's processing power is being used.
 - GPU Memory (VRAM) usage: How much GPU memory is being used.

-![usage](https://raw.githubusercontent.com/huggingface/hf-endpoints-documentation/update/large-rewrite/assets/analytics/8-usage.png)
+![usage](https://raw.githubusercontent.com/huggingface/hf-endpoints-documentation/main/assets/analytics/8-usage.png)

 If you have autoscaling based on hardware utilization enabled, these are the metrics that determine your autoscaling behaviour. You can
 read more about autoscaling [here](./autoscaling#scalingbasedonhardwareutilization)

docs/source/guides/autoscaling.mdx

Lines changed: 1 addition & 1 deletion

@@ -5,7 +5,7 @@ utilization. By leveraging autoscaling, you can seamlessly handle varying worklo

 You can find the autoscaling setting for you endpoints under the "Settings tab" on the Inference Endpoint card.

-![settings](https://raw.githubusercontent.com/huggingface/hf-endpoints-documentation/update/large-rewrite/assets/autoscaling/settings.png)
+![settings](https://raw.githubusercontent.com/huggingface/hf-endpoints-documentation/main/assets/autoscaling/settings.png)

 <Tip>
 In the Analytics section of the guide, you can read more about how to track all the metrics mentioned in this documentation.
