- **Max Running Request**: The maximum number of concurrent requests.
- **Max Prefill Tokens** (per batch): The maximum number of tokens that can be processed in a single prefill operation. This controls the batch size for the prefill phase and helps manage memory usage during prompt processing.
…and "fp8_e4m3". Using lower precision types can reduce memory usage but may slightly reduce accuracy.
For more advanced configuration, you can pass any of the [Server Arguments that SGLang supports](https://docs.sglang.ai/backend/server_arguments.html) as container arguments. For example, changing the `schedule-policy` to `lpm` would look like this:
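The example the page refers to was lost in extraction; a minimal sketch, assuming the arguments are supplied as plain CLI flags in the endpoint's container-arguments field:

```shell
# Extra container arguments passed through to the SGLang server
--schedule-policy lpm
```

The `lpm` (longest-prefix-match) policy reorders waiting requests to maximize KV-cache prefix reuse, which can improve throughput for workloads with shared prompt prefixes.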
- **Quantization**: Which quantization method, if any, to use for the model.
- **Max Number of Tokens (per query)**: The maximum number of tokens a single request can contain.
You can find the models that are supported by TGI:
- A selection of popular models in the [Inference Endpoints Catalog](https://endpoints.huggingface.co/huggingface/catalog)
If a model is supported by TGI, the Inference Endpoints UI will indicate this by enabling or disabling the corresponding selection under the `Container Type` configuration.
- **Max Number of Sequences**: The maximum number of sequences (requests) that can be processed together in a single batch. This controls the batch size by sequence count, affecting throughput and memory usage. For example, if `max_num_seqs=8`, up to 8 different prompts can be processed in parallel.
…and "fp8_e4m3". Using lower precision types can reduce memory usage but may slightly reduce accuracy.
For more advanced configuration, you can pass any of the [Engine Arguments that vLLM supports](https://docs.vllm.ai/en/stable/api/vllm/engine/arg_utils.html#vllm.engine.arg_utils.EngineArgs) as container arguments. For example, changing `enable_lora` to `true` would look like this:
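The example itself did not survive extraction; a minimal sketch, assuming the flags are supplied in the endpoint's container-arguments field (vLLM exposes `enable_lora` on the CLI as `--enable-lora`):

```shell
# Extra container arguments passed through to the vLLM server
--enable-lora
```

With this flag set, the server can serve LoRA adapters alongside the base model rather than only the base weights.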