# Run benchmarking with `trtllm-serve`

TensorRT-LLM provides an OpenAI-compatible API via the `trtllm-serve` command.
A complete reference for the API is available in the [OpenAI API Reference](https://platform.openai.com/docs/api-reference).

This step-by-step tutorial covers the following topics for running online serving benchmarks with Llama 3.1 70B:
 * Methodology Introduction
 * Launch the OpenAI-compatible server with the NGC container
 * Run the performance benchmark
 * Using `extra_llm_api_options`

## Methodology Introduction

The overall performance benchmarking involves:
 1. Launch the OpenAI-compatible service with `trtllm-serve`
 2. Run the benchmark with [benchmark_serving.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/serve/scripts/benchmark_serving.py)

## Launch the NGC container

TensorRT-LLM distributes pre-built containers on the [NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/tags).

You can launch the container using the following command:

```bash
docker run --rm --ipc host -p 8000:8000 --gpus all -it nvcr.io/nvidia/tensorrt-llm/release
```
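
The scripts later in this tutorial read the model from `/path/to/llama3.1_70B` inside the container. If your checkpoint lives on the host, mount it into the container; the host path below is a placeholder for your own weights directory.

```bash
# Illustrative only: bind-mount a host directory with the Llama 3.1 70B
# checkpoint so that trtllm-serve inside the container can read it.
docker run --rm --ipc host -p 8000:8000 --gpus all \
    -v /host/models/llama3.1_70B:/path/to/llama3.1_70B \
    -it nvcr.io/nvidia/tensorrt-llm/release
```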

## Start the trtllm-serve service

> [!WARNING]
> The commands and configurations presented in this document are for illustrative purposes only.
> They serve as examples and may not deliver the optimal performance for your specific use case.
> Users are encouraged to tune the parameters based on their hardware and workload.

For benchmarking purposes, first create a bash script using the following code and name it `start.sh`.
```bash
#!/bin/bash
model_path=/path/to/llama3.1_70B
extra_llm_api_file=/tmp/extra-llm-api-config.yml

cat << EOF > ${extra_llm_api_file}
enable_attention_dp: false
print_iter_log: true
cuda_graph_config:
  enable_padding: true
  max_batch_size: 1024
kv_cache_config:
  dtype: fp8
EOF

# Note: --max_seq_len must cover the benchmark's input + output lengths (1024 + 1024).
trtllm-serve ${model_path} \
    --max_batch_size 1024 \
    --max_num_tokens 2048 \
    --max_seq_len 2048 \
    --kv_cache_free_gpu_memory_fraction 0.9 \
    --tp_size 1 \
    --ep_size 1 \
    --trust_remote_code \
    --extra_llm_api_options ${extra_llm_api_file}
```

> [!NOTE]
> `trtllm-llmapi-launch` is a script that launches the LLM-API code on
> Slurm-like systems, and can support multi-node and multi-GPU setups.
> For example: `trtllm-llmapi-launch trtllm-serve ...`

Run the `start.sh` script in the **background** with the following command:

```bash
bash -x start.sh &
```

Once the server is up, it prints log output like the following:

```bash
INFO: Started server process [80833]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://localhost:8000 (Press CTRL+C to quit)
```
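
Before running the benchmark, you can optionally verify that the endpoint is reachable. The check below assumes the standard OpenAI-compatible `/v1/models` route is served on port 8000; adjust the URL if your host or port differs.

```bash
# Poll until the server answers, then list the served model(s).
# The /v1/models route is assumed from the OpenAI-compatible API.
until curl -sf http://localhost:8000/v1/models > /dev/null; do
    echo "waiting for trtllm-serve to come up..."
    sleep 10
done
curl -s http://localhost:8000/v1/models
```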

## Run the benchmark

Similar to starting `trtllm-serve`, create a script to execute the benchmark using the following code and name it `bench.sh`.

```bash
concurrency_list="1 2 4 8 16 32 64 128 256"
multi_round=5   # requests per concurrency level: num_prompts = concurrency * multi_round
isl=1024        # input sequence length
osl=1024        # output sequence length
result_dir=/tmp/llama3.1_output
model_path=/path/to/llama3.1_70B

for concurrency in ${concurrency_list}; do
    num_prompts=$((concurrency * multi_round))
    python -m tensorrt_llm.serve.scripts.benchmark_serving \
        --model ${model_path} \
        --backend openai \
        --dataset-name "random" \
        --random-input-len ${isl} \
        --random-output-len ${osl} \
        --random-prefix-len 0 \
        --num-prompts ${num_prompts} \
        --max-concurrency ${concurrency} \
        --ignore-eos \
        --save-result \
        --result-dir "${result_dir}" \
        --result-filename "concurrency_${concurrency}.json" \
        --percentile-metrics "ttft,tpot,itl,e2el"
done
```

Then we can run the benchmark using the command below:

```bash
bash -x bench.sh &> output_bench.log
```

Below is some example TensorRT-LLM serving benchmark output. Your actual results may vary.

```
============ Serving Benchmark Result ============
Successful requests:                     1
Benchmark duration (s):                  1.64
Total input tokens:                      1024
Total generated tokens:                  1024
Request throughput (req/s):              0.61
Output token throughput (tok/s):         622.56
Total Token throughput (tok/s):          1245.12
User throughput (tok/s):                 623.08
Mean Request AR:                         0.9980
Median Request AR:                       0.9980
---------------Time to First Token----------------
Mean TTFT (ms):                          12.83
Median TTFT (ms):                        12.83
P99 TTFT (ms):                           12.83
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          1.59
Median TPOT (ms):                        1.59
P99 TPOT (ms):                           1.59
---------------Inter-token Latency----------------
Mean ITL (ms):                           1.59
Median ITL (ms):                         1.59
P99 ITL (ms):                            1.77
----------------End-to-end Latency----------------
Mean E2EL (ms):                          1643.44
Median E2EL (ms):                        1643.44
P99 E2EL (ms):                           1643.44
==================================================
```
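
Because `bench.sh` passes `--save-result`, each concurrency level also writes a JSON file into `${result_dir}`. The sketch below is one way to collect a few headline numbers from those files; the selected key names are assumptions about the JSON schema produced by your version of `benchmark_serving.py`, so inspect one file with `jq .` first and adjust the filter accordingly.

```bash
# Summarize the saved per-concurrency results (key names are assumptions;
# print a whole file with `jq .` to see the real schema).
result_dir=/tmp/llama3.1_output
for f in "${result_dir}"/concurrency_*.json; do
    echo "== ${f} =="
    jq '{request_throughput, mean_ttft_ms, mean_tpot_ms}' "${f}"
done
```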

### Key Metrics

* Median Time to First Token (TTFT)
  * The typical time elapsed from when a request is sent until the first output token is generated.
* Median Time Per Output Token (TPOT)
  * The typical time required to generate each token *after* the first one.
* Median Inter-Token Latency (ITL)
  * The typical time delay between the completion of one token and the completion of the next.
* Median End-to-End Latency (E2EL)
  * The typical total time from when a request is submitted until the final token of the response is received (see the sanity check after this list).
* Total Token Throughput
  * The combined rate at which the system processes both input (prompt) tokens and output (generated) tokens.

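As a rough consistency check on the example output above, the end-to-end latency should be close to the time to first token plus the time per output token multiplied by the number of remaining tokens. With `bc` (values taken from the sample run):

```bash
# 12.83 ms (TTFT) + 1.59 ms (TPOT) * 1023 remaining tokens ~= 1639 ms,
# which is close to the reported mean E2EL of 1643.44 ms.
echo "12.83 + 1.59 * (1024 - 1)" | bc -l
```
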
## About `extra_llm_api_options`

`trtllm-serve` provides the `extra_llm_api_options` knob to **overwrite** the parameters specified by `trtllm-serve`. Generally, we create a YAML file that contains various performance switches, for example:

```yaml
cuda_graph_config:
  enable_padding: true
print_iter_log: true
kv_cache_config:
  dtype: fp8
enable_attention_dp: true
```

The following is a list of common performance switches.

#### `kv_cache_config`

**Description**: A section for configuring the key-value (KV) cache.

**Options**:

* `dtype`: Sets the data type for the KV cache.
  * **Default**: `auto` (uses the data type specified in the model checkpoint).

#### `cuda_graph_config`

**Description**: A section for configuring CUDA graphs to optimize performance.

**Options**:

* `enable_padding`: If `true`, input batches are padded to the nearest `cuda_graph_batch_size`. This can significantly improve performance.
  * **Default**: `false`
* `max_batch_size`: Sets the maximum batch size for which a CUDA graph will be created.
  * **Default**: `0`
  * **Recommendation**: Set this to the same value as the `--max_batch_size` command-line option.
* `batch_sizes`: A specific list of batch sizes to create CUDA graphs for.
  * **Default**: `None`

#### `moe_config`

**Description**: Configuration for Mixture-of-Experts (MoE) models.

**Options**:

* `backend`: The backend to use for MoE operations.
  * **Default**: `CUTLASS`

#### `attention_backend`

**Description**: The backend to use for attention calculations.

**Default**: `TRTLLM`

See the [TorchLlmArgs class](https://nvidia.github.io/TensorRT-LLM/llm-api/reference.html#tensorrt_llm.llmapi.TorchLlmArgs) for the full list of options that can be used in `extra_llm_api_options`.
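
For reference, the sketch below combines several of the switches documented above into a single config file, following the same heredoc pattern as `start.sh`. The values are illustrative assumptions rather than tuned recommendations; `moe_config` is omitted because it only applies to Mixture-of-Experts models, not a dense model such as Llama 3.1 70B.

```bash
# Illustrative only: write an extra LLM API config exercising the options
# described above, then point trtllm-serve at it.
cat << EOF > /tmp/extra-llm-api-config.yml
kv_cache_config:
  dtype: fp8
cuda_graph_config:
  enable_padding: true
  max_batch_size: 1024   # recommended to match --max_batch_size
attention_backend: TRTLLM
EOF

trtllm-serve /path/to/llama3.1_70B --extra_llm_api_options /tmp/extra-llm-api-config.yml
```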