# Quick Start Recipe for GPT-OSS on TensorRT-LLM - Blackwell Hardware

## Introduction

This deployment guide provides step-by-step instructions for running the GPT-OSS model using TensorRT-LLM, optimized for NVIDIA GPUs. It covers the complete setup required: from accessing model weights and preparing the software environment to configuring TensorRT-LLM parameters, launching the server, and validating inference output.

The guide is intended for developers and practitioners seeking high-throughput or low-latency inference using NVIDIA’s accelerated stack, from pulling the TensorRT-LLM container on NGC to serving the model with `trtllm-serve`.
## Prerequisites

* GPU: NVIDIA Blackwell Architecture
* OS: Linux
* Drivers: CUDA Driver 575 or Later
* Docker with NVIDIA Container Toolkit installed
* Python3 and python3-pip (Optional, for accuracy evaluation only)

## Models

* MXFP4 model: [GPT-OSS-120B](https://huggingface.co/openai/gpt-oss-120b)


## MoE Backend Support Matrix

There are multiple MoE backends inside TensorRT-LLM. The table below shows the support matrix for these backends.

| Device     | Activation Type | MoE Weights Type | MoE Backend | Use Case       |
|------------|-----------------|------------------|-------------|----------------|
| B200/GB200 | MXFP8           | MXFP4            | TRTLLM      | Low Latency    |
| B200/GB200 | MXFP8           | MXFP4            | CUTLASS     | Max Throughput |

The default MoE backend is `CUTLASS`, so for a combination that `CUTLASS` does not support, you must set `moe_config.backend` explicitly to run the model (see the YAML examples in the configuration section below).

## Deployment Steps

### Run Docker Container

Run the docker container using the TensorRT-LLM NVIDIA NGC image.

```shell
docker run --rm -it \
--ipc=host \
--gpus all \
-p 8000:8000 \
-v ~/.cache:/root/.cache:rw \
--name tensorrt_llm \
nvcr.io/nvidia/tensorrt-llm/release:1.0.0rc6 \
/bin/bash
```

Note:

* The command mounts your user `.cache` directory to save the downloaded model checkpoints, which are saved to `~/.cache/huggingface/hub/` by default. This prevents having to re-download the weights each time you rerun the container. If the `~/.cache` directory doesn't exist, create it with `mkdir ~/.cache`.
* You can mount additional directories and paths using the `-v <host_path>:<container_path>` flag if needed, such as mounting the downloaded weight paths (see the example after this list).
* The command also maps port `8000` from the container to your host so you can access the LLM API endpoint from your host.
* See <https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/tags> for all available containers. Containers published weekly from the main branch carry an `rcN` suffix, while the monthly releases with QA tests have no `rcN` suffix. Use an `rc` release to get the latest model and feature support.
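
For example, if you keep a local copy of the weights outside the Hugging Face cache, you can prepare the cache directory and add an extra mount to the same `docker run` command. This is only an illustrative sketch; the host path `/data/models/gpt-oss-120b` is an assumed location, substitute your own.

```shell
# Create the host cache directory if it does not exist yet.
mkdir -p ~/.cache

# Same container launch as above, with one extra read-only mount for locally stored weights.
docker run --rm -it \
--ipc=host \
--gpus all \
-p 8000:8000 \
-v ~/.cache:/root/.cache:rw \
-v /data/models/gpt-oss-120b:/workspace/gpt-oss-120b:ro \
--name tensorrt_llm \
nvcr.io/nvidia/tensorrt-llm/release:1.0.0rc6 \
/bin/bash
```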

If you want to use the latest main branch, you can instead build and install TensorRT-LLM from source; the steps are described at <https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html>.

### Creating the TRT-LLM Server config

Create a YAML configuration file `/tmp/config.yml` for the TensorRT-LLM server and populate it with the following recommended performance settings.

For low latency with the `TRTLLM` MoE backend:

```shell
EXTRA_LLM_API_FILE=/tmp/config.yml

cat << EOF > ${EXTRA_LLM_API_FILE}
enable_attention_dp: false
cuda_graph_config:
  enable_padding: true
  max_batch_size: 128
moe_config:
  backend: TRTLLM
EOF
```

For maximum throughput with the `CUTLASS` MoE backend:

```shell
EXTRA_LLM_API_FILE=/tmp/config.yml

cat << EOF > ${EXTRA_LLM_API_FILE}
enable_attention_dp: true
cuda_graph_config:
  enable_padding: true
  max_batch_size: 128
moe_config:
  backend: CUTLASS
EOF
```
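
Aside from the MoE backend, the two configurations differ in `enable_attention_dp`: the max-throughput recipe enables attention data parallelism, while the low-latency recipe keeps it disabled.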

### Launch the TRT-LLM Server

Below is an example command to launch the TRT-LLM server with the GPT-OSS model from within the container. The command is specifically configured for the 1024/1024 Input/Output Sequence Length test. The explanation of each flag is shown in the “Configs and Parameters” section.

```shell
trtllm-serve openai/gpt-oss-120b \
  --host 0.0.0.0 \
  --port 8000 \
  --backend pytorch \
  --max_batch_size 128 \
  --max_num_tokens 16384 \
  --max_seq_len 2048 \
  --kv_cache_free_gpu_memory_fraction 0.9 \
  --tp_size 8 \
  --ep_size 8 \
  --trust_remote_code \
  --extra_llm_api_options ${EXTRA_LLM_API_FILE}
```

After the server is set up, the client can now send prompt requests to the server and receive results.
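
Startup can take a while, since the model weights may need to be downloaded on first launch and the runtime warms up. A minimal sketch for blocking until the endpoint is ready is to poll the `/health` route described in the testing section below (adjust the interval and host to your environment):

```shell
# Poll the health endpoint until the server reports ready (HTTP 200).
until [ "$(curl -s -o /dev/null -w '%{http_code}' http://localhost:8000/health)" = "200" ]; do
  echo "Waiting for trtllm-serve to become ready..."
  sleep 10
done
echo "Server is ready to accept requests."
```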

### Configs and Parameters

These options are used directly on the command line when you start the `trtllm-serve` process.

#### `--tp_size`

* **Description:** Sets the **tensor-parallel size**. This should typically match the number of GPUs you intend to use for a single model instance.

#### `--ep_size`

* **Description:** Sets the **expert-parallel size** for Mixture-of-Experts (MoE) models. Like `tp_size`, this should generally match the number of GPUs you're using. This setting has no effect on non-MoE models.
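
For example, the launch command above serves GPT-OSS-120B on eight GPUs with `--tp_size 8` and `--ep_size 8`.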

#### `--kv_cache_free_gpu_memory_fraction`

* **Description:** A value between `0.0` and `1.0` that specifies the fraction of free GPU memory to reserve for the KV cache after the model is loaded. Since memory usage can fluctuate, this buffer helps prevent out-of-memory (OOM) errors.
* **Recommendation:** If you experience OOM errors, try reducing this value to `0.7` or lower.

#### `--backend pytorch`

* **Description:** Tells TensorRT-LLM to use the **PyTorch** backend.

#### `--max_batch_size`

* **Description:** The maximum number of user requests that can be grouped into a single batch for processing.

#### `--max_num_tokens`

* **Description:** The maximum total number of tokens (across all requests) allowed inside a single scheduled batch.

#### `--max_seq_len`

* **Description:** The maximum possible sequence length for a single request, including both input and generated output tokens.
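
For example, the launch command above targets the 1024/1024 input/output sequence length test and therefore sets `--max_seq_len 2048` (1024 input tokens plus 1024 output tokens).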

#### `--trust_remote_code`

* **Description:** Allows TensorRT-LLM to download models and tokenizers from Hugging Face. This flag is passed directly to the Hugging Face API.


#### Extra LLM API Options (YAML Configuration)

These options provide finer control over performance and are set within a YAML file passed to the `trtllm-serve` command via the `--extra_llm_api_options` argument.

#### `cuda_graph_config`

* **Description**: A section for configuring CUDA graphs to optimize performance.

* **Options**:

  * `enable_padding`: If `true`, input batches are padded to the nearest `cuda_graph_batch_size`. This can significantly improve performance.

    **Default**: `false`

  * `max_batch_size`: Sets the maximum batch size for which a CUDA graph will be created.

    **Default**: `0`

    **Recommendation**: Set this to the same value as the `--max_batch_size` command-line option. For example, the configurations in this guide use `max_batch_size: 128` to match `--max_batch_size 128`.

#### `moe_config`

* **Description**: Configuration for Mixture-of-Experts (MoE) models.

* **Options**:

  * `backend`: The backend to use for MoE operations.

    **Default**: `CUTLASS`

See the [`TorchLlmArgs` class](https://nvidia.github.io/TensorRT-LLM/llm-api/reference.html#tensorrt_llm.llmapi.TorchLlmArgs) for the full list of options that can be passed through `extra_llm_api_options`.

## Testing API Endpoint

### Basic Test

Start a new terminal on the host to test the TensorRT-LLM server you just launched.

You can query the health/readiness of the server using:

```shell
curl -s -o /dev/null -w "Status: %{http_code}\n" "http://localhost:8000/health"
```

When the `Status: 200` code is returned, the server is ready for queries. Note that the very first query may take longer due to initialization and compilation.

After the TRT-LLM server is set up and shows `Application startup complete`, you can send requests to the server.

```shell
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "openai/gpt-oss-120b",
  "messages": [
    {
      "role": "user",
      "content": "Where is New York?"
    }
  ],
  "max_tokens": 1024,
  "top_p": 1.0
}' -w "\n"
```

Here is an example response, showing that the TRT-LLM server reasons about and answers the question.

TODO: Use Chat Completions API / Responses API as the example after the PR is merged.

```json
{"id":"chatcmpl-c5bf51b5cab94e10ba5da5266d12ee59","object":"chat.completion","created":1755815898,"model":"openai/gpt-oss-120b","choices":[{"index":0,"message":{"role":"assistant","content":"analysisThe user asks: \"Where is New York?\" Likely they want location info. Provide answer: New York State in northeastern US, New York City on the east coast, coordinates, etc. Provide context.assistantfinal**New York** can refer to two related places in the United States:\n\n| What it is | Where it is | Approx. coordinates | How to picture it |\n|------------|------------|--------------------|-------------------|\n| **New York State** | The northeastern corner of the United States, bordered by **Vermont, Massachusetts, Connecticut, New Jersey, Pennsylvania, and the Canadian provinces of Ontario and Quebec**. | 42.7° N, 75.5° W (roughly the state’s geographic centre) | A roughly rectangular state that stretches from the Atlantic Ocean in the southeast to the Adirondack Mountains and the Great Lakes region in the north. |\n| **New York City (NYC)** | The largest city in the state, located on the **southern tip of the state** where the **Hudson River meets the Atlantic Ocean**. It occupies five boroughs: Manhattan, Brooklyn, Queens, The Bronx, and Staten Island. | 40.7128° N, 74.0060° W | A dense, world‑famous metropolis that sits on a series of islands (Manhattan, Staten Island, parts of the Bronx) and the mainland (Brooklyn and Queens). |\n\n### Quick geographic context\n- **On a map of the United States:** New York State is in the **Northeast** region, just east of the Great Lakes and north of Pennsylvania. \n- **From Washington, D.C.:** Travel roughly **225 mi (360 km) northeast**. \n- **From Boston, MA:** Travel about **215 mi (350 km) southwest**. \n- **From Toronto, Canada:** Travel about **500 mi (800 km) southeast**.\n\n### Travel tips\n- **By air:** Major airports include **John F. Kennedy International (JFK)**, **LaGuardia (LGA)**, and **Newark Liberty International (EWR)** (the latter is actually in New Jersey but serves the NYC metro area). \n- **By train:** Amtrak’s **Northeast Corridor** runs from **Boston → New York City → Washington, D.C.** \n- **By car:** Interstates **I‑87** (north‑south) and **I‑90** (east‑west) are the primary highways crossing the state.\n\n### Fun fact\n- The name “**New York**” was given by the English in 1664, honoring the Duke of York (later King James II). The city’s original Dutch name was **“New Amsterdam.”**\n\nIf you need more specific directions (e.g., how to get to a particular neighborhood, landmark, or the state capital **Albany**), just let me know!","reasoning_content":null,"tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null,"mm_embedding_handle":null,"disaggregated_params":null,"avg_decoded_tokens_per_iter":1.0}],"usage":{"prompt_tokens":72,"total_tokens":705,"completion_tokens":633},"prompt_token_ids":null}
```
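
The endpoint is OpenAI-compatible, so clients that expect token-by-token output can usually request streaming by adding `"stream": true` to the same payload. This is only a sketch; exact streaming behavior depends on your server version.

```shell
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "openai/gpt-oss-120b",
  "messages": [
    {
      "role": "user",
      "content": "Where is New York?"
    }
  ],
  "max_tokens": 256,
  "stream": true
}'
```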

### Troubleshooting Tips

* If you encounter CUDA out-of-memory errors, try reducing `max_batch_size` or `max_seq_len`.
* Ensure your model checkpoints are compatible with the expected format.
* For performance issues, check GPU utilization with `nvidia-smi` while the server is running (see the example commands after this list).
* If the container fails to start, verify that the NVIDIA Container Toolkit is properly installed.
* For connection issues, make sure the server port (`8000` in this guide) is not being used by another application.
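
For example, the following standard Linux and NVIDIA utilities (not specific to TensorRT-LLM) can help diagnose the GPU-utilization and port issues above:

```shell
# Watch GPU utilization and memory while the server is handling requests.
watch -n 1 nvidia-smi

# Check whether another process is already listening on port 8000.
ss -ltnp | grep ':8000'
```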

### Running Evaluations to Verify Accuracy (Optional)

We use OpenAI's official evaluation tool to test the model's accuracy. For more information see [gpt-oss-eval](https://github.com/openai/gpt-oss/tree/main/gpt_oss/evals).

TODO(@Binghan Chen): Add instructions for running gpt-oss-eval.

## Benchmarking Performance

To benchmark the performance of your TensorRT-LLM server you can leverage the built-in `benchmark_serving.py` script. To do this, first create a wrapper `bench.sh` script.

```shell
cat <<'EOF' > bench.sh
#!/usr/bin/env bash
set -euo pipefail

concurrency_list="32 64 128 256 512 1024 2048 4096"
multi_round=5
isl=1024
osl=1024
result_dir=/tmp/gpt_oss_output

for concurrency in ${concurrency_list}; do
  num_prompts=$((concurrency * multi_round))
  python -m tensorrt_llm.serve.scripts.benchmark_serving \
    --model openai/gpt-oss-120b \
    --backend openai \
    --dataset-name "random" \
    --random-input-len ${isl} \
    --random-output-len ${osl} \
    --random-prefix-len 0 \
    --random-ids \
    --num-prompts ${num_prompts} \
    --max-concurrency ${concurrency} \
    --ignore-eos \
    --tokenize-on-client \
    --percentile-metrics "ttft,tpot,itl,e2el"
done
EOF
chmod +x bench.sh
```

If you want to save the results to a file, add the following options to the `benchmark_serving` command in `bench.sh`.

```shell
--save-result \
--result-dir "${result_dir}" \
--result-filename "concurrency_${concurrency}.json"
```

For more benchmarking options see <https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/serve/scripts/benchmark_serving.py>.

Run `bench.sh` to begin a serving benchmark. This will take a long time if you run all the concurrencies listed in the `bench.sh` script above.

```shell
./bench.sh
```

Sample TensorRT-LLM serving benchmark output. Your results may vary due to ongoing software optimizations.

```
============ Serving Benchmark Result ============
Successful requests:                     16
Benchmark duration (s):                  17.66
Total input tokens:                      16384
Total generated tokens:                  16384
Request throughput (req/s):              [result]
Output token throughput (tok/s):         [result]
Total Token throughput (tok/s):          [result]
User throughput (tok/s):                 [result]
---------------Time to First Token----------------
Mean TTFT (ms):                          [result]
Median TTFT (ms):                        [result]
P99 TTFT (ms):                           [result]
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          [result]
Median TPOT (ms):                        [result]
P99 TPOT (ms):                           [result]
---------------Inter-token Latency----------------
Mean ITL (ms):                           [result]
Median ITL (ms):                         [result]
P99 ITL (ms):                            [result]
----------------End-to-end Latency----------------
Mean E2EL (ms):                          [result]
Median E2EL (ms):                        [result]
P99 E2EL (ms):                           [result]
==================================================
```

### Key Metrics

* Median Time to First Token (TTFT)
  * The typical time elapsed from when a request is sent until the first output token is generated.
* Median Time Per Output Token (TPOT)
  * The typical time required to generate each token *after* the first one.
* Median Inter-Token Latency (ITL)
  * The typical time delay between the completion of one token and the completion of the next.
* Median End-to-End Latency (E2EL)
  * The typical total time from when a request is submitted until the final token of the response is received.
* Total Token Throughput
  * The combined rate at which the system processes both input (prompt) tokens and output (generated) tokens.
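
As a rough sanity check on these numbers, for a single request the end-to-end latency is approximately `TTFT + TPOT * (output_length - 1)`, since every token after the first is produced at the per-output-token rate.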