Commit d94cc3f (1 parent 898f37f)

[TRTLLM-7321][doc] Add GPT-OSS Deployment Guide into official doc site (#7143)

Signed-off-by: Dongfeng Yu

File tree

2 files changed

+329
-0
lines changed

2 files changed

+329
-0
lines changed
Lines changed: 328 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,328 @@
1+
# Quick Start Recipe for GPT-OSS on TensorRT-LLM - Blackwell Hardware

## Introduction

This deployment guide provides step-by-step instructions for running the GPT-OSS model using TensorRT-LLM, optimized for NVIDIA GPUs. It covers the complete setup required: from accessing model weights and preparing the software environment to configuring TensorRT-LLM parameters, launching the server, and validating inference output.

The guide is intended for developers and practitioners seeking high-throughput or low-latency inference using NVIDIA’s accelerated stack, starting with the TensorRT-LLM container from NGC for model serving.
## Prerequisites

* GPU: NVIDIA Blackwell architecture
* OS: Linux
* Drivers: CUDA Driver 575 or later
* Docker with NVIDIA Container Toolkit installed (see the optional sanity-check commands after this list)
* Python3 and python3-pip (optional, for accuracy evaluation only)
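As an optional sanity check before proceeding (assuming a standard Docker plus NVIDIA Container Toolkit installation), the following commands confirm the driver version and that containers can see the GPUs; the CUDA base image tag below is only illustrative:

```shell
# Driver version should be 575 or later, and all Blackwell GPUs should be listed
nvidia-smi --query-gpu=name,driver_version --format=csv

# Confirm Docker can launch GPU-enabled containers via the NVIDIA Container Toolkit
docker run --rm --gpus all nvidia/cuda:12.8.0-base-ubuntu24.04 nvidia-smi
```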
## Models

* MXFP4 model: [GPT-OSS-120B](https://huggingface.co/openai/gpt-oss-120b)
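The server can download the checkpoint automatically on first start, but if you prefer to pre-fetch the weights into the cache directory that is mounted into the container later in this guide, a sketch using the Hugging Face CLI (assuming it is installed on the host) looks like this:

```shell
# Install the Hugging Face CLI and download the checkpoint into ~/.cache/huggingface/hub/
pip install -U "huggingface_hub[cli]"
huggingface-cli download openai/gpt-oss-120b
```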
## MoE Backend Support Matrix

There are multiple MoE backends inside TRT-LLM. The support matrix for the MoE backends is shown below.

| Device | Activation Type | MoE Weights Type | MoE Backend | Use Case |
|------------|------------------|------------------|-------------|----------------|
| B200/GB200 | MXFP8 | MXFP4 | TRTLLM | Low Latency |
| B200/GB200 | MXFP8 | MXFP4 | CUTLASS | Max Throughput |

The default MoE backend is `CUTLASS`, so for any combination that `CUTLASS` does not support, you must set `moe_config.backend` explicitly to run the model.
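For example, a minimal `--extra_llm_api_options` file that only overrides the MoE backend could look like the sketch below (the file name is arbitrary; the full recommended configurations, including CUDA graph settings, appear later in this guide):

```shell
cat << EOF > /tmp/moe_backend_only.yml
moe_config:
  backend: TRTLLM
EOF
```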
## Deployment Steps

### Run Docker Container

Run the docker container using the TensorRT-LLM NVIDIA NGC image.

```shell
docker run --rm -it \
    --ipc=host \
    --gpus all \
    -p 8000:8000 \
    -v ~/.cache:/root/.cache:rw \
    --name tensorrt_llm \
    nvcr.io/nvidia/tensorrt-llm/release:1.0.0rc6 \
    /bin/bash
```

Note:

* The command mounts your user `.cache` directory to save the downloaded model checkpoints, which are saved to `~/.cache/huggingface/hub/` by default. This prevents having to re-download the weights each time you rerun the container. If the `~/.cache` directory doesn’t exist, create it with `mkdir ~/.cache`.
* You can mount additional directories and paths using the `-v <host_path>:<container_path>` flag if needed, such as mounting the downloaded weight paths.
* The command also maps port `8000` from the container to your host so you can access the LLM API endpoint from your host.
* See <https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/tags> for all the available containers. The containers published weekly from the main branch have an `rcN` suffix, while the monthly releases with QA tests have no `rcN` suffix. Use the `rc` release to get the latest model and feature support.

If you want to use the latest main branch, you can build TensorRT-LLM from source instead; the steps are described at <https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html>.
### Creating the TRT-LLM Server config

Create a YAML configuration file `/tmp/config.yml` for the TensorRT-LLM server and populate it with the following recommended performance settings.

For low latency with the `TRTLLM` MoE backend:

```shell
EXTRA_LLM_API_FILE=/tmp/config.yml

cat << EOF > ${EXTRA_LLM_API_FILE}
enable_attention_dp: false
cuda_graph_config:
  enable_padding: true
  max_batch_size: 128
moe_config:
  backend: TRTLLM
EOF
```

For max throughput with the `CUTLASS` MoE backend:

```shell
EXTRA_LLM_API_FILE=/tmp/config.yml

cat << EOF > ${EXTRA_LLM_API_FILE}
enable_attention_dp: true
cuda_graph_config:
  enable_padding: true
  max_batch_size: 128
moe_config:
  backend: CUTLASS
EOF
```
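Optionally, you can check that the file parses as valid YAML before launching the server. This is only a quick sanity check and assumes PyYAML is available, which is typically the case inside the TensorRT-LLM container:

```shell
python3 -c "import yaml, pprint; pprint.pprint(yaml.safe_load(open('/tmp/config.yml')))"
```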
### Launch the TRT-LLM Server

Below is an example command to launch the TRT-LLM server with the GPT-OSS model from within the container. The command is specifically configured for the 1024/1024 Input/Output Sequence Length test. The explanation of each flag is shown in the “Configs and Parameters” section.

```shell
trtllm-serve openai/gpt-oss-120b \
    --host 0.0.0.0 \
    --port 8000 \
    --backend pytorch \
    --max_batch_size 128 \
    --max_num_tokens 16384 \
    --max_seq_len 2048 \
    --kv_cache_free_gpu_memory_fraction 0.9 \
    --tp_size 8 \
    --ep_size 8 \
    --trust_remote_code \
    --extra_llm_api_options ${EXTRA_LLM_API_FILE}
```

After the server is set up, the client can now send prompt requests to the server and receive results.
### Configs and Parameters

These options are used directly on the command line when you start the `trtllm-serve` process.

#### `--tp_size`

* **Description:** Sets the **tensor-parallel size**. This should typically match the number of GPUs you intend to use for a single model instance.

#### `--ep_size`

* **Description:** Sets the **expert-parallel size** for Mixture-of-Experts (MoE) models. Like `tp_size`, this should generally match the number of GPUs you're using. This setting has no effect on non-MoE models.

#### `--kv_cache_free_gpu_memory_fraction`

* **Description:** A value between `0.0` and `1.0` that specifies the fraction of free GPU memory to reserve for the KV cache after the model is loaded. Since memory usage can fluctuate, this buffer helps prevent out-of-memory (OOM) errors.
* **Recommendation:** If you experience OOM errors, try reducing this value to `0.7` or lower.

#### `--backend pytorch`

* **Description:** Tells TensorRT-LLM to use the **pytorch** backend.

#### `--max_batch_size`

* **Description:** The maximum number of user requests that can be grouped into a single batch for processing.

#### `--max_num_tokens`

* **Description:** The maximum total number of tokens (across all requests) allowed inside a single scheduled batch.

#### `--max_seq_len`

* **Description:** The maximum possible sequence length for a single request, including both input and generated output tokens. The example above uses `2048` to accommodate the 1024-token input plus 1024-token output test case.

#### `--trust_remote_code`

* **Description:** Allows TensorRT-LLM to download models and tokenizers from Hugging Face. This flag is passed directly to the Hugging Face API.

#### Extra LLM API Options (YAML Configuration)

These options provide finer control over performance and are set within a YAML file passed to the `trtllm-serve` command via the `--extra_llm_api_options` argument.

#### `cuda_graph_config`

* **Description**: A section for configuring CUDA graphs to optimize performance.

* **Options**:

  * `enable_padding`: If `"true"`, input batches are padded to the nearest `cuda_graph_batch_size`. This can significantly improve performance.

    **Default**: `false`

  * `max_batch_size`: Sets the maximum batch size for which a CUDA graph will be created.

    **Default**: `0`

    **Recommendation**: Set this to the same value as the `--max_batch_size` command-line option.

#### `moe_config`

* **Description**: Configuration for Mixture-of-Experts (MoE) models.

* **Options**:

  * `backend`: The backend to use for MoE operations.

    **Default**: `CUTLASS`

See the [`TorchLlmArgs` class](https://nvidia.github.io/TensorRT-LLM/llm-api/reference.html#tensorrt_llm.llmapi.TorchLlmArgs) for the full list of options which can be used in the `extra_llm_api_options`.
## Testing API Endpoint

### Basic Test

Start a new terminal on the host to test the TensorRT-LLM server you just launched.

You can query the health/readiness of the server using:

```shell
curl -s -o /dev/null -w "Status: %{http_code}\n" "http://localhost:8000/health"
```

When the `Status: 200` code is returned, the server is ready for queries. Note that the very first query may take longer due to initialization and compilation.
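If you are scripting the deployment, a simple polling loop against the same `/health` endpoint (a minimal sketch, with no timeout handling) can wait until the server is ready:

```shell
until curl -sf "http://localhost:8000/health" > /dev/null; do
    echo "Waiting for the TRT-LLM server to start..."
    sleep 10
done
echo "Server is ready."
```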
After the TRT-LLM server is set up and shows `Application startup complete`, you can send requests to the server.

```shell
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
    "model": "openai/gpt-oss-120b",
    "messages": [
        {
            "role": "user",
            "content": "Where is New York?"
        }
    ],
    "max_tokens": 1024,
    "top_p": 1.0
}' -w "\n"
```

Here is an example response, showing that the TRT-LLM server reasons about and answers the question.

TODO: Use Chat Completions API / Responses API as the example after the PR is merged.

```json
{"id":"chatcmpl-c5bf51b5cab94e10ba5da5266d12ee59","object":"chat.completion","created":1755815898,"model":"openai/gpt-oss-120b","choices":[{"index":0,"message":{"role":"assistant","content":"analysisThe user asks: \"Where is New York?\" Likely they want location info. Provide answer: New York State in northeastern US, New York City on the east coast, coordinates, etc. Provide context.assistantfinal**New York** can refer to two related places in the United States:\n\n| What it is | Where it is | Approx. coordinates | How to picture it |\n|------------|------------|--------------------|-------------------|\n| **New York State** | The northeastern corner of the United States, bordered by **Vermont, Massachusetts, Connecticut, New Jersey, Pennsylvania, and the Canadian provinces of Ontario and Quebec**. | 42.7° N, 75.5° W (roughly the state’s geographic centre) | A roughly rectangular state that stretches from the Atlantic Ocean in the southeast to the Adirondack Mountains and the Great Lakes region in the north. |\n| **New York City (NYC)** | The largest city in the state, located on the **southern tip of the state** where the **Hudson River meets the Atlantic Ocean**. It occupies five boroughs: Manhattan, Brooklyn, Queens, The Bronx, and Staten Island. | 40.7128° N, 74.0060° W | A dense, world‑famous metropolis that sits on a series of islands (Manhattan, Staten Island, parts of the Bronx) and the mainland (Brooklyn and Queens). |\n\n### Quick geographic context\n- **On a map of the United States:** New York State is in the **Northeast** region, just east of the Great Lakes and north of Pennsylvania. \n- **From Washington, D.C.:** Travel roughly **225 mi (360 km) northeast**. \n- **From Boston, MA:** Travel about **215 mi (350 km) southwest**. \n- **From Toronto, Canada:** Travel about **500 mi (800 km) southeast**.\n\n### Travel tips\n- **By air:** Major airports include **John F. Kennedy International (JFK)**, **LaGuardia (LGA)**, and **Newark Liberty International (EWR)** (the latter is actually in New Jersey but serves the NYC metro area). \n- **By train:** Amtrak’s **Northeast Corridor** runs from **Boston → New York City → Washington, D.C.** \n- **By car:** Interstates **I‑87** (north‑south) and **I‑90** (east‑west) are the primary highways crossing the state.\n\n### Fun fact\n- The name “**New York**” was given by the English in 1664, honoring the Duke of York (later King James II). The city’s original Dutch name was **“New Amsterdam.”**\n\nIf you need more specific directions (e.g., how to get to a particular neighborhood, landmark, or the state capital **Albany**), just let me know!","reasoning_content":null,"tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null,"mm_embedding_handle":null,"disaggregated_params":null,"avg_decoded_tokens_per_iter":1.0}],"usage":{"prompt_tokens":72,"total_tokens":705,"completion_tokens":633},"prompt_token_ids":null}
```
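Because the endpoint is OpenAI-compatible, you can also request a streamed response by adding the standard `stream` field. This is a sketch and assumes your server supports streamed chat completions:

```shell
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" --no-buffer -d '{
    "model": "openai/gpt-oss-120b",
    "messages": [
        {
            "role": "user",
            "content": "Give me three facts about New York."
        }
    ],
    "max_tokens": 256,
    "stream": true
}'
```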
### Troubleshooting Tips

* If you encounter CUDA out-of-memory errors, try reducing `max_batch_size` or `max_seq_len`.
* Ensure your model checkpoints are compatible with the expected format.
* For performance issues, check GPU utilization with `nvidia-smi` while the server is running (see the example commands after this list).
* If the container fails to start, verify that the NVIDIA Container Toolkit is properly installed.
* For connection issues, make sure the server port (`8000` in this guide) is not being used by another application.
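For the GPU-utilization and port checks mentioned above, the following host-side commands are one way to look; they assume `nvidia-smi` and `ss` (from iproute2) are available on the host:

```shell
# Refresh GPU utilization and memory usage every 2 seconds while the server handles requests
nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv -l 2

# Check whether anything is already listening on port 8000
ss -ltnp | grep ':8000' || echo "Port 8000 is free"
```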
### Running Evaluations to Verify Accuracy (Optional)

We use OpenAI's official evaluation tool to test the model's accuracy. For more information see [gpt-oss-eval](https://github.com/openai/gpt-oss/tree/main/gpt_oss/evals).

TODO(@Binghan Chen): Add instructions for running gpt-oss-eval.
## Benchmarking Performance

To benchmark the performance of your TensorRT-LLM server you can leverage the built-in `benchmark_serving.py` script. To do this, first create a wrapper `bench.sh` script.

```shell
cat <<'EOF' > bench.sh
#!/usr/bin/env bash
set -euo pipefail

concurrency_list="32 64 128 256 512 1024 2048 4096"
multi_round=5
isl=1024
osl=1024
result_dir=/tmp/gpt_oss_output

for concurrency in ${concurrency_list}; do
    num_prompts=$((concurrency * multi_round))
    python -m tensorrt_llm.serve.scripts.benchmark_serving \
        --model openai/gpt-oss-120b \
        --backend openai \
        --dataset-name "random" \
        --random-input-len ${isl} \
        --random-output-len ${osl} \
        --random-prefix-len 0 \
        --random-ids \
        --num-prompts ${num_prompts} \
        --max-concurrency ${concurrency} \
        --ignore-eos \
        --tokenize-on-client \
        --percentile-metrics "ttft,tpot,itl,e2el"
done
EOF
chmod +x bench.sh
```
If you want to save the results to a file, add the following options.

```shell
        --save-result \
        --result-dir "${result_dir}" \
        --result-filename "concurrency_${concurrency}.json"
```

For more benchmarking options see <https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/serve/scripts/benchmark_serving.py>.

Run `bench.sh` to begin a serving benchmark. This will take a long time if you run all the concurrencies listed in the `bench.sh` script above.

```shell
./bench.sh
```
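Because the full concurrency sweep can take a long time, you may optionally capture the console output to a log file as well (the log path here is arbitrary):

```shell
./bench.sh 2>&1 | tee /tmp/bench_console.log
```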
Sample TensorRT-LLM serving benchmark output. Your results may vary due to ongoing software optimizations.

```
============ Serving Benchmark Result ============
Successful requests:                     16
Benchmark duration (s):                  17.66
Total input tokens:                      16384
Total generated tokens:                  16384
Request throughput (req/s):              [result]
Output token throughput (tok/s):         [result]
Total Token throughput (tok/s):          [result]
User throughput (tok/s):                 [result]
---------------Time to First Token----------------
Mean TTFT (ms):                          [result]
Median TTFT (ms):                        [result]
P99 TTFT (ms):                           [result]
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          [result]
Median TPOT (ms):                        [result]
P99 TPOT (ms):                           [result]
---------------Inter-token Latency----------------
Mean ITL (ms):                           [result]
Median ITL (ms):                         [result]
P99 ITL (ms):                            [result]
----------------End-to-end Latency----------------
Mean E2EL (ms):                          [result]
Median E2EL (ms):                        [result]
P99 E2EL (ms):                           [result]
==================================================
```
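If you added the `--save-result` options shown earlier, each concurrency level also writes a JSON results file to the result directory (`/tmp/gpt_oss_output` in the `bench.sh` script above). A quick way to inspect one of them, assuming the concurrency-32 run completed:

```shell
ls /tmp/gpt_oss_output
python3 -m json.tool /tmp/gpt_oss_output/concurrency_32.json | head -n 40
```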
### Key Metrics

* Median Time to First Token (TTFT)
  * The typical time elapsed from when a request is sent until the first output token is generated.
* Median Time Per Output Token (TPOT)
  * The typical time required to generate each token *after* the first one.
* Median Inter-Token Latency (ITL)
  * The typical time delay between the completion of one token and the completion of the next.
* Median End-to-End Latency (E2EL)
  * The typical total time from when a request is submitted until the final token of the response is received.
* Total Token Throughput
  * The combined rate at which the system processes both input (prompt) tokens and output (generated) tokens.

docs/source/index.rst (1 addition, 0 deletions):

```diff
@@ -38,6 +38,7 @@ Welcome to TensorRT-LLM's Documentation!
     deployment-guide/quick-start-recipe-for-llama4-scout-on-trtllm.md
     deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.md
     deployment-guide/quick-start-recipe-for-llama3.3-70b-on-trtllm.md
+    deployment-guide/quick-start-recipe-for-gpt-oss-on-trtllm.md

 .. toctree::
```