# Run benchmarking with `trtllm-serve`

TensorRT-LLM provides an OpenAI-compatible API via the `trtllm-serve` command.
A complete reference for the API is available in the [OpenAI API Reference](https://platform.openai.com/docs/api-reference).

This step-by-step tutorial covers the following topics for running online serving benchmarks with Llama 3.1 70B:
 * Methodology Introduction
 * Launch the OpenAI-compatible server with the NGC container
 * Run the performance benchmark
 * Using `extra_llm_api_options`

## Methodology Introduction

The overall performance benchmarking involves:
 1. Launch the OpenAI-compatible service with `trtllm-serve`
 2. Run the benchmark with [benchmark_serving.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/serve/scripts/benchmark_serving.py)

## Launch the NGC container

TensorRT-LLM distributes pre-built containers on the [NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/tags).

You can launch the container using the following command:

```bash
docker run --rm --ipc host -p 8000:8000 --gpus all -it nvcr.io/nvidia/tensorrt-llm/release
```
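
The scripts later in this tutorial read the model from `/path/to/llama3.1_70B` inside the container. If your checkpoint lives on the host, mount it into the container; the host path below is a placeholder for your own weights directory.

```bash
# Illustrative only: bind-mount a host directory with the Llama 3.1 70B
# checkpoint so that trtllm-serve inside the container can read it.
docker run --rm --ipc host -p 8000:8000 --gpus all \
    -v /host/models/llama3.1_70B:/path/to/llama3.1_70B \
    -it nvcr.io/nvidia/tensorrt-llm/release
```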

## Start the trtllm-serve service

> [!WARNING]
> The commands and configurations presented in this document are for illustrative purposes only.
> They serve as examples and may not deliver the optimal performance for your specific use case.
> Users are encouraged to tune the parameters based on their hardware and workload.

For benchmarking purposes, first create a bash script using the following code and name it `start.sh`.
```bash
#!/bin/bash
model_path=/path/to/llama3.1_70B
extra_llm_api_file=/tmp/extra-llm-api-config.yml

cat << EOF > ${extra_llm_api_file}
enable_attention_dp: false
print_iter_log: true
cuda_graph_config:
  enable_padding: true
  max_batch_size: 1024
kv_cache_config:
  dtype: fp8
EOF

# Note: --max_seq_len must cover the benchmark's input + output lengths (1024 + 1024).
trtllm-serve ${model_path} \
    --max_batch_size 1024 \
    --max_num_tokens 2048 \
    --max_seq_len 2048 \
    --kv_cache_free_gpu_memory_fraction 0.9 \
    --tp_size 1 \
    --ep_size 1 \
    --trust_remote_code \
    --extra_llm_api_options ${extra_llm_api_file}
```

> [!NOTE]
> `trtllm-llmapi-launch` is a script that launches the LLM-API code on
> Slurm-like systems, and can support multi-node and multi-GPU setups.
> For example: `trtllm-llmapi-launch trtllm-serve ...`

Run the `start.sh` script in the **background** with the following command:

```bash
bash -x start.sh &
```

Once the server is up, it prints log output like the following:

```bash
INFO: Started server process [80833]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://localhost:8000 (Press CTRL+C to quit)
```
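
Before running the benchmark, you can optionally verify that the endpoint is reachable. The check below assumes the standard OpenAI-compatible `/v1/models` route is served on port 8000; adjust the URL if your host or port differs.

```bash
# Poll until the server answers, then list the served model(s).
# The /v1/models route is assumed from the OpenAI-compatible API.
until curl -sf http://localhost:8000/v1/models > /dev/null; do
    echo "waiting for trtllm-serve to come up..."
    sleep 10
done
curl -s http://localhost:8000/v1/models
```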

## Run the benchmark

Similar to starting `trtllm-serve`, create a script to execute the benchmark using the following code and name it `bench.sh`.

```bash
concurrency_list="1 2 4 8 16 32 64 128 256"
multi_round=5   # requests per concurrency level: num_prompts = concurrency * multi_round
isl=1024        # input sequence length
osl=1024        # output sequence length
result_dir=/tmp/llama3.1_output
model_path=/path/to/llama3.1_70B

for concurrency in ${concurrency_list}; do
    num_prompts=$((concurrency * multi_round))
    python -m tensorrt_llm.serve.scripts.benchmark_serving \
        --model ${model_path} \
        --backend openai \
        --dataset-name "random" \
        --random-input-len ${isl} \
        --random-output-len ${osl} \
        --random-prefix-len 0 \
        --num-prompts ${num_prompts} \
        --max-concurrency ${concurrency} \
        --ignore-eos \
        --save-result \
        --result-dir "${result_dir}" \
        --result-filename "concurrency_${concurrency}.json" \
        --percentile-metrics "ttft,tpot,itl,e2el"
done
```

Then we can run the benchmark using the command below:

```bash
bash -x bench.sh &> output_bench.log
```

Below is some example TensorRT-LLM serving benchmark output. Your actual results may vary.

```
============ Serving Benchmark Result ============
Successful requests:                     1
Benchmark duration (s):                  1.64
Total input tokens:                      1024
Total generated tokens:                  1024
Request throughput (req/s):              0.61
Output token throughput (tok/s):         622.56
Total Token throughput (tok/s):          1245.12
User throughput (tok/s):                 623.08
Mean Request AR:                         0.9980
Median Request AR:                       0.9980
---------------Time to First Token----------------
Mean TTFT (ms):                          12.83
Median TTFT (ms):                        12.83
P99 TTFT (ms):                           12.83
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          1.59
Median TPOT (ms):                        1.59
P99 TPOT (ms):                           1.59
---------------Inter-token Latency----------------
Mean ITL (ms):                           1.59
Median ITL (ms):                         1.59
P99 ITL (ms):                            1.77
----------------End-to-end Latency----------------
Mean E2EL (ms):                          1643.44
Median E2EL (ms):                        1643.44
P99 E2EL (ms):                           1643.44
==================================================
```
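
Because `bench.sh` passes `--save-result`, each concurrency level also writes a JSON file into `${result_dir}`. The sketch below is one way to collect a few headline numbers from those files; the selected key names are assumptions about the JSON schema produced by your version of `benchmark_serving.py`, so inspect one file with `jq .` first and adjust the filter accordingly.

```bash
# Summarize the saved per-concurrency results (key names are assumptions;
# print a whole file with `jq .` to see the real schema).
result_dir=/tmp/llama3.1_output
for f in "${result_dir}"/concurrency_*.json; do
    echo "== ${f} =="
    jq '{request_throughput, mean_ttft_ms, mean_tpot_ms}' "${f}"
done
```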

### Key Metrics

* Median Time to First Token (TTFT)
  * The typical time elapsed from when a request is sent until the first output token is generated.
* Median Time Per Output Token (TPOT)
  * The typical time required to generate each token *after* the first one.
* Median Inter-Token Latency (ITL)
  * The typical time delay between the completion of one token and the completion of the next.
* Median End-to-End Latency (E2EL)
  * The typical total time from when a request is submitted until the final token of the response is received (see the sanity check after this list).
* Total Token Throughput
  * The combined rate at which the system processes both input (prompt) tokens and output (generated) tokens.

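As a rough consistency check on the example output above, the end-to-end latency should be close to the time to first token plus the time per output token multiplied by the number of remaining tokens. With `bc` (values taken from the sample run):

```bash
# 12.83 ms (TTFT) + 1.59 ms (TPOT) * 1023 remaining tokens ~= 1639 ms,
# which is close to the reported mean E2EL of 1643.44 ms.
echo "12.83 + 1.59 * (1024 - 1)" | bc -l
```
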
## About `extra_llm_api_options`

`trtllm-serve` provides the `extra_llm_api_options` knob to **overwrite** the parameters specified by `trtllm-serve`. Generally, we create a YAML file that contains various performance switches, for example:

```yaml
cuda_graph_config:
  enable_padding: true
print_iter_log: true
kv_cache_config:
  dtype: fp8
enable_attention_dp: true
```

The following is a list of common performance switches.

#### `kv_cache_config`

**Description**: A section for configuring the key-value (KV) cache.

**Options**:

* `dtype`: Sets the data type for the KV cache.
  * **Default**: `auto` (uses the data type specified in the model checkpoint).

#### `cuda_graph_config`

**Description**: A section for configuring CUDA graphs to optimize performance.

**Options**:

* `enable_padding`: If `true`, input batches are padded to the nearest `cuda_graph_batch_size`. This can significantly improve performance.
  * **Default**: `false`
* `max_batch_size`: Sets the maximum batch size for which a CUDA graph will be created.
  * **Default**: `0`
  * **Recommendation**: Set this to the same value as the `--max_batch_size` command-line option.
* `batch_sizes`: A specific list of batch sizes to create CUDA graphs for.
  * **Default**: `None`

#### `moe_config`

**Description**: Configuration for Mixture-of-Experts (MoE) models.

**Options**:

* `backend`: The backend to use for MoE operations.
  * **Default**: `CUTLASS`

#### `attention_backend`

**Description**: The backend to use for attention calculations.

**Default**: `TRTLLM`

See the [TorchLlmArgs class](https://nvidia.github.io/TensorRT-LLM/llm-api/reference.html#tensorrt_llm.llmapi.TorchLlmArgs) for the full list of options that can be used in `extra_llm_api_options`.
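
For reference, the sketch below combines several of the switches documented above into a single config file, following the same heredoc pattern as `start.sh`. The values are illustrative assumptions rather than tuned recommendations; `moe_config` is omitted because it only applies to Mixture-of-Experts models, not a dense model such as Llama 3.1 70B.

```bash
# Illustrative only: write an extra LLM API config exercising the options
# described above, then point trtllm-serve at it.
cat << EOF > /tmp/extra-llm-api-config.yml
kv_cache_config:
  dtype: fp8
cuda_graph_config:
  enable_padding: true
  max_batch_size: 1024   # recommended to match --max_batch_size
attention_backend: TRTLLM
EOF

trtllm-serve /path/to/llama3.1_70B --extra_llm_api_options /tmp/extra-llm-api-config.yml
```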