
Commit 3ad4cfe

address review feedback
Signed-off-by: Frida Hou <[email protected]>
1 parent 1240129 commit 3ad4cfe

8 files changed: +48 -54 lines changed

docs/source/torch.md

Lines changed: 2 additions & 2 deletions
@@ -39,6 +39,6 @@ Here is a simple example to show how to use `tensorrt_llm.LLM` API with Llama mo
 
 - The PyTorch backend on SBSA is incompatible with bare metal environments like Ubuntu 24.04. Please use the [PyTorch NGC Container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch) for optimal support on SBSA platforms.
 
-## Prototype Feature
+## Prototype Features
 
-- [AutoDeploy: Seamless Model Deployment from PyTorch to TRT-LLM](./torch/auto_deploy/auto-deploy.md)
+- [AutoDeploy: Seamless Model Deployment from PyTorch to TensorRT-LLM](./torch/auto_deploy/auto-deploy.md)

docs/source/torch/auto_deploy/advanced/benchmarking_with_trtllm_bench.md

Lines changed: 7 additions & 7 deletions
@@ -4,11 +4,11 @@ AutoDeploy is integrated with the `trtllm-bench` performance benchmarking utilit
 
 ## Getting Started
 
-Before benchmarking with AutoDeploy, familiarize yourself with the general `trtllm-bench` workflow and best practices by reviewing the [TensorRT-LLM benchmarking guide](../../performance/perf-benchmarking.md#running-with-the-pytorch-workflow).
+Before benchmarking with AutoDeploy, review the [TensorRT-LLM benchmarking guide](../../performance/perf-benchmarking.md#running-with-the-pytorch-workflow) to familiarize yourself with the standard trtllm-bench workflow and best practices.
 
 ## Basic Usage
 
-The AutoDeploy backend can be invoked by specifying `--backend _autodeploy` in your `trtllm-bench` command:
+Invoke the AutoDeploy backend by specifying `--backend _autodeploy` in your `trtllm-bench` command:
 
 ```bash
 trtllm-bench \
@@ -19,12 +19,12 @@ trtllm-bench \
 ```
 
 ```{note}
-Similar to the PyTorch workflow, AutoDeploy does not require a separate `trtllm-bench build` step. The model is automatically optimized during the benchmark initialization phase.
+As in the PyTorch workflow, AutoDeploy does not require a separate `trtllm-bench build` step. The model is automatically optimized during benchmark initialization.
 ```
 
 ## Advanced Configuration
 
-For fine-tuned control over AutoDeploy's behavior during benchmarking, use the `--extra_llm_api_options` flag with a YAML configuration file:
+For more granular control over AutoDeploy's behavior during benchmarking, use the `--extra_llm_api_options` flag with a YAML configuration file:
 
 ```bash
 trtllm-bench \
@@ -62,7 +62,7 @@ attn_backend: flashinfer
 max_batch_size: 256
 ```
 
-Multi-gpu execution can be enabled by specifying `--tp n`, where `n` is the number of GPUs
+Enable multi-GPU execution by specifying `--tp n`, where `n` is the number of GPUs
 
 ## Configuration Options Reference
 
@@ -82,12 +82,12 @@ Multi-gpu execution can be enabled by specifying `--tp n`, where `n` is the numb
 | `cuda_graph_batch_sizes` | `null` | List of batch sizes for CUDA graph creation |
 
 ```{tip}
-For optimal performance with CUDA graphs, specify batch sizes that match your expected workload patterns. For example: `[1, 2, 4, 8, 16, 32, 64, 128]`
+For optimal CUDA graph performance, specify batch sizes that match your expected workload patterns. For example: `[1, 2, 4, 8, 16, 32, 64, 128]`
 ```
 
 ## Performance Optimization Tips
 
 1. **Memory Management**: Set `free_mem_ratio` to 0.8-0.9 for optimal KV cache utilization
 1. **Compilation Backend**: Use `torch-opt` for production workloads
 1. **Attention Backend**: `flashinfer` generally provides the best performance for most models
-1. **CUDA Graphs**: Enable for batch sizes matching your production traffic patterns
+1. **CUDA Graphs**: Enable CUDA graphs for batch sizes that match your production traffic patterns.
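The `--extra_llm_api_options` file referenced in this page is plain YAML and can also be generated programmatically. The following is a minimal sketch using only the option keys documented above; the output filename and the idea of scripting the file are illustrative assumptions, not part of `trtllm-bench` itself.

```python
# Sketch: write an extra_llm_api_options YAML for `trtllm-bench --backend _autodeploy`.
# The keys mirror the configuration reference in this page; the file name is arbitrary.
import yaml  # requires PyYAML

autodeploy_options = {
    "compile_backend": "torch-opt",        # compilation backend recommended for production
    "attn_backend": "flashinfer",          # attention backend
    "free_mem_ratio": 0.9,                 # fraction of free GPU memory reserved for KV cache
    "max_batch_size": 256,
    "cuda_graph_batch_sizes": [1, 2, 4, 8, 16, 32, 64, 128],
}

with open("autodeploy_config.yaml", "w") as f:
    yaml.safe_dump(autodeploy_options, f, sort_keys=False)

# Then pass the file via:
#   trtllm-bench ... --backend _autodeploy --extra_llm_api_options autodeploy_config.yaml
```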

docs/source/torch/auto_deploy/advanced/example_run.md

Lines changed: 4 additions & 4 deletions
@@ -1,19 +1,19 @@
 # Example Run Script
 
-To build and run AutoDeploy example, use `examples/auto_deploy/build_and_run_ad.py` script:
+To build and run AutoDeploy example, use the `examples/auto_deploy/build_and_run_ad.py` script:
 
 ```bash
 cd examples/auto_deploy
 python build_and_run_ad.py --model "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
 ```
 
-You can arbitrarily configure your experiment. Use the `-h/--help` flag to see available options:
+You can configure your experiment with various options. Use the `-h/--help` flag to see available options:
 
 ```bash
 python build_and_run_ad.py --help
 ```
 
-Below is a non-exhaustive list of common config options:
+The following is a non-exhaustive list of common configuration options:
 
 | Configuration Key | Description |
 |-------------------|-------------|
@@ -35,7 +35,7 @@ Below is a non-exhaustive list of common config options:
 
 For default values and additional configuration options, refer to the `ExperimentConfig` class in `examples/auto_deploy/build_and_run_ad.py` file.
 
-Here is a more complete example of using the script:
+The following is a more complete example of using the script:
 
 ```bash
 cd examples/auto_deploy

docs/source/torch/auto_deploy/advanced/expert_configurations.md

Lines changed: 12 additions & 15 deletions
@@ -1,28 +1,25 @@
 # Expert Configuration of LLM API
 
-For expert TensorRT-LLM users, we also expose the full set of `tensorrt_llm._torch.auto_deploy.llm_args.LlmArgs`
-*at your own risk* (the argument list diverges from TRT-LLM's argument list):
+For advanced TensorRT-LLM users, the full set of `tensorrt_llm._torch.auto_deploy.llm_args.LlmArgs` is exposed. Use at your own risk. The argument list may diverge from the standard TRT-LLM argument list.
 
-- All config fields that are used by the AutoDeploy core pipeline (i.e. the `InferenceOptimizer`) are
-  _exclusively_ exposed in the `AutoDeployConfig` in `tensorrt_llm._torch.auto_deploy.llm_args`.
+- All configuration fields used by the AutoDeploy core pipeline, `InferenceOptimizer`, are exposed exclusively in `AutoDeployConfig` in `tensorrt_llm._torch.auto_deploy.llm_args`.
   Please make sure to refer to those first.
-- For expert users we expose the full set of `LlmArgs` in `tensorrt_llm._torch.auto_deploy.llm_args`
-  that can be used to configure the AutoDeploy `LLM` API including runtime options.
+- For advanced users, the full set of `LlmArgs` in `tensorrt_llm._torch.auto_deploy.llm_args` can be used to configure the AutoDeploy `LLM` API, including runtime options.
 - Note that some fields in the full `LlmArgs`
   object are overlapping, duplicated, and/or _ignored_ in AutoDeploy, particularly arguments
   pertaining to configuring the model itself since AutoDeploy's model ingestion+optimize pipeline
   significantly differs from the default manual workflow in TensorRT-LLM.
 - However, with the proper care the full `LlmArgs`
   objects can be used to configure advanced runtime options in TensorRT-LLM.
-- Note that any valid field can be simply provided as keyword argument ("`**kwargs`") to the AutoDeploy `LLM` API.
+- Any valid field can be simply provided as keyword argument ("`**kwargs`") to the AutoDeploy `LLM` API.
 
 # Expert Configuration of `build_and_run_ad.py`
 
-For expert users, `build_and_run_ad.py` provides advanced configuration capabilities through a flexible argument parser powered by PyDantic Settings and OmegaConf. You can use dot notation for CLI arguments, provide multiple YAML configuration files, and leverage sophisticated configuration precedence rules to create complex deployment configurations.
+For advanced users, `build_and_run_ad.py` provides advanced configuration capabilities using a flexible argument parser powered by PyDantic Settings and OmegaConf. You can use dot notation for CLI arguments, provide multiple YAML configuration files, and utilize sophisticated configuration precedence rules to create complex deployment configurations.
 
 ## CLI Arguments with Dot Notation
 
-The script supports flexible CLI argument parsing using dot notation to modify nested configurations dynamically. You can target any field in both the `ExperimentConfig` in `examples/auto_deploy/build_and_run_ad.py` and nested `AutoDeployConfig`/`LlmArgs` objects in `tensorrt_llm._torch.auto_deploy.llm_args`:
+The script supports flexible CLI argument parsing using dot notation to modify nested configurations dynamically. You can target any field in both the `ExperimentConfig` in `examples/auto_deploy/build_and_run_ad.py` and nested `AutoDeployConfig` or `LlmArgs` objects in `tensorrt_llm._torch.auto_deploy.llm_args`:
 
 ```bash
 # Configure model parameters
@@ -35,7 +32,7 @@ python build_and_run_ad.py \
   --args.model-kwargs.hidden-size=2048 \
   --args.tokenizer-kwargs.padding-side=left
 
-# Configure runtime and backend settings
+# Configure runtime and backend options
 python build_and_run_ad.py \
   --model "TinyLlama/TinyLlama-1.1B-Chat-v1.0" \
   --args.world-size=2 \
@@ -55,7 +52,7 @@ python build_and_run_ad.py \
 
 ## YAML Configuration Files
 
-Both `ExperimentConfig` and `AutoDeployConfig`/`LlmArgs` inherit from `DynamicYamlMixInForSettings`, enabling you to provide multiple YAML configuration files that are automatically deep-merged at runtime.
+Both `ExperimentConfig` and `AutoDeployConfig`/`LlmArgs` inherit from `DynamicYamlMixInForSettings`, which enables you to provide multiple YAML configuration files that are automatically deep-merged at runtime.
 
 Create a YAML configuration file (e.g., `my_config.yaml`):
 
@@ -126,13 +123,13 @@ python build_and_run_ad.py \
 
 ## Configuration Precedence and Deep Merging
 
-The configuration system follows a strict precedence order where higher priority sources override lower priority ones:
+The configuration system follows a precedence order in which higher priority sources override lower priority ones:
 
 1. **CLI Arguments** (highest priority) - Direct command line arguments
 1. **YAML Configs** - Files specified via `--yaml-configs` and `--args.yaml-configs`
 1. **Default Settings** (lowest priority) - Built-in defaults from the config classes
 
-**Deep Merging**: Unlike simple overwriting, deep merging intelligently combines nested dictionaries recursively. For example:
+**Deep Merging**: Unlike simple overwriting, deep merging recursively combines nested dictionaries. For example:
 
 ```yaml
 # Base config
@@ -152,7 +149,7 @@ args:
   world_size: 4 # This gets added
 ```
 
-**Nested Config Behavior**: When using nested configurations, outer YAML configs become init settings for inner objects, giving them higher precedence:
+**Nested Config Behavior**: When using nested configurations, outer YAML configuration files become initialization settings for inner objects, giving them higher precedence:
 
 ```bash
 # The outer yaml-configs affects the entire ExperimentConfig
@@ -166,7 +163,7 @@ python build_and_run_ad.py \
 
 ## Built-in Default Configuration
 
-Both `AutoDeployConfig` and `LlmArgs` classes automatically load a built-in `default.yaml` configuration file that provides sensible defaults for the AutoDeploy inference optimizer pipeline. This file is specified in the `_get_config_dict()` function in `tensorrt_llm._torch.auto_deploy.llm_args` and defines default transform configurations for graph optimization stages.
+Both `AutoDeployConfig` and `LlmArgs` classes automatically load a built-in `default.yaml` configuration file that provides defaults for the AutoDeploy inference optimizer pipeline. This file is specified in the `_get_config_dict()` function in `tensorrt_llm._torch.auto_deploy.llm_args` and defines default transform configurations for graph optimization stages.
 
 The built-in defaults are automatically merged with your configurations at the lowest priority level, ensuring that your custom settings always override the defaults. You can inspect the current default configuration to understand the baseline transform pipeline:
 
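The precedence and deep-merging behavior described in this file can be pictured with a small, self-contained sketch. This is not the actual OmegaConf/Pydantic Settings machinery used by `build_and_run_ad.py`; it only mirrors the documented semantics, with nested dictionaries merged recursively and higher-priority sources (CLI over YAML over built-in defaults) winning on conflicting keys.

```python
# Illustrative deep merge: later (higher-priority) sources override earlier ones,
# and nested dicts are merged recursively instead of being replaced wholesale.
from copy import deepcopy


def deep_merge(base: dict, override: dict) -> dict:
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)  # recurse into nested configs
        else:
            merged[key] = value  # scalars and lists are overwritten by the higher-priority source
    return merged


defaults = {"args": {"world_size": 1, "model_kwargs": {"num_hidden_layers": 32}}}
yaml_config = {"args": {"model_kwargs": {"num_hidden_layers": 10}}}  # e.g. from --yaml-configs
cli_overrides = {"args": {"world_size": 4}}                          # e.g. from dot-notation CLI args

config = deep_merge(deep_merge(defaults, yaml_config), cli_overrides)
print(config)
# {'args': {'world_size': 4, 'model_kwargs': {'num_hidden_layers': 10}}}
```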
docs/source/torch/auto_deploy/advanced/logging.md

Lines changed: 2 additions & 2 deletions
@@ -1,6 +1,6 @@
 # Logging Level
 
-Use the following env variable to specify the logging level of our built-in logger ordered by
+Use the following env variable to specify the logging level of our built-in logger, ordered by
 decreasing verbosity;
 
 ```bash
@@ -11,4 +11,4 @@ AUTO_DEPLOY_LOG_LEVEL=ERROR
 AUTO_DEPLOY_LOG_LEVEL=INTERNAL_ERROR
 ```
 
-The default level is `INFO`.
+The default log level is `INFO`.
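Because `AUTO_DEPLOY_LOG_LEVEL` is an environment variable, it can also be set from Python before AutoDeploy is imported. The sketch below assumes the variable is read when the built-in logger is initialized, so it sets the value as early as possible; `DEBUG` is assumed to be a valid level alongside the ones listed in this file.

```python
# Sketch: set the AutoDeploy logging level from Python before the library is imported,
# assuming the built-in logger reads AUTO_DEPLOY_LOG_LEVEL at initialization time.
import os

os.environ["AUTO_DEPLOY_LOG_LEVEL"] = "DEBUG"  # assumed level; see the list above

from tensorrt_llm._torch.auto_deploy import LLM  # noqa: E402  import after setting the env var
```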

docs/source/torch/auto_deploy/advanced/workflow.md

Lines changed: 3 additions & 5 deletions
@@ -1,8 +1,8 @@
 ### Incorporating `auto_deploy` into your own workflow
 
-AutoDeploy can be seamlessly integrated into your existing workflows using TRT-LLM's LLM high-level API. This section provides a blueprint for configuring and invoking AutoDeploy within your custom applications.
+AutoDeploy can be seamlessly integrated into existing workflows using TRT-LLM's LLM high-level API. This section provides an example for configuring and invoking AutoDeploy in custom applications.
 
-Here is an example of how you can build an LLM object with AutoDeploy integration:
+The following example demonstrates how to build an LLM object with AutoDeploy integration:
 
 ```
 from tensorrt_llm._torch.auto_deploy import LLM
@@ -27,6 +27,4 @@ llm = LLM(
 
 ```
 
-Please consult the AutoDeploy `LLM` API in `tensorrt_llm._torch.auto_deploy.llm` and the
-`AutoDeployConfig` class in `tensorrt_llm._torch.auto_deploy.llm_args`
-for more detail on how AutoDeploy is configured via the `**kwargs` of the `LLM` API.
+For more information about configuring AutoDeploy via the `LLM` API using `**kwargs`, see the AutoDeploy LLM API in `tensorrt_llm._torch.auto_deploy.llm` and the `AutoDeployConfig` class in `tensorrt_llm._torch.auto_deploy.llm_args`.
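As a companion to the `LLM` snippet shown in this file, the following hedged sketch illustrates the `**kwargs`-style configuration that the revised wording points to. The field names (`world_size`, `compile_backend`, `attn_backend`) are taken from configuration tables elsewhere in this change and are assumed, not verified, to be valid `AutoDeployConfig`/`LlmArgs` fields; check `tensorrt_llm._torch.auto_deploy.llm_args` for the authoritative list.

```python
# Sketch: configure the AutoDeploy LLM through **kwargs, as described above.
# Field names are assumed to map onto AutoDeployConfig / LlmArgs.
from tensorrt_llm._torch.auto_deploy import LLM

llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    world_size=2,                 # multi-GPU execution
    compile_backend="torch-opt",  # compilation backend
    attn_backend="flashinfer",    # attention backend
)

# generate() usage assumed to follow the standard TensorRT-LLM LLM API.
outputs = llm.generate(["What is the capital of France?"])
print(outputs[0].outputs[0].text)
```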

docs/source/torch/auto_deploy/auto-deploy.md

Lines changed: 11 additions & 12 deletions
@@ -1,22 +1,21 @@
 # AutoDeploy
 
 ```{note}
-Note:
-This project is in active development and is currently in a prototype stage. The code is experimental, subject to change, and may include backward-incompatible updates. While we strive for correctness, we provide no guarantees regarding functionality, stability, or reliability.
+This project is under active development and is currently in a prototype stage. The code is experimental, subject to change, and may include backward-incompatible updates. While we strive for correctness, there are no guarantees regarding functionality, stability, or reliability.
 ```
 
-<h4> Seamless Model Deployment from PyTorch to TRT-LLM</h4>
+### Seamless Model Deployment from PyTorch to TensorRT-LLM
 
-AutoDeploy is a prototype designed to simplify and accelerate the deployment of PyTorch models, including off-the-shelf models like those from HuggingFace transformers library, to TensorRT-LLM.
+AutoDeploy is a prototype designed to simplify and accelerate the deployment of PyTorch models, including off-the-shelf models such as those from the Hugging Face Transformers library, to TensorRT-LLM.
 
 ![AutoDeploy overview](../../media/ad_overview.png)
 <sub><em>AutoDeploy overview and relation with TensorRT-LLM's LLM API</em></sub>
 
-AutoDeploy provides an alternative path for deploying models using the LLM API that does not require users to rewrite the source model (e.g., HuggingFace Transformers models) or manually implement various inference optimizations such as KV-caches, multi-GPU parallelism, quantization, etc. Instead, AutoDeploy extracts a computation graph from the source model and applies inference optimizations through a series of automated graph transformations. AutoDeploy generates an inference-optimized graph that can be directly executed in the TensorRT-LLM PyTorch runtime and leverages various runtime optimizations including in-flight batching, paging, and overlap scheduling.
+AutoDeploy provides an alternative method for deploying models using the LLM API without requiring code changes to the source model (for example, Hugging Face Transformers models) or manual implementation of inference optimizations, such as KV-caches, multi-GPU parallelism, or quantization. Instead, AutoDeploy extracts a computation graph from the source model and applies inference optimizations through a series of automated graph transformations. AutoDeploy generates an inference-optimized graph that can be directly executed in the TensorRT-LLM PyTorch runtime and leverages various runtime optimizations including in-flight batching, paging, and overlap scheduling.
 
-### **Key Features:**
+### Key Feature:
 
-- **Seamless Model Translation:** Automatically converts PyTorch/Hugging Face models to TRT-LLM without manual rewrites.
+- **Seamless Model Translation:** Automatically converts PyTorch/Hugging Face models to TensorRT-LLM without manual rewrites.
 - **Unified Model Definition:** Maintain a single source of truth with your original PyTorch/Hugging Face model.
 - **Optimized Inference:** Built-in transformations for sharding, quantization, KV-cache integration, MHA fusion, and CudaGraph optimization.
 - **Immediate Deployment:** Day-0 support for models with continuous performance enhancements.
@@ -26,7 +25,7 @@ AutoDeploy provides an alternative path for deploying models using the LLM API t
 
 1. **Install AutoDeploy:**
 
-AutoDeploy is accessible through TRT-LLM installation.
+AutoDeploy is included with the TRT-LLM installation.
 
 ```bash
 sudo apt-get -y install libopenmpi-dev && pip3 install --upgrade pip setuptools && pip3 install tensorrt_llm
@@ -36,9 +35,9 @@ You can refer to [TRT-LLM installation guide](../../installation/linux.md) for m
 
 2. **Run Llama Example:**
 
-You are ready to run an in-framework LLama Demo now.
+You are now ready to run an in-framework LLama Demo.
 
-The general entrypoint to run the auto-deploy demo is the `build_and_run_ad.py` script, Checkpoints are loaded directly from Huggingface (HF) or a local HF-like directory:
+The general entry point for running the AutoDeploy demo is the `build_and_run_ad.py` script, Checkpoints are loaded directly from Huggingface (HF) or a local HF-like directory:
 
 ```bash
 cd examples/auto_deploy
@@ -51,15 +50,15 @@ AutoDeploy streamlines the model deployment process through an automated workflo
 
 The exported graph then undergoes a series of automated transformations, including graph sharding, KV-cache insertion, and GEMM fusion, to optimize model performance. After these transformations, the graph is compiled using one of the supported compile backends (like `torch-opt`), followed by deploying it via the TensorRT-LLM runtime.
 
-- [Supported Matrix](support_matrix.md)
+- [Support Matrix](support_matrix.md)
 
 ## Advanced Usage
 
 - [Example Run Script](./advanced/example_run.md)
 - [Logging Level](./advanced/logging.md)
 - [Incorporating AutoDeploy into Your Own Workflow](./advanced/workflow.md)
 - [Expert Configurations](./advanced/expert_configurations.md)
-- [Performance benchmarking](./advanced/benchmarking_with_trtllm_bench.md)
+- [Performance Benchmarking](./advanced/benchmarking_with_trtllm_bench.md)
 
 ## Roadmap
 