docs/source/torch.md (2 additions, 2 deletions)

@@ -39,6 +39,6 @@ Here is a simple example to show how to use `tensorrt_llm.LLM` API with Llama mo
- The PyTorch backend on SBSA is incompatible with bare metal environments like Ubuntu 24.04. Please use the [PyTorch NGC Container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch) for optimal support on SBSA platforms.

-## Prototype Feature
+## Prototype Features

-- [AutoDeploy: Seamless Model Deployment from PyTorch to TRT-LLM](./torch/auto_deploy/auto-deploy.md)
+- [AutoDeploy: Seamless Model Deployment from PyTorch to TensorRT-LLM](./torch/auto_deploy/auto-deploy.md)

docs/source/torch/auto_deploy/advanced/benchmarking_with_trtllm_bench.md (7 additions, 7 deletions)

@@ -4,11 +4,11 @@ AutoDeploy is integrated with the `trtllm-bench` performance benchmarking utilit

## Getting Started

-Before benchmarking with AutoDeploy, familiarize yourself with the general `trtllm-bench` workflow and best practices by reviewing the [TensorRT-LLM benchmarking guide](../../performance/perf-benchmarking.md#running-with-the-pytorch-workflow).
+Before benchmarking with AutoDeploy, review the [TensorRT-LLM benchmarking guide](../../performance/perf-benchmarking.md#running-with-the-pytorch-workflow) to familiarize yourself with the standard `trtllm-bench` workflow and best practices.

## Basic Usage

-The AutoDeploy backend can be invoked by specifying `--backend _autodeploy` in your `trtllm-bench` command:
+Invoke the AutoDeploy backend by specifying `--backend _autodeploy` in your `trtllm-bench` command:

```bash
trtllm-bench \
@@ -19,12 +19,12 @@ trtllm-bench \
```
```{note}
-Similar to the PyTorch workflow, AutoDeploy does not require a separate `trtllm-bench build` step. The model is automatically optimized during the benchmark initialization phase.
+As in the PyTorch workflow, AutoDeploy does not require a separate `trtllm-bench build` step. The model is automatically optimized during benchmark initialization.
```
## Advanced Configuration

-For fine-tuned control over AutoDeploy's behavior during benchmarking, use the `--extra_llm_api_options` flag with a YAML configuration file:
+For more granular control over AutoDeploy's behavior during benchmarking, use the `--extra_llm_api_options` flag with a YAML configuration file:

```bash
trtllm-bench \
@@ -62,7 +62,7 @@ attn_backend: flashinfer
max_batch_size: 256
```

-Multi-gpu execution can be enabled by specifying `--tp n`, where `n` is the number of GPUs
+Enable multi-GPU execution by specifying `--tp n`, where `n` is the number of GPUs.

## Configuration Options Reference
@@ -82,12 +82,12 @@ Multi-gpu execution can be enabled by specifying `--tp n`, where `n` is the numb
| `cuda_graph_batch_sizes` | `null` | List of batch sizes for CUDA graph creation |
```{tip}
-For optimal performance with CUDA graphs, specify batch sizes that match your expected workload patterns. For example: `[1, 2, 4, 8, 16, 32, 64, 128]`
+For optimal CUDA graph performance, specify batch sizes that match your expected workload patterns. For example: `[1, 2, 4, 8, 16, 32, 64, 128]`
```
## Performance Optimization Tips

1. **Memory Management**: Set `free_mem_ratio` to 0.8-0.9 for optimal KV cache utilization
1. **Compilation Backend**: Use `torch-opt` for production workloads
1. **Attention Backend**: `flashinfer` generally provides the best performance for most models
-1. **CUDA Graphs**: Enable for batch sizes matching your production traffic patterns
+1. **CUDA Graphs**: Enable CUDA graphs for batch sizes that match your production traffic patterns.
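
Taken together, these tips map onto a single `--extra_llm_api_options` file. A minimal sketch is shown below; `free_mem_ratio`, `attn_backend`, `max_batch_size`, and `cuda_graph_batch_sizes` appear elsewhere on this page, while `compile_backend` is an assumed key name for selecting the `torch-opt` backend:

```yaml
# Illustrative AutoDeploy benchmarking config (autodeploy_config.yaml).
# compile_backend is an assumed option name; the other keys appear in this document.
compile_backend: torch-opt
free_mem_ratio: 0.9
attn_backend: flashinfer
max_batch_size: 256
cuda_graph_batch_sizes: [1, 2, 4, 8, 16, 32, 64, 128]
```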

docs/source/torch/auto_deploy/advanced/expert_configurations.md (12 additions, 15 deletions)

@@ -1,28 +1,25 @@
# Expert Configuration of LLM API

-For expert TensorRT-LLM users, we also expose the full set of `tensorrt_llm._torch.auto_deploy.llm_args.LlmArgs`
-*at your own risk* (the argument list diverges from TRT-LLM's argument list):
+For advanced TensorRT-LLM users, the full set of `tensorrt_llm._torch.auto_deploy.llm_args.LlmArgs` is exposed. Use at your own risk. The argument list may diverge from the standard TRT-LLM argument list.

-- All config fields that are used by the AutoDeploy core pipeline (i.e. the `InferenceOptimizer`) are
-_exclusively_ exposed in the `AutoDeployConfig` in `tensorrt_llm._torch.auto_deploy.llm_args`.
+- All configuration fields used by the AutoDeploy core pipeline, the `InferenceOptimizer`, are exposed exclusively in `AutoDeployConfig` in `tensorrt_llm._torch.auto_deploy.llm_args`.
Please make sure to refer to those first.
-- For expert users we expose the full set of `LlmArgs` in `tensorrt_llm._torch.auto_deploy.llm_args`
-that can be used to configure the AutoDeploy `LLM` API including runtime options.
+- For advanced users, the full set of `LlmArgs` in `tensorrt_llm._torch.auto_deploy.llm_args` can be used to configure the AutoDeploy `LLM` API, including runtime options.
- Note that some fields in the full `LlmArgs`
object are overlapping, duplicated, and/or _ignored_ in AutoDeploy, particularly arguments
pertaining to configuring the model itself since AutoDeploy's model ingestion+optimize pipeline
significantly differs from the default manual workflow in TensorRT-LLM.
- However, with the proper care the full `LlmArgs`
objects can be used to configure advanced runtime options in TensorRT-LLM.
-- Note that any valid field can be simply provided as keyword argument ("`**kwargs`") to the AutoDeploy `LLM` API.
+- Any valid field can simply be provided as a keyword argument (`**kwargs`) to the AutoDeploy `LLM` API.
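
As a brief illustration of this `**kwargs` pass-through (a sketch only; `world_size` and `max_batch_size` appear elsewhere in these docs, and any other field name should be checked against `LlmArgs`):

```python
from tensorrt_llm._torch.auto_deploy import LLM

# Sketch: runtime options forwarded directly as **kwargs to the AutoDeploy LLM API.
llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    world_size=2,        # multi-GPU execution, as in the CLI examples below
    max_batch_size=256,  # runtime option from the full LlmArgs set (assumed spelling)
)
```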
# Expert Configuration of `build_and_run_ad.py`

-For expert users, `build_and_run_ad.py` provides advanced configuration capabilities through a flexible argument parser powered by PyDantic Settings and OmegaConf. You can use dot notation for CLI arguments, provide multiple YAML configuration files, and leverage sophisticated configuration precedence rules to create complex deployment configurations.
+For advanced users, `build_and_run_ad.py` provides advanced configuration capabilities using a flexible argument parser powered by Pydantic Settings and OmegaConf. You can use dot notation for CLI arguments, provide multiple YAML configuration files, and apply sophisticated configuration precedence rules to create complex deployment configurations.

## CLI Arguments with Dot Notation

-The script supports flexible CLI argument parsing using dot notation to modify nested configurations dynamically. You can target any field in both the `ExperimentConfig` in `examples/auto_deploy/build_and_run_ad.py` and nested `AutoDeployConfig`/`LlmArgs` objects in `tensorrt_llm._torch.auto_deploy.llm_args`:
+The script supports flexible CLI argument parsing using dot notation to modify nested configurations dynamically. You can target any field in both the `ExperimentConfig` in `examples/auto_deploy/build_and_run_ad.py` and nested `AutoDeployConfig` or `LlmArgs` objects in `tensorrt_llm._torch.auto_deploy.llm_args`:

```bash
# Configure model parameters
@@ -35,7 +32,7 @@ python build_and_run_ad.py \
--args.model-kwargs.hidden-size=2048 \
--args.tokenizer-kwargs.padding-side=left

-# Configure runtime and backend settings
+# Configure runtime and backend options
python build_and_run_ad.py \
--model "TinyLlama/TinyLlama-1.1B-Chat-v1.0" \
--args.world-size=2 \
@@ -55,7 +52,7 @@ python build_and_run_ad.py \
## YAML Configuration Files

-Both `ExperimentConfig` and `AutoDeployConfig`/`LlmArgs` inherit from `DynamicYamlMixInForSettings`, enabling you to provide multiple YAML configuration files that are automatically deep-merged at runtime.
+Both `ExperimentConfig` and `AutoDeployConfig`/`LlmArgs` inherit from `DynamicYamlMixInForSettings`, which enables you to provide multiple YAML configuration files that are automatically deep-merged at runtime.

Create a YAML configuration file (e.g., `my_config.yaml`):
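
For illustration, a hypothetical `my_config.yaml` mirroring the dot-notation CLI examples above might contain (underscore key spellings are assumed):

```yaml
# Hypothetical my_config.yaml -- mirrors the CLI dot-notation examples above.
args:
  world_size: 2
  model_kwargs:
    hidden_size: 2048
  tokenizer_kwargs:
    padding_side: left
```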
-The configuration system follows a strict precedence order where higher priority sources override lower priority ones:
+The configuration system follows a precedence order in which higher priority sources override lower priority ones:

1. **CLI Arguments** (highest priority) - Direct command line arguments
1. **YAML Configs** - Files specified via `--yaml-configs` and `--args.yaml-configs`
1. **Default Settings** (lowest priority) - Built-in defaults from the config classes

-**Deep Merging**: Unlike simple overwriting, deep merging intelligently combines nested dictionaries recursively. For example:
+**Deep Merging**: Unlike simple overwriting, deep merging recursively combines nested dictionaries. For example:

```yaml
# Base config
@@ -152,7 +149,7 @@ args:
world_size: 4 # This gets added
```
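
As a compact illustration of the same merge behavior (file names and values here are hypothetical):

```yaml
# base.yaml
args:
  model_kwargs:
    hidden_size: 1024
  world_size: 2
---
# override.yaml (passed after base.yaml)
args:
  model_kwargs:
    num_hidden_layers: 12
---
# Effective merged result: nested dictionaries are combined key by key.
args:
  model_kwargs:
    hidden_size: 1024      # kept from base.yaml
    num_hidden_layers: 12  # added by override.yaml
  world_size: 2
```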

-**Nested Config Behavior**: When using nested configurations, outer YAML configs become init settings for inner objects, giving them higher precedence:
+**Nested Config Behavior**: When using nested configurations, outer YAML configuration files become initialization settings for inner objects, giving them higher precedence:

```bash
# The outer yaml-configs affects the entire ExperimentConfig
@@ -166,7 +163,7 @@ python build_and_run_ad.py \
## Built-in Default Configuration

-Both `AutoDeployConfig` and `LlmArgs` classes automatically load a built-in `default.yaml` configuration file that provides sensible defaults for the AutoDeploy inference optimizer pipeline. This file is specified in the `_get_config_dict()` function in `tensorrt_llm._torch.auto_deploy.llm_args` and defines default transform configurations for graph optimization stages.
+Both `AutoDeployConfig` and `LlmArgs` classes automatically load a built-in `default.yaml` configuration file that provides defaults for the AutoDeploy inference optimizer pipeline. This file is specified in the `_get_config_dict()` function in `tensorrt_llm._torch.auto_deploy.llm_args` and defines default transform configurations for graph optimization stages.

The built-in defaults are automatically merged with your configurations at the lowest priority level, ensuring that your custom settings always override the defaults. You can inspect the current default configuration to understand the baseline transform pipeline:
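
One way to inspect those defaults is sketched below; the construction arguments and the dump call are assumptions, since only the module path and `_get_config_dict()` are named here:

```python
# Sketch: print the effective AutoDeploy defaults (exact API details are assumptions).
from tensorrt_llm._torch.auto_deploy.llm_args import AutoDeployConfig

cfg = AutoDeployConfig(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
print(cfg.model_dump())  # includes transform settings merged from the built-in default.yaml
```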

docs/source/torch/auto_deploy/advanced/workflow.md (3 additions, 5 deletions)

@@ -1,8 +1,8 @@
### Incorporating `auto_deploy` into your own workflow

-AutoDeploy can be seamlessly integrated into your existing workflows using TRT-LLM's LLM high-level API. This section provides a blueprint for configuring and invoking AutoDeploy within your custom applications.
+AutoDeploy can be seamlessly integrated into existing workflows using TRT-LLM's LLM high-level API. This section provides an example of configuring and invoking AutoDeploy in custom applications.

-Here is an example of how you can build an LLM object with AutoDeploy integration:
+The following example demonstrates how to build an LLM object with AutoDeploy integration:

```
from tensorrt_llm._torch.auto_deploy import LLM
@@ -27,6 +27,4 @@ llm = LLM(
```

-Please consult the AutoDeploy `LLM` API in `tensorrt_llm._torch.auto_deploy.llm` and the
-`AutoDeployConfig` class in `tensorrt_llm._torch.auto_deploy.llm_args`
-for more detail on how AutoDeploy is configured via the `**kwargs` of the `LLM` API.
+For more information about configuring AutoDeploy via the `LLM` API using `**kwargs`, see the AutoDeploy `LLM` API in `tensorrt_llm._torch.auto_deploy.llm` and the `AutoDeployConfig` class in `tensorrt_llm._torch.auto_deploy.llm_args`.
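
Once constructed, the AutoDeploy-backed object behaves like any other LLM-API instance. A brief usage sketch (method names follow the standard TRT-LLM LLM API; the prompt and constructor arguments are illustrative):

```python
from tensorrt_llm._torch.auto_deploy import LLM

# Sketch: build the LLM as in the example above, then generate as usual.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
outputs = llm.generate(["What is the capital of France?"])
print(outputs[0].outputs[0].text)  # text of the first completion for the first prompt
```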

docs/source/torch/auto_deploy/auto-deploy.md (11 additions, 12 deletions)

@@ -1,22 +1,21 @@
# AutoDeploy
```{note}
-Note:
-This project is in active development and is currently in a prototype stage. The code is experimental, subject to change, and may include backward-incompatible updates. While we strive for correctness, we provide no guarantees regarding functionality, stability, or reliability.
+This project is under active development and is currently in a prototype stage. The code is experimental, subject to change, and may include backward-incompatible updates. While we strive for correctness, there are no guarantees regarding functionality, stability, or reliability.
```

-<h4> Seamless Model Deployment from PyTorch to TRT-LLM</h4>
+### Seamless Model Deployment from PyTorch to TensorRT-LLM

-AutoDeploy is a prototype designed to simplify and accelerate the deployment of PyTorch models, including off-the-shelf models like those from HuggingFace transformers library, to TensorRT-LLM.
+AutoDeploy is a prototype designed to simplify and accelerate the deployment of PyTorch models, including off-the-shelf models such as those from the Hugging Face Transformers library, to TensorRT-LLM.
<sub><em>AutoDeploy overview and relation with TensorRT-LLM's LLM API</em></sub>

-AutoDeploy provides an alternative path for deploying models using the LLM API that does not require users to rewrite the source model (e.g., HuggingFace Transformers models) or manually implement various inference optimizations such as KV-caches, multi-GPU parallelism, quantization, etc. Instead, AutoDeploy extracts a computation graph from the source model and applies inference optimizations through a series of automated graph transformations. AutoDeploy generates an inference-optimized graph that can be directly executed in the TensorRT-LLM PyTorch runtime and leverages various runtime optimizations including in-flight batching, paging, and overlap scheduling.
+AutoDeploy provides an alternative method for deploying models using the LLM API without requiring code changes to the source model (for example, Hugging Face Transformers models) or manual implementation of inference optimizations, such as KV-caches, multi-GPU parallelism, or quantization. Instead, AutoDeploy extracts a computation graph from the source model and applies inference optimizations through a series of automated graph transformations. AutoDeploy generates an inference-optimized graph that can be directly executed in the TensorRT-LLM PyTorch runtime and leverages various runtime optimizations including in-flight batching, paging, and overlap scheduling.

-### **Key Features:**
+### Key Features

-- **Seamless Model Translation:** Automatically converts PyTorch/Hugging Face models to TRT-LLM without manual rewrites.
+- **Seamless Model Translation:** Automatically converts PyTorch/Hugging Face models to TensorRT-LLM without manual rewrites.
- **Unified Model Definition:** Maintain a single source of truth with your original PyTorch/Hugging Face model.
- **Optimized Inference:** Built-in transformations for sharding, quantization, KV-cache integration, MHA fusion, and CudaGraph optimization.
- **Immediate Deployment:** Day-0 support for models with continuous performance enhancements.
@@ -26,7 +25,7 @@ AutoDeploy provides an alternative path for deploying models using the LLM API t
1. **Install AutoDeploy:**

-AutoDeploy is accessible through TRT-LLM installation.
+AutoDeploy is included with the TRT-LLM installation.
@@ -36,9 +35,9 @@ You can refer to [TRT-LLM installation guide](../../installation/linux.md) for m
2. **Run Llama Example:**

-You are ready to run an in-framework LLama Demo now.
+You are now ready to run an in-framework Llama demo.

-The general entrypoint to run the auto-deploy demo is the `build_and_run_ad.py` script, Checkpoints are loaded directly from Huggingface (HF) or a local HF-like directory:
+The general entry point for running the AutoDeploy demo is the `build_and_run_ad.py` script. Checkpoints are loaded directly from Hugging Face (HF) or a local HF-like directory:
```bash
cd examples/auto_deploy
@@ -51,15 +50,15 @@ AutoDeploy streamlines the model deployment process through an automated workflo
The exported graph then undergoes a series of automated transformations, including graph sharding, KV-cache insertion, and GEMM fusion, to optimize model performance. After these transformations, the graph is compiled using one of the supported compile backends (like `torch-opt`), followed by deploying it via the TensorRT-LLM runtime.

-- [Supported Matrix](support_matrix.md)
+- [Support Matrix](support_matrix.md)

## Advanced Usage

- [Example Run Script](./advanced/example_run.md)
- [Logging Level](./advanced/logging.md)
- [Incorporating AutoDeploy into Your Own Workflow](./advanced/workflow.md)