AutoDeploy is integrated with the `trtllm-bench` performance benchmarking utility, enabling you to measure performance metrics such as token throughput, request throughput, and latency for AutoDeploy-optimized models.
## Getting Started
Before benchmarking with AutoDeploy, familiarize yourself with the general `trtllm-bench` workflow and best practices by reviewing the [TensorRT-LLM benchmarking guide](../../performance/perf-benchmarking.md#running-with-the-pytorch-workflow).
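A quick way to confirm that the utility is available in your environment, and to list the subcommands and flags used throughout this guide, is its built-in help:

```bash
# Print trtllm-bench's subcommands (e.g., throughput) and global options.
trtllm-bench --help
```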
## Basic Usage
The AutoDeploy backend can be invoked by specifying `--backend _autodeploy` in your `trtllm-bench` command:
```bash
trtllm-bench \
    --model meta-llama/Llama-3.1-8B \
    throughput \
    --dataset /tmp/synthetic_128_128.txt \
    --backend _autodeploy
```
```{note}
Similar to the PyTorch workflow, AutoDeploy does not require a separate `trtllm-bench build` step. The model is automatically optimized during the benchmark initialization phase.
```
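The `--dataset` file used above is a synthetic dataset of the kind described in the benchmarking guide linked under Getting Started. The command below is a sketch under that assumption; the script path and flags may differ between releases, so verify them against the guide before running it:

```bash
# Sketch only: the dataset-preparation script and its flags follow the general
# trtllm-bench benchmarking guide and may vary by release -- verify before use.
python benchmarks/cpp/prepare_dataset.py \
    --tokenizer meta-llama/Llama-3.1-8B \
    --stdout token-norm-dist \
    --input-mean 128 --output-mean 128 \
    --input-stdev 0 --output-stdev 0 \
    --num-requests 1000 > /tmp/synthetic_128_128.txt
```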
## Advanced Configuration
For fine-tuned control over AutoDeploy's behavior during benchmarking, use the `--extra_llm_api_options` flag with a YAML configuration file.
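The keys shown in the sketch below (`compile_backend`, `attn_backend`, `cuda_graph_batch_sizes`) are illustrative assumptions rather than an authoritative list; consult the AutoDeploy documentation for the exact option names. The example writes a config file and passes it to the benchmark:

```bash
# The YAML keys below are illustrative assumptions -- verify the exact option
# names in the AutoDeploy documentation before relying on them.
cat > autodeploy_config.yaml <<'EOF'
compile_backend: torch-opt            # compile backend (torch-opt is mentioned later in this doc)
attn_backend: flashinfer              # attention kernel backend (assumed key and value)
cuda_graph_batch_sizes: [1, 2, 4, 8]  # CUDA graph capture sizes (assumed key)
EOF

trtllm-bench \
    --model meta-llama/Llama-3.1-8B \
    throughput \
    --dataset /tmp/synthetic_128_128.txt \
    --backend _autodeploy \
    --extra_llm_api_options autodeploy_config.yaml
```

The file name `autodeploy_config.yaml` is arbitrary; any path passed to `--extra_llm_api_options` works.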
```{note}
This project is in active development and is currently in a prototype stage. The code is experimental, subject to change, and may include backward-incompatible updates. While we strive for correctness, we provide no guarantees regarding functionality, stability, or reliability.
```
<h4>Seamless Model Deployment from PyTorch to TRT-LLM</h4>

AutoDeploy is a prototype designed to simplify and accelerate the deployment of PyTorch models, including off-the-shelf models like those from the Hugging Face Transformers library, to TensorRT-LLM.

<div align="center">
<img src="./ad_overview.png" alt="AutoDeploy integration with LLM API" width="70%">
<p><em>AutoDeploy overview and relation with TensorRT-LLM's LLM API</em></p>
</div>

AutoDeploy provides an alternative path for deploying models using the LLM API that does not require users to rewrite the source model (e.g., Hugging Face Transformers models) or manually implement various inference optimizations such as KV-caches, multi-GPU parallelism, quantization, etc. Instead, AutoDeploy extracts a computation graph from the source model and applies inference optimizations through a series of automated graph transformations. AutoDeploy generates an inference-optimized graph that can be directly executed in the TensorRT-LLM PyTorch runtime and leverages various runtime optimizations, including in-flight batching, paging, and overlap scheduling.
### **Key Features:**
- **Seamless Model Translation:** Automatically converts PyTorch/Hugging Face models to TRT-LLM without manual rewrites.
- **Unified Model Definition:** Maintain a single source of truth with your original PyTorch/Hugging Face model.
- **Optimized Inference:** Built-in transformations for sharding, quantization, KV-cache integration, MHA fusion, and CudaGraph optimization.
- **Immediate Deployment:** Day-0 support for models with continuous performance enhancements.

AutoDeploy streamlines the model deployment process through an automated workflow designed for efficiency and performance. The workflow begins with a PyTorch model, which is exported using `torch.export` to generate a standard Torch graph. This graph contains core PyTorch ATen operations alongside custom attention operations, determined by the attention backend specified in the configuration.

The exported graph then undergoes a series of automated transformations, including graph sharding, KV-cache insertion, and GEMM fusion, to optimize model performance. After these transformations, the graph is compiled using one of the supported compile backends (like `torch-opt`), followed by deploying it via the TensorRT-LLM runtime.
- [Supported Matrix](support_matrix.md)
- [Logging Level](./advanced/logging.md)
- [Incorporating AutoDeploy into Your Own Workflow](./advanced/workflow.md)