Commit e5d81ee

benchmarking doc + LLM api diagram

Signed-off-by: Suyog Gupta <[email protected]>

1 parent: 6ef577a

File tree: 3 files changed, +103 -8 lines
ad_overview.png (new image, 209 KB): the AutoDeploy/LLM API overview diagram referenced in auto-deploy.md below.
docs/source/torch/auto_deploy/advanced/benchmarking_with_trtllm_bench.md

Lines changed: 93 additions & 0 deletions
@@ -0,0 +1,93 @@
# Benchmarking with trtllm-bench

AutoDeploy is integrated with the `trtllm-bench` performance benchmarking utility, enabling you to measure comprehensive performance metrics such as token throughput, request throughput, and latency for your AutoDeploy-optimized models.

## Getting Started

Before benchmarking with AutoDeploy, familiarize yourself with the general `trtllm-bench` workflow and best practices by reviewing the [TensorRT-LLM benchmarking guide](../../performance/perf-benchmarking.md#running-with-the-pytorch-workflow).
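The throughput examples below read requests from a dataset file (`/tmp/synthetic_128_128.txt`). If you do not already have one, a synthetic dataset can typically be generated with the `prepare_dataset.py` helper from the TensorRT-LLM repository; the script path and flags shown here are an assumption based on the general benchmarking workflow, so verify them against the guide linked above.

```bash
# Sketch: generate synthetic requests with ~128 input and ~128 output tokens.
# Script location and flags may differ across TensorRT-LLM releases.
python benchmarks/cpp/prepare_dataset.py \
    --stdout \
    --tokenizer meta-llama/Llama-3.1-8B \
    token-norm-dist \
    --input-mean 128 --input-stdev 0 \
    --output-mean 128 --output-stdev 0 \
    --num-requests 3000 > /tmp/synthetic_128_128.txt
```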
## Basic Usage

The AutoDeploy backend can be invoked by specifying `--backend _autodeploy` in your `trtllm-bench` command:

```bash
trtllm-bench \
    --model meta-llama/Llama-3.1-8B \
    throughput \
    --dataset /tmp/synthetic_128_128.txt \
    --backend _autodeploy
```

```{note}
Similar to the PyTorch workflow, AutoDeploy does not require a separate `trtllm-bench build` step. The model is automatically optimized during the benchmark initialization phase.
```
## Advanced Configuration

For fine-tuned control over AutoDeploy's behavior during benchmarking, use the `--extra_llm_api_options` flag with a YAML configuration file:

```bash
trtllm-bench \
    --model meta-llama/Llama-3.1-8B \
    throughput \
    --dataset /tmp/synthetic_128_128.txt \
    --backend _autodeploy \
    --extra_llm_api_options autodeploy_config.yaml
```

### Configuration Examples

#### Basic Performance Configuration (`autodeploy_config.yaml`)

```yaml
# Compilation backend
compile_backend: torch-opt

# Runtime engine
runtime: trtllm

# Model loading
skip_loading_weights: false

# Fraction of free memory to use for KV caches
free_mem_ratio: 0.8

# CUDA graph optimization
cuda_graph_batch_sizes: [1, 2, 4, 8, 16, 32, 64, 128, 256]

# Attention backend
attn_backend: flashinfer

# Sequence configuration
max_batch_size: 256
```
Multi-GPU execution can be enabled by specifying `--tp n`, where `n` is the number of GPUs.
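For example, a sketch of the same throughput run spread across 4 GPUs (the flag placement is illustrative; all other options are unchanged):

```bash
trtllm-bench \
    --model meta-llama/Llama-3.1-8B \
    throughput \
    --dataset /tmp/synthetic_128_128.txt \
    --backend _autodeploy \
    --extra_llm_api_options autodeploy_config.yaml \
    --tp 4
```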
## Configuration Options Reference

### Core Performance Settings

| Parameter | Default | Description |
|-----------|---------|-------------|
| `compile_backend` | `torch-compile` | Compilation backend: `torch-simple`, `torch-compile`, `torch-cudagraph`, `torch-opt` |
| `runtime` | `trtllm` | Runtime engine: `trtllm`, `demollm` |
| `free_mem_ratio` | `0.0` | Fraction of available GPU memory for KV cache (0.0-1.0) |
| `skip_loading_weights` | `false` | Skip weight loading for architecture-only benchmarks |
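As an illustration of how these settings combine, the following is a minimal sketch of a config for an architecture-only benchmark that skips weight loading (the file name and values are hypothetical, not recommendations):

```yaml
# autodeploy_arch_only.yaml (hypothetical)
compile_backend: torch-compile   # table default
runtime: trtllm                  # table default
skip_loading_weights: true       # benchmark the architecture without loading real weights
free_mem_ratio: 0.8              # give most of the free GPU memory to the KV cache
```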
### CUDA Graph Optimization

| Parameter | Default | Description |
|-----------|---------|-------------|
| `cuda_graph_batch_sizes` | `null` | List of batch sizes for CUDA graph creation |

```{tip}
For optimal performance with CUDA graphs, specify batch sizes that match your expected workload patterns. For example: `[1, 2, 4, 8, 16, 32, 64, 128]`
```
## Performance Optimization Tips

1. **Memory Management**: Set `free_mem_ratio` to 0.8-0.9 for optimal KV cache utilization
1. **Compilation Backend**: Use `torch-opt` for production workloads
1. **Attention Backend**: `flashinfer` generally provides the best performance for most models
1. **CUDA Graphs**: Enable CUDA graphs for batch sizes that match your production traffic patterns (see the consolidated sketch after this list)
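Putting these tips together, a tuned configuration might look like the following sketch (the values are taken from the recommendations above; adjust them for your own workload):

```yaml
# Hypothetical tuned config combining the tips above
compile_backend: torch-opt                               # tip 2: production compile backend
attn_backend: flashinfer                                 # tip 3: generally the fastest attention backend
free_mem_ratio: 0.9                                      # tip 1: upper end of the suggested 0.8-0.9 range
cuda_graph_batch_sizes: [1, 2, 4, 8, 16, 32, 64, 128]    # tip 4: match production batch sizes
```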

docs/source/torch/auto_deploy/auto-deploy.md

Lines changed: 10 additions & 8 deletions
@@ -2,22 +2,23 @@
 
 ```{note}
 Note:
-This project is in active development and is currently in an early (beta) stage. The code is experimental, subject to change, and may include backward-incompatible updates. While we strive for correctness, we provide no guarantees regarding functionality, stability, or reliability.
+This project is in active development and is currently in a prototype stage. The code is experimental, subject to change, and may include backward-incompatible updates. While we strive for correctness, we provide no guarantees regarding functionality, stability, or reliability.
 ```
 
 <h4> Seamless Model Deployment from PyTorch to TRT-LLM</h4>
 
-AutoDeploy is an experimental feature in beta stage designed to simplify and accelerate the deployment of PyTorch models, including off-the-shelf models like those from Hugging Face, to TensorRT-LLM. It automates graph transformations to integrate inference optimizations such as tensor parallelism, KV-caching and quantization. AutoDeploy supports optimized in-framework deployment, minimizing the amount of manual modification needed.
+AutoDeploy is a prototype designed to simplify and accelerate the deployment of PyTorch models, including off-the-shelf models from the Hugging Face Transformers library, to TensorRT-LLM.
 
-## Motivation & Approach
+<div align="center">
+<img src="./ad_overview.png" alt="AutoDeploy integration with LLM API" width="70%">
+<p><em>AutoDeploy overview and its relation to TensorRT-LLM's LLM API</em></p>
+</div>
 
-Deploying large language models (LLMs) can be challenging, especially when balancing ease of use with high performance. Teams need simple, intuitive deployment solutions that reduce engineering effort, speed up the integration of new models, and support rapid experimentation without compromising performance.
-
-AutoDeploy addresses these challenges with a streamlined, (semi-)automated pipeline that transforms in-framework PyTorch models, including Hugging Face models, into optimized inference-ready models for TRT-LLM. It simplifies deployment, optimizes models for efficient inference, and bridges the gap between simplicity and performance.
+AutoDeploy provides an alternative path for deploying models with the LLM API that does not require users to rewrite the source model (e.g., Hugging Face Transformers models) or manually implement inference optimizations such as KV caches, multi-GPU parallelism, and quantization. Instead, AutoDeploy extracts a computation graph from the source model and applies inference optimizations through a series of automated graph transformations. The resulting inference-optimized graph can be executed directly in the TensorRT-LLM PyTorch runtime and leverages runtime optimizations including in-flight batching, paging, and overlap scheduling.
 
 ### **Key Features:**
 
-- **Seamless Model Transition:** Automatically converts PyTorch/Hugging Face models to TRT-LLM without manual rewrites.
+- **Seamless Model Translation:** Automatically converts PyTorch/Hugging Face models to TRT-LLM without manual rewrites.
 - **Unified Model Definition:** Maintain a single source of truth with your original PyTorch/Hugging Face model.
 - **Optimized Inference:** Built-in transformations for sharding, quantization, KV-cache integration, MHA fusion, and CudaGraph optimization.
 - **Immediate Deployment:** Day-0 support for models with continuous performance enhancements.
@@ -50,7 +51,7 @@ python build_and_run_ad.py --model "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
 
 AutoDeploy streamlines the model deployment process through an automated workflow designed for efficiency and performance. The workflow begins with a PyTorch model, which is exported using `torch.export` to generate a standard Torch graph. This graph contains core PyTorch ATen operations alongside custom attention operations, determined by the attention backend specified in the configuration.
 
-The exported graph then undergoes a series of automated transformations, including graph sharding, KV-cache insertion, and GEMM fusion, to optimize model performance. After these transformations, the graph is compiled using one of the supported compile backends (like `torch-opt`), followed by deploying it via the TRT-LLM runtime.
+The exported graph then undergoes a series of automated transformations, including graph sharding, KV-cache insertion, and GEMM fusion, to optimize model performance. After these transformations, the graph is compiled with one of the supported compile backends (such as `torch-opt`) and then deployed via the TensorRT-LLM runtime.
 
 - [Supported Matrix](support_matrix.md)
 
@@ -60,6 +61,7 @@ The exported graph then undergoes a series of automated transformations, includi
 - [Logging Level](./advanced/logging.md)
 - [Incorporating AutoDeploy into Your Own Workflow](./advanced/workflow.md)
 - [Expert Configurations](./advanced/expert_configurations.md)
+- [Performance Benchmarking](./advanced/benchmarking_with_trtllm_bench.md)
 
 ## Roadmap
 