Commit c22a283

Merge branch 'main' into fix/circular_import_with_torch_models
2 parents: 692960c + 9778788

File tree

15 files changed: +868 −54 lines

.github/CODEOWNERS

Lines changed: 28 additions & 23 deletions

```diff
@@ -6,13 +6,39 @@
 # Without approval from a member of this team, PRs cannot be merged to release branches.
 # * @NVIDIA/trt-llm-release-branch-approval

+## TensorRT-LLM Infra
+### CI
+/jenkins @NVIDIA/trt-llm-ci-infra-devs @NVIDIA/trt-llm-infra-devs
+### Setup
+/docker @NVIDIA/trt-llm-setup-infra-devs @NVIDIA/trt-llm-infra-devs
+### Github workflows
+/.github @NVIDIA/trt-llm-gh-workflows-infra-devs @NVIDIA/trt-llm-infra-devs
+/.coderabbit.yaml @NVIDIA/trt-llm-gh-workflows-infra-devs @NVIDIA/trt-llm-infra-devs
+
+## TensorRT-LLM - Docs
+/docs @NVIDIA/trt-llm-doc-owners
+
+## Examples
+/examples @NVIDIA/trt-llm-doc-owners
+
+## TensorRT-LLM - Triton backend
+/triton_backend @NVIDIA/trt-llm-triton-backend-devs
+
 # TensorRT-LLM Pytorch backend
 /tensorrt_llm/_torch @NVIDIA/trt-llm-torch-devs
+
+## TensorRT-LLM Pytorch - Modules
+/tensorrt_llm/_torch/modules @NVIDIA/trt-llm-torch-modules
+
+## TensorRT-LLM Pytorch Models
+/tensorrt_llm/_torch/models @NVIDIA/trt-llm-torch-models-devs
+/examples/models @NVIDIA/trt-llm-torch-models-devs @NVIDIA/trt-llm-doc-owners
+
 ## TensorRT-LLM Pytorch backend - runtime
 /tensorrt_llm/_torch/pyexecutor @NVIDIA/trt-llm-torch-runtime-devs
 ## TensorRT-LLM Pytorch backend - AutoDeploy flow
 /tensorrt_llm/_torch/auto_deploy @NVIDIA/trt-llm-torch-autodeploy-devs
-/tensorrt_llm/examples/auto_deploy @NVIDIA/trt-llm-torch-autodeploy-devs
+/examples/auto_deploy @NVIDIA/trt-llm-torch-autodeploy-devs @NVIDIA/trt-llm-doc-owners

 ## TensorRT-LLM Pytorch - Speculative Decoding
 /tensorrt_llm/_torch/speculative @NVIDIA/trt-llm-torch-spec-decoding
@@ -31,12 +57,6 @@
 /tensorrt_llm/_torch/attention_backend @NVIDIA/trt-llm-torch-attention-devs
 /tensorrt_llm/_torch/modules/attention.py @NVIDIA/trt-llm-torch-attention-devs

-## TensorRT-LLM Pytorch - Modules
-/tensorrt_llm/_torch/modules @NVIDIA/trt-llm-torch-modules
-
-
-## TensorRT-LLM Pytorch Models
-/tensorrt_llm/_torch/models @NVIDIA/trt-llm-torch-models-devs

 ### TensorRT-LLM Pytorch - Models - Gemma
 /tensorrt_llm/_torch/models/modeling_gemma3.py @NVIDIA/trt-llm-torch-models-gemma-devs @NVIDIA/trt-llm-torch-models-devs
@@ -108,8 +128,6 @@
 /cpp/tensorrt_llm/runtime/loraUtils.cpp @NVIDIA/trt-llm-torch-peft
 /cpp/tensorrt_llm/runtime/loraUtils.h @NVIDIA/trt-llm-torch-peft

-## TensorRT-LLM - Triton backend
-/triton_backend @NVIDIA/trt-llm-triton-backend-devs

 ## TensorRT-LLM trtllm-bench Reviewers
 /tensorrt_llm/bench @NVIDIA/trtllm-bench-reviewers
@@ -121,7 +139,7 @@ docs/source/performance/perf-benchmarking.md @NVIDIA/trtllm-bench-reviewers
 /tensorrt_llm/executor @NVIDIA/trt-llm-llmapi-devs

 ## TensorRT-LLM LLM Disaggregated
-/examples/disaggregated @NVIDIA/trt-llm-disagg-devs
+/examples/disaggregated @NVIDIA/trt-llm-disagg-devs @NVIDIA/trt-llm-doc-owners
 /tensorrt_llm/disaggregated_params.py @NVIDIA/trt-llm-disagg-devs
 /tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py @NVIDIA/trt-llm-disagg-devs
 /cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp @NVIDIA/trt-llm-disagg-devs
@@ -134,19 +152,6 @@ docs/source/performance/perf-benchmarking.md @NVIDIA/trtllm-bench-reviewers
 /cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp @NVIDIA/trt-llm-disagg-devs
 /cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.h @NVIDIA/trt-llm-disagg-devs

-## TensorRT-LLM Infra
-
-### CI
-/jenkins @NVIDIA/trt-llm-ci-infra-devs @NVIDIA/trt-llm-infra-devs
-### Setup
-/docker @NVIDIA/trt-llm-setup-infra-devs @NVIDIA/trt-llm-infra-devs
-### Github workflows
-/tensorrt_llm/.github @NVIDIA/trt-llm-gh-workflows-infra-devs @NVIDIA/trt-llm-infra-devs
-/tensorrt_llm/.coderabbit.yaml @NVIDIA/trt-llm-gh-workflows-infra-devs @NVIDIA/trt-llm-infra-devs
-
-## TensorRT-LLM - Docs
-/docs @NVIDIA/trt-llm-doc-owners
-/examples @NVIDIA/trt-llm-doc-owners

 # The rule below requires that any PR modifying public APIs must be approved by at least one member
 # of the NVIDIA/trt-llm-committed-api-review-committee or NVIDIA/trt-llm-noncommitted-api-review-committee team.
```
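One detail worth noting about this reordering: when several CODEOWNERS patterns match the same path, GitHub applies the last matching rule. Moving the broad `/examples` rule above the more specific `/examples/auto_deploy` and `/examples/models` rules therefore lets the specialized teams take precedence for their subtrees. A minimal Python sketch of this last-match-wins behavior (simplified to directory-prefix matching; real CODEOWNERS patterns also support globs):

```python
def owners_for(path, rules):
    """Return the owners of the LAST rule matching `path` (GitHub's precedence)."""
    owners = []
    for pattern, rule_owners in rules:  # rules appear in file order; later wins
        prefix = pattern.strip("/")
        # Treat each pattern as a directory prefix anchored at the repo root.
        if path == prefix or path.startswith(prefix + "/"):
            owners = rule_owners
    return owners

# Rule order as in the updated CODEOWNERS: broad /examples first, specific later.
rules = [
    ("/examples", ["@NVIDIA/trt-llm-doc-owners"]),
    ("/examples/auto_deploy",
     ["@NVIDIA/trt-llm-torch-autodeploy-devs", "@NVIDIA/trt-llm-doc-owners"]),
]

print(owners_for("examples/auto_deploy/build_and_run_ad.py", rules))
# → ['@NVIDIA/trt-llm-torch-autodeploy-devs', '@NVIDIA/trt-llm-doc-owners']
```

Had the broad rule come last, it would have overridden the AutoDeploy-specific ownership for every path under `/examples`.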

docs/source/media/ad_overview.png

Binary file added (209 KB)

docs/source/torch.md

Lines changed: 4 additions & 0 deletions

```diff
@@ -38,3 +38,7 @@ Here is a simple example to show how to use `tensorrt_llm.LLM` API with Llama mo
 ## Known Issues

 - The PyTorch backend on SBSA is incompatible with bare metal environments like Ubuntu 24.04. Please use the [PyTorch NGC Container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch) for optimal support on SBSA platforms.
+
+## Prototype Features
+
+- [AutoDeploy: Seamless Model Deployment from PyTorch to TensorRT-LLM](./torch/auto_deploy/auto-deploy.md)
```
Lines changed: 93 additions & 0 deletions (new file)

# Benchmarking with trtllm-bench

AutoDeploy is integrated with the `trtllm-bench` performance benchmarking utility, enabling you to measure comprehensive performance metrics such as token throughput, request throughput, and latency for your AutoDeploy-optimized models.

## Getting Started

Before benchmarking with AutoDeploy, review the [TensorRT-LLM benchmarking guide](../../performance/perf-benchmarking.md#running-with-the-pytorch-workflow) to familiarize yourself with the standard trtllm-bench workflow and best practices.

## Basic Usage

Invoke the AutoDeploy backend by specifying `--backend _autodeploy` in your `trtllm-bench` command:

```bash
trtllm-bench \
    --model meta-llama/Llama-3.1-8B \
    throughput \
    --dataset /tmp/synthetic_128_128.txt \
    --backend _autodeploy
```

```{note}
As in the PyTorch workflow, AutoDeploy does not require a separate `trtllm-bench build` step. The model is automatically optimized during benchmark initialization.
```

## Advanced Configuration

For more granular control over AutoDeploy's behavior during benchmarking, use the `--extra_llm_api_options` flag with a YAML configuration file:

```bash
trtllm-bench \
    --model meta-llama/Llama-3.1-8B \
    throughput \
    --dataset /tmp/synthetic_128_128.txt \
    --backend _autodeploy \
    --extra_llm_api_options autodeploy_config.yaml
```

### Configuration Examples

#### Basic Performance Configuration (`autodeploy_config.yaml`)

```yaml
# Compilation backend
compile_backend: torch-opt

# Runtime engine
runtime: trtllm

# Model loading
skip_loading_weights: false

# Fraction of free memory to use for kv-caches
free_mem_ratio: 0.8

# CUDA Graph optimization
cuda_graph_batch_sizes: [1, 2, 4, 8, 16, 32, 64, 128, 256]

# Attention backend
attn_backend: flashinfer

# Sequence configuration
max_batch_size: 256
```

Enable multi-GPU execution by specifying `--tp n`, where `n` is the number of GPUs.
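For example, a two-GPU tensor-parallel run of the same benchmark might look like the following (a sketch; as with the other flags, `--tp` is passed alongside the benchmark options):

```bash
trtllm-bench \
    --model meta-llama/Llama-3.1-8B \
    throughput \
    --dataset /tmp/synthetic_128_128.txt \
    --backend _autodeploy \
    --tp 2
```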
66+
67+
## Configuration Options Reference
68+
69+
### Core Performance Settings
70+
71+
| Parameter | Default | Description |
72+
|-----------|---------|-------------|
73+
| `compile_backend` | `torch-compile` | Compilation backend: `torch-simple`, `torch-compile`, `torch-cudagraph`, `torch-opt` |
74+
| `runtime` | `trtllm` | Runtime engine: `trtllm`, `demollm` |
75+
| `free_mem_ratio` | `0.0` | Fraction of available GPU memory for KV cache (0.0-1.0) |
76+
| `skip_loading_weights` | `false` | Skip weight loading for architecture-only benchmarks |
77+
78+
### CUDA Graph Optimization
79+
80+
| Parameter | Default | Description |
81+
|-----------|---------|-------------|
82+
| `cuda_graph_batch_sizes` | `null` | List of batch sizes for CUDA graph creation |
83+
84+
```{tip}
85+
For optimal CUDA graph performance, specify batch sizes that match your expected workload patterns. For example: `[1, 2, 4, 8, 16, 32, 64, 128]`
86+
```
87+
88+
## Performance Optimization Tips
89+
90+
1. **Memory Management**: Set `free_mem_ratio` to 0.8-0.9 for optimal KV cache utilization
91+
1. **Compilation Backend**: Use `torch-opt` for production workloads
92+
1. **Attention Backend**: `flashinfer` generally provides the best performance for most models
93+
1. **CUDA Graphs**: Enable CUDA graphs for batch sizes that match your production traffic patterns.
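Taken together, the four tips above translate into a configuration fragment along these lines (values are illustrative assumptions; tune the memory ratio and batch sizes for your own hardware and traffic):

```yaml
# Illustrative tuned config for --extra_llm_api_options (values are assumptions)
compile_backend: torch-opt                     # production-grade compilation
attn_backend: flashinfer                       # generally the fastest attention backend
free_mem_ratio: 0.9                            # aggressive KV cache allocation
cuda_graph_batch_sizes: [1, 2, 4, 8, 16, 32]   # match expected traffic patterns
```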
Lines changed: 49 additions & 0 deletions (new file)

# Example Run Script

To build and run the AutoDeploy example, use the `examples/auto_deploy/build_and_run_ad.py` script:

```bash
cd examples/auto_deploy
python build_and_run_ad.py --model "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
```

You can configure your experiment with various options. Use the `-h/--help` flag to see them:

```bash
python build_and_run_ad.py --help
```

The following is a non-exhaustive list of common configuration options:

| Configuration Key | Description |
|-------------------|-------------|
| `--model` | The HF model card or path to a HF checkpoint folder |
| `--args.model-factory` | The model factory implementation to use (`"AutoModelForCausalLM"`, ...) |
| `--args.skip-loading-weights` | Only load the architecture, not the weights |
| `--args.model-kwargs` | Extra kwargs passed to the model initializer in the model factory |
| `--args.tokenizer-kwargs` | Extra kwargs passed to the tokenizer initializer in the model factory |
| `--args.world-size` | The number of GPUs used for auto-sharding the model |
| `--args.runtime` | The type of engine to use during runtime (`"demollm"` or `"trtllm"`) |
| `--args.compile-backend` | How to compile the graph at the end |
| `--args.attn-backend` | The kernel implementation for attention |
| `--args.mla-backend` | The implementation for multi-head latent attention |
| `--args.max-seq-len` | Maximum sequence length for inference/cache |
| `--args.max-batch-size` | Maximum dimension for the statically allocated KV cache |
| `--args.attn-page-size` | Page size for attention |
| `--prompt.batch-size` | Number of queries to generate |
| `--benchmark.enabled` | Whether to run the built-in benchmark (true/false) |

For default values and additional configuration options, refer to the `ExperimentConfig` class in the `examples/auto_deploy/build_and_run_ad.py` file.

The following is a more complete example of using the script:

```bash
cd examples/auto_deploy
python build_and_run_ad.py \
    --model "TinyLlama/TinyLlama-1.1B-Chat-v1.0" \
    --args.world-size 2 \
    --args.runtime "demollm" \
    --args.compile-backend "torch-compile" \
    --args.attn-backend "flashinfer" \
    --benchmark.enabled True
```
