[#7136][feat] trtllm-serve + autodeploy integration #7141
Merged: suyoggupta merged 8 commits into NVIDIA:main from nv-auto-deploy:sg/trtllm-serve-autodeploy on Aug 22, 2025.

Changes shown from 6 of 8 commits. All commits by suyoggupta:

- e9a1f59 trtllm-serve + autodeploy integration
- 14a884e revert debug changes
- bb6c1c2 Disable cache block reuse for AutoDeploy
- 4f91d4a Merge branch 'main' into sg/trtllm-serve-autodeploy
- b3955bc review comments + docs
- b41071f lint
- 556167c fix typo in docs
- 8b1357d Merge branch 'main' into sg/trtllm-serve-autodeploy
docs/source/torch/auto_deploy/advanced/serving_with_trtllm_serve.md (new file, 77 additions, 0 deletions)

# Serving with trtllm-serve

AutoDeploy integrates with the OpenAI-compatible `trtllm-serve` CLI so you can expose AutoDeploy-optimized models over HTTP without writing server code. This page shows how to launch the server with the AutoDeploy backend, configure it via YAML, and validate the deployment with a simple request.

## Quick start

Launch `trtllm-serve` with the AutoDeploy backend by setting `--backend _autodeploy`:

```bash
trtllm-serve \
  meta-llama/Llama-3.1-8B-Instruct \
  --backend _autodeploy
```

- `model`: HF name or local path
- `--backend _autodeploy`: uses the AutoDeploy runtime
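
When scripting against the server, you can block until it comes up before sending requests. Below is a minimal readiness poll in Python; it assumes the default host and port (`localhost:8000`) and the `/health` route the server exposes, so adjust the URL if your setup differs:

```python
# Minimal readiness poll: block until the server answers on /health.
# Assumes the default localhost:8000; adjust url if you changed --host/--port.
import time
import urllib.error
import urllib.request

def wait_for_server(url: str = "http://localhost:8000/health", timeout_s: float = 600.0) -> None:
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return  # server is up and ready for requests
        except (urllib.error.URLError, OSError):
            pass  # not listening yet; keep polling
        time.sleep(5)
    raise TimeoutError(f"server at {url} not ready after {timeout_s}s")

if __name__ == "__main__":
    wait_for_server()
    print("server is ready")
```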

Once the server is ready, test it with an OpenAI-compatible request:

```bash
curl -s http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Where is New York? Tell me in a single sentence."}
    ],
    "max_tokens": 32
  }'
```
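
Because the endpoint is OpenAI-compatible, the same request can be issued from any OpenAI client. Here is a sketch using the `openai` Python package (an assumption: `pip install openai`; the `api_key` is a placeholder, since the local server does not authenticate requests by default):

```python
# Query the local trtllm-serve endpoint with the OpenAI Python client.
# base_url points at the local server; api_key is a placeholder because
# the local server does not check it by default.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Where is New York? Tell me in a single sentence."},
    ],
    max_tokens=32,
)
print(response.choices[0].message.content)
```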

## Configuration via YAML

Use `--extra_llm_api_options` to supply a YAML file that augments or overrides server and runtime settings.

```bash
trtllm-serve \
  meta-llama/Llama-3.1-8B \
  --backend _autodeploy \
  --extra_llm_api_options autodeploy_config.yaml
```

Example `autodeploy_config.yaml`:

```yaml
# Compilation backend for AutoDeploy
compile_backend: torch-opt  # options: torch-simple, torch-compile, torch-cudagraph, torch-opt

# Runtime engine
runtime: trtllm  # options: trtllm, demollm

# Model loading
skip_loading_weights: false  # set true for architecture-only perf runs

# KV cache memory
free_mem_ratio: 0.8  # fraction of free GPU mem for KV cache

# CUDA graph optimization
cuda_graph_batch_sizes: [1, 2, 4, 8, 16, 32, 64]

# Attention backend
attn_backend: flashinfer  # recommended for best performance
```
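
Before launching, it can be handy to sanity-check the YAML against the allowed values listed in the comments above. A small sketch, assuming PyYAML (`pip install pyyaml`) and the key names from the example config:

```python
# Sanity-check autodeploy_config.yaml before launching.
# Assumes PyYAML; the allowed-value sets come from the comments
# in the example config above.
import yaml

with open("autodeploy_config.yaml") as f:
    cfg = yaml.safe_load(f)

assert cfg["compile_backend"] in {"torch-simple", "torch-compile", "torch-cudagraph", "torch-opt"}
assert cfg["runtime"] in {"trtllm", "demollm"}
assert 0.0 < cfg["free_mem_ratio"] <= 1.0  # fraction of free GPU memory
print("config looks sane:", cfg)
```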

## Limitations and tips

- KV cache block reuse is disabled automatically for the AutoDeploy backend.
- The AutoDeploy backend does not yet support disaggregated serving; support is in progress.
- For best performance:
  - Prefer `compile_backend: torch-opt`
  - Use `attn_backend: flashinfer`
  - Set realistic `cuda_graph_batch_sizes` that match your expected traffic
  - Tune `free_mem_ratio` to 0.8–0.9 (see the sizing sketch below)
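
To see what `free_mem_ratio` controls, the back-of-the-envelope sketch below estimates the resulting KV cache budget; the 80 GB card and 16 GB model footprint are illustrative assumptions, not measurements:

```python
# Rough KV cache budget estimate; all inputs are illustrative assumptions.
total_gpu_mem_gb = 80.0   # e.g. an 80 GB accelerator
used_by_model_gb = 16.0   # weights + activations after load (assumed)
free_mem_ratio = 0.9      # upper end of the recommended 0.8-0.9 range

free_gb = total_gpu_mem_gb - used_by_model_gb
kv_cache_budget_gb = free_mem_ratio * free_gb
print(f"KV cache budget: ~{kv_cache_budget_gb:.1f} GB")  # ~57.6 GB
```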

## See also

- [AutoDeploy overview](../auto-deploy.md)
- [Benchmarking with trtllm-bench](./benchmarking_with_trtllm_bench.md)