docs/source/torch.md (2 additions, 2 deletions)

@@ -39,6 +39,6 @@ Here is a simple example to show how to use `tensorrt_llm.LLM` API with Llama mo
- The PyTorch backend on SBSA is incompatible with bare metal environments like Ubuntu 24.04. Please use the [PyTorch NGC Container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch) for optimal support on SBSA platforms.

-## Prototype Feature
+## Prototype Features

-- [AutoDeploy: Seamless Model Deployment from PyTorch to TRT-LLM](./torch/auto_deploy/auto-deploy.md)
+- [AutoDeploy: Seamless Model Deployment from PyTorch to TensorRT-LLM](./torch/auto_deploy/auto-deploy.md)

docs/source/torch/auto_deploy/advanced/benchmarking_with_trtllm_bench.md (7 additions, 7 deletions)

@@ -4,11 +4,11 @@ AutoDeploy is integrated with the `trtllm-bench` performance benchmarking utilit

## Getting Started

-Before benchmarking with AutoDeploy, familiarize yourself with the general `trtllm-bench` workflow and best practices by reviewing the [TensorRT-LLM benchmarking guide](../../performance/perf-benchmarking.md#running-with-the-pytorch-workflow).
+Before benchmarking with AutoDeploy, review the [TensorRT-LLM benchmarking guide](../../performance/perf-benchmarking.md#running-with-the-pytorch-workflow) to familiarize yourself with the standard `trtllm-bench` workflow and best practices.

## Basic Usage

-The AutoDeploy backend can be invoked by specifying `--backend _autodeploy` in your `trtllm-bench` command:
+Invoke the AutoDeploy backend by specifying `--backend _autodeploy` in your `trtllm-bench` command:

```bash
trtllm-bench \
@@ -19,12 +19,12 @@ trtllm-bench \
```
```{note}
-Similar to the PyTorch workflow, AutoDeploy does not require a separate `trtllm-bench build` step. The model is automatically optimized during the benchmark initialization phase.
+As in the PyTorch workflow, AutoDeploy does not require a separate `trtllm-bench build` step. The model is automatically optimized during benchmark initialization.
```
## Advanced Configuration

-For fine-tuned control over AutoDeploy's behavior during benchmarking, use the `--extra_llm_api_options` flag with a YAML configuration file:
+For more granular control over AutoDeploy's behavior during benchmarking, use the `--extra_llm_api_options` flag with a YAML configuration file:

```bash
trtllm-bench \
@@ -62,7 +62,7 @@ attn_backend: flashinfer
max_batch_size: 256
```

-Multi-gpu execution can be enabled by specifying `--tp n`, where `n` is the number of GPUs
+Enable multi-GPU execution by specifying `--tp n`, where `n` is the number of GPUs.

## Configuration Options Reference
@@ -82,12 +82,12 @@ Multi-gpu execution can be enabled by specifying `--tp n`, where `n` is the numb
| `cuda_graph_batch_sizes` | `null` | List of batch sizes for CUDA graph creation |
```{tip}
-For optimal performance with CUDA graphs, specify batch sizes that match your expected workload patterns. For example: `[1, 2, 4, 8, 16, 32, 64, 128]`
+For optimal CUDA graph performance, specify batch sizes that match your expected workload patterns. For example: `[1, 2, 4, 8, 16, 32, 64, 128]`
```
## Performance Optimization Tips

1. **Memory Management**: Set `free_mem_ratio` to 0.8-0.9 for optimal KV cache utilization
1. **Compilation Backend**: Use `torch-opt` for production workloads
1. **Attention Backend**: `flashinfer` generally provides the best performance for most models
-1. **CUDA Graphs**: Enable for batch sizes matching your production traffic patterns
+1. **CUDA Graphs**: Enable CUDA graphs for batch sizes that match your production traffic patterns.
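
Taken together, these tips map onto a single `--extra_llm_api_options` file. A minimal sketch is shown below; `free_mem_ratio`, `attn_backend`, `max_batch_size`, and `cuda_graph_batch_sizes` appear elsewhere on this page, while `compile_backend` is an assumed key name for selecting the `torch-opt` backend:

```yaml
# Illustrative AutoDeploy benchmarking config (autodeploy_config.yaml).
# compile_backend is an assumed option name; the other keys appear in this document.
compile_backend: torch-opt
free_mem_ratio: 0.9
attn_backend: flashinfer
max_batch_size: 256
cuda_graph_batch_sizes: [1, 2, 4, 8, 16, 32, 64, 128]
```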

docs/source/torch/auto_deploy/advanced/expert_configurations.md (12 additions, 15 deletions)

@@ -1,28 +1,25 @@
# Expert Configuration of LLM API

-For expert TensorRT-LLM users, we also expose the full set of `tensorrt_llm._torch.auto_deploy.llm_args.LlmArgs`
-*at your own risk* (the argument list diverges from TRT-LLM's argument list):
+For advanced TensorRT-LLM users, the full set of `tensorrt_llm._torch.auto_deploy.llm_args.LlmArgs` is exposed. Use at your own risk. The argument list may diverge from the standard TRT-LLM argument list.

-- All config fields that are used by the AutoDeploy core pipeline (i.e. the `InferenceOptimizer`) are
-_exclusively_ exposed in the `AutoDeployConfig` in `tensorrt_llm._torch.auto_deploy.llm_args`.
+- All configuration fields used by the AutoDeploy core pipeline, the `InferenceOptimizer`, are exposed exclusively in `AutoDeployConfig` in `tensorrt_llm._torch.auto_deploy.llm_args`.
Please make sure to refer to those first.
-- For expert users we expose the full set of `LlmArgs` in `tensorrt_llm._torch.auto_deploy.llm_args`
-that can be used to configure the AutoDeploy `LLM` API including runtime options.
+- For advanced users, the full set of `LlmArgs` in `tensorrt_llm._torch.auto_deploy.llm_args` can be used to configure the AutoDeploy `LLM` API, including runtime options.
- Note that some fields in the full `LlmArgs`
object are overlapping, duplicated, and/or _ignored_ in AutoDeploy, particularly arguments
pertaining to configuring the model itself since AutoDeploy's model ingestion+optimize pipeline
significantly differs from the default manual workflow in TensorRT-LLM.
- However, with the proper care the full `LlmArgs`
objects can be used to configure advanced runtime options in TensorRT-LLM.
-- Note that any valid field can be simply provided as keyword argument ("`**kwargs`") to the AutoDeploy `LLM` API.
+- Any valid field can simply be provided as a keyword argument (`**kwargs`) to the AutoDeploy `LLM` API.
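
As a brief illustration of this `**kwargs` pass-through (a sketch only; `world_size` and `max_batch_size` appear elsewhere in these docs, and any other field name should be checked against `LlmArgs`):

```python
from tensorrt_llm._torch.auto_deploy import LLM

# Sketch: runtime options forwarded directly as **kwargs to the AutoDeploy LLM API.
llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    world_size=2,        # multi-GPU execution, as in the CLI examples below
    max_batch_size=256,  # runtime option from the full LlmArgs set (assumed spelling)
)
```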
# Expert Configuration of `build_and_run_ad.py`

-For expert users, `build_and_run_ad.py` provides advanced configuration capabilities through a flexible argument parser powered by PyDantic Settings and OmegaConf. You can use dot notation for CLI arguments, provide multiple YAML configuration files, and leverage sophisticated configuration precedence rules to create complex deployment configurations.
+For advanced users, `build_and_run_ad.py` provides advanced configuration capabilities using a flexible argument parser powered by Pydantic Settings and OmegaConf. You can use dot notation for CLI arguments, provide multiple YAML configuration files, and apply sophisticated configuration precedence rules to create complex deployment configurations.

## CLI Arguments with Dot Notation

-The script supports flexible CLI argument parsing using dot notation to modify nested configurations dynamically. You can target any field in both the `ExperimentConfig` in `examples/auto_deploy/build_and_run_ad.py` and nested `AutoDeployConfig`/`LlmArgs` objects in `tensorrt_llm._torch.auto_deploy.llm_args`:
+The script supports flexible CLI argument parsing using dot notation to modify nested configurations dynamically. You can target any field in both the `ExperimentConfig` in `examples/auto_deploy/build_and_run_ad.py` and nested `AutoDeployConfig` or `LlmArgs` objects in `tensorrt_llm._torch.auto_deploy.llm_args`:

```bash
# Configure model parameters
@@ -35,7 +32,7 @@ python build_and_run_ad.py \
--args.model-kwargs.hidden-size=2048 \
--args.tokenizer-kwargs.padding-side=left

-# Configure runtime and backend settings
+# Configure runtime and backend options
python build_and_run_ad.py \
--model "TinyLlama/TinyLlama-1.1B-Chat-v1.0" \
--args.world-size=2 \
@@ -55,7 +52,7 @@ python build_and_run_ad.py \
## YAML Configuration Files

-Both `ExperimentConfig` and `AutoDeployConfig`/`LlmArgs` inherit from `DynamicYamlMixInForSettings`, enabling you to provide multiple YAML configuration files that are automatically deep-merged at runtime.
+Both `ExperimentConfig` and `AutoDeployConfig`/`LlmArgs` inherit from `DynamicYamlMixInForSettings`, which enables you to provide multiple YAML configuration files that are automatically deep-merged at runtime.

Create a YAML configuration file (e.g., `my_config.yaml`):
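
For illustration, a hypothetical `my_config.yaml` mirroring the dot-notation CLI examples above might contain (underscore key spellings are assumed):

```yaml
# Hypothetical my_config.yaml -- mirrors the CLI dot-notation examples above.
args:
  world_size: 2
  model_kwargs:
    hidden_size: 2048
  tokenizer_kwargs:
    padding_side: left
```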
-The configuration system follows a strict precedence order where higher priority sources override lower priority ones:
+The configuration system follows a precedence order in which higher priority sources override lower priority ones:

1. **CLI Arguments** (highest priority) - Direct command line arguments
1. **YAML Configs** - Files specified via `--yaml-configs` and `--args.yaml-configs`
1. **Default Settings** (lowest priority) - Built-in defaults from the config classes

-**Deep Merging**: Unlike simple overwriting, deep merging intelligently combines nested dictionaries recursively. For example:
+**Deep Merging**: Unlike simple overwriting, deep merging recursively combines nested dictionaries. For example:

```yaml
# Base config
@@ -152,7 +149,7 @@ args:
world_size: 4 # This gets added
```
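
As a compact illustration of the same merge behavior (file names and values here are hypothetical):

```yaml
# base.yaml
args:
  model_kwargs:
    hidden_size: 1024
  world_size: 2
---
# override.yaml (passed after base.yaml)
args:
  model_kwargs:
    num_hidden_layers: 12
---
# Effective merged result: nested dictionaries are combined key by key.
args:
  model_kwargs:
    hidden_size: 1024      # kept from base.yaml
    num_hidden_layers: 12  # added by override.yaml
  world_size: 2
```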

-**Nested Config Behavior**: When using nested configurations, outer YAML configs become init settings for inner objects, giving them higher precedence:
+**Nested Config Behavior**: When using nested configurations, outer YAML configuration files become initialization settings for inner objects, giving them higher precedence:

```bash
# The outer yaml-configs affects the entire ExperimentConfig
@@ -166,7 +163,7 @@ python build_and_run_ad.py \
## Built-in Default Configuration

-Both `AutoDeployConfig` and `LlmArgs` classes automatically load a built-in `default.yaml` configuration file that provides sensible defaults for the AutoDeploy inference optimizer pipeline. This file is specified in the `_get_config_dict()` function in `tensorrt_llm._torch.auto_deploy.llm_args` and defines default transform configurations for graph optimization stages.
+Both `AutoDeployConfig` and `LlmArgs` classes automatically load a built-in `default.yaml` configuration file that provides defaults for the AutoDeploy inference optimizer pipeline. This file is specified in the `_get_config_dict()` function in `tensorrt_llm._torch.auto_deploy.llm_args` and defines default transform configurations for graph optimization stages.

The built-in defaults are automatically merged with your configurations at the lowest priority level, ensuring that your custom settings always override the defaults. You can inspect the current default configuration to understand the baseline transform pipeline:
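
One way to inspect those defaults is sketched below; the construction arguments and the dump call are assumptions, since only the module path and `_get_config_dict()` are named here:

```python
# Sketch: print the effective AutoDeploy defaults (exact API details are assumptions).
from tensorrt_llm._torch.auto_deploy.llm_args import AutoDeployConfig

cfg = AutoDeployConfig(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
print(cfg.model_dump())  # includes transform settings merged from the built-in default.yaml
```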

docs/source/torch/auto_deploy/advanced/workflow.md (3 additions, 5 deletions)

@@ -1,8 +1,8 @@
### Incorporating `auto_deploy` into your own workflow

-AutoDeploy can be seamlessly integrated into your existing workflows using TRT-LLM's LLM high-level API. This section provides a blueprint for configuring and invoking AutoDeploy within your custom applications.
+AutoDeploy can be seamlessly integrated into existing workflows using TRT-LLM's LLM high-level API. This section provides an example of configuring and invoking AutoDeploy in custom applications.

-Here is an example of how you can build an LLM object with AutoDeploy integration:
+The following example demonstrates how to build an LLM object with AutoDeploy integration:

```
from tensorrt_llm._torch.auto_deploy import LLM
@@ -27,6 +27,4 @@ llm = LLM(
```

-Please consult the AutoDeploy `LLM` API in `tensorrt_llm._torch.auto_deploy.llm` and the
-`AutoDeployConfig` class in `tensorrt_llm._torch.auto_deploy.llm_args`
-for more detail on how AutoDeploy is configured via the `**kwargs` of the `LLM` API.
+For more information about configuring AutoDeploy via the `LLM` API using `**kwargs`, see the AutoDeploy `LLM` API in `tensorrt_llm._torch.auto_deploy.llm` and the `AutoDeployConfig` class in `tensorrt_llm._torch.auto_deploy.llm_args`.
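
Once constructed, the AutoDeploy-backed object behaves like any other LLM-API instance. A brief usage sketch (method names follow the standard TRT-LLM LLM API; the prompt and constructor arguments are illustrative):

```python
from tensorrt_llm._torch.auto_deploy import LLM

# Sketch: build the LLM as in the example above, then generate as usual.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
outputs = llm.generate(["What is the capital of France?"])
print(outputs[0].outputs[0].text)  # text of the first completion for the first prompt
```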

docs/source/torch/auto_deploy/auto-deploy.md (11 additions, 12 deletions)

@@ -1,22 +1,21 @@
# AutoDeploy
```{note}
-Note:
-This project is in active development and is currently in a prototype stage. The code is experimental, subject to change, and may include backward-incompatible updates. While we strive for correctness, we provide no guarantees regarding functionality, stability, or reliability.
+This project is under active development and is currently in a prototype stage. The code is experimental, subject to change, and may include backward-incompatible updates. While we strive for correctness, there are no guarantees regarding functionality, stability, or reliability.
```

-<h4> Seamless Model Deployment from PyTorch to TRT-LLM</h4>
+### Seamless Model Deployment from PyTorch to TensorRT-LLM

-AutoDeploy is a prototype designed to simplify and accelerate the deployment of PyTorch models, including off-the-shelf models like those from HuggingFace transformers library, to TensorRT-LLM.
+AutoDeploy is a prototype designed to simplify and accelerate the deployment of PyTorch models, including off-the-shelf models such as those from the Hugging Face Transformers library, to TensorRT-LLM.
<sub><em>AutoDeploy overview and relation with TensorRT-LLM's LLM API</em></sub>

-AutoDeploy provides an alternative path for deploying models using the LLM API that does not require users to rewrite the source model (e.g., HuggingFace Transformers models) or manually implement various inference optimizations such as KV-caches, multi-GPU parallelism, quantization, etc. Instead, AutoDeploy extracts a computation graph from the source model and applies inference optimizations through a series of automated graph transformations. AutoDeploy generates an inference-optimized graph that can be directly executed in the TensorRT-LLM PyTorch runtime and leverages various runtime optimizations including in-flight batching, paging, and overlap scheduling.
+AutoDeploy provides an alternative method for deploying models using the LLM API without requiring code changes to the source model (for example, Hugging Face Transformers models) or manual implementation of inference optimizations, such as KV-caches, multi-GPU parallelism, or quantization. Instead, AutoDeploy extracts a computation graph from the source model and applies inference optimizations through a series of automated graph transformations. AutoDeploy generates an inference-optimized graph that can be directly executed in the TensorRT-LLM PyTorch runtime and leverages various runtime optimizations including in-flight batching, paging, and overlap scheduling.

-### **Key Features:**
+### Key Features

-- **Seamless Model Translation:** Automatically converts PyTorch/Hugging Face models to TRT-LLM without manual rewrites.
+- **Seamless Model Translation:** Automatically converts PyTorch/Hugging Face models to TensorRT-LLM without manual rewrites.
- **Unified Model Definition:** Maintain a single source of truth with your original PyTorch/Hugging Face model.
- **Optimized Inference:** Built-in transformations for sharding, quantization, KV-cache integration, MHA fusion, and CudaGraph optimization.
- **Immediate Deployment:** Day-0 support for models with continuous performance enhancements.
@@ -26,7 +25,7 @@ AutoDeploy provides an alternative path for deploying models using the LLM API t
1. **Install AutoDeploy:**

-AutoDeploy is accessible through TRT-LLM installation.
+AutoDeploy is included with the TRT-LLM installation.
@@ -36,9 +35,9 @@ You can refer to [TRT-LLM installation guide](../../installation/linux.md) for m
2. **Run Llama Example:**

-You are ready to run an in-framework LLama Demo now.
+You are now ready to run an in-framework Llama demo.

-The general entrypoint to run the auto-deploy demo is the `build_and_run_ad.py` script, Checkpoints are loaded directly from Huggingface (HF) or a local HF-like directory:
+The general entry point for running the AutoDeploy demo is the `build_and_run_ad.py` script. Checkpoints are loaded directly from Hugging Face (HF) or a local HF-like directory:
```bash
cd examples/auto_deploy
@@ -51,15 +50,15 @@ AutoDeploy streamlines the model deployment process through an automated workflo
The exported graph then undergoes a series of automated transformations, including graph sharding, KV-cache insertion, and GEMM fusion, to optimize model performance. After these transformations, the graph is compiled using one of the supported compile backends (like `torch-opt`), followed by deploying it via the TensorRT-LLM runtime.

-- [Supported Matrix](support_matrix.md)
+- [Support Matrix](support_matrix.md)

## Advanced Usage

- [Example Run Script](./advanced/example_run.md)
- [Logging Level](./advanced/logging.md)
- [Incorporating AutoDeploy into Your Own Workflow](./advanced/workflow.md)