
Commit 8bfdb31

Merge branch 'main' into andrewch-broken-links
Accept feedback from coderabbitai

Signed-off-by: Andrew Chen <[email protected]>

2 parents ee7a02a + 8062e0f · commit 8bfdb31

File tree

14 files changed: +225 -85 lines changed


docs/source/advanced/expert-parallelism.md

Lines changed: 2 additions & 3 deletions
@@ -4,7 +4,7 @@
 ## Mixture of Experts (MoE)
 
-Mixture of Experts (MoE) architectures have been used widely recently, such as [Mistral Mixtral 8x7B](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1). Specifically, MOE’s structure supports multiple parallel Feedforward Neural Network (FFN) layers (called experts) to replace the single FFN layer in the dense model. When tokens arrive, the router layer selects the TopK experts for each token. The corresponding hidden state of the token is then dispatched to the selected TopK experts, respectively. As a result, there are multiple tokens’ hidden states that are dispatched to each expert.
+Mixture of Experts (MoE) architectures have become widespread, with models such as [Mistral Mixtral 8×7B](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1). Specifically, MoE’s structure supports multiple parallel feed-forward neural-network (FFN) layers (called experts) in place of the single FFN layer in a dense model. When tokens arrive, the router layer selects the top-k experts for each token, and the corresponding hidden state of each token is dispatched to those experts. As a result, there are multiple tokens’ hidden states that are dispatched to each expert.
 
 <img src="https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/media/moe_structure.png?raw=true" alt="moe_structure" width="500" height="auto">
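
To make the routing step described above concrete, here is a minimal PyTorch-style sketch of top-k expert dispatch. It is an illustrative toy, not TensorRT-LLM's MoE implementation; all tensor names, shapes, and the single-linear "expert" are assumptions.

```python
# Toy sketch of MoE top-k routing and dispatch (illustrative only).
import torch

num_tokens, hidden_size = 8, 16
num_experts, top_k = 4, 2

hidden_states = torch.randn(num_tokens, hidden_size)
router = torch.nn.Linear(hidden_size, num_experts)            # router layer
experts = torch.nn.ModuleList(                                 # simplified "expert" FFNs
    torch.nn.Linear(hidden_size, hidden_size) for _ in range(num_experts))

logits = router(hidden_states)                                 # [num_tokens, num_experts]
probs = torch.softmax(logits, dim=-1)
weights, expert_ids = torch.topk(probs, top_k, dim=-1)         # top-k experts per token

output = torch.zeros_like(hidden_states)
for e in range(num_experts):
    token_idx, slot = (expert_ids == e).nonzero(as_tuple=True) # tokens routed to expert e
    if token_idx.numel() > 0:
        # Each expert processes only the hidden states dispatched to it,
        # scaled by the router weight for that (token, expert) pair.
        output[token_idx] += weights[token_idx, slot].unsqueeze(-1) * experts[e](hidden_states[token_idx])
```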

@@ -23,9 +23,8 @@ When both Tensor Parallel and Expert Parallel are enabled, each GPU handles a po
 ## How to Enable
 
-The default parallel pattern is Tensor Parallel. You can enable Expert Parallel or hybrid parallel by setting `--moe_tp_size` and `--moe_ep_size` when calling `convert_coneckpoint.py`. If only `--moe_tp_size` is provided, TRT-LLM will use Tensor Parallel for the MoE model; if only `--moe_ep_size` is provided, TRT-LLM will use Expert Parallel; if both are provided, the hybrid parallel will be used.
+The default parallel pattern is Tensor Parallel. You can enable Expert Parallel or hybrid parallel by setting `--moe_tp_size` and `--moe_ep_size` when calling `convert_checkpoint.py`. If only `--moe_tp_size` is provided, TRT-LLM will use Tensor Parallel for the MoE model; if only `--moe_ep_size` is provided, TRT-LLM will use Expert Parallel; if both are provided, the hybrid parallel will be used.
 
 Ensure the product of `moe_tp_size` and `moe_ep_size` equals `tp_size`, since the total MoE parallelism across all GPUs must match the parallelism used in the rest of the model.
 
 The other parameters related to the MoE structure, such as `num_experts_per_tok` (the top-k value discussed above) and `num_local_experts`, can be found in the model’s configuration file, such as the one for the [Mixtral 8x7B model](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1/blob/main/config.json).
-)
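
As a quick illustration of the constraint above, the following sketch enumerates the `(moe_tp_size, moe_ep_size)` pairs whose product equals a given `tp_size`. The helper is hypothetical and not part of TensorRT-LLM.

```python
# Hypothetical helper (not part of TensorRT-LLM): list the (moe_tp_size, moe_ep_size)
# pairs whose product equals tp_size, as required for MoE hybrid parallelism.
def valid_moe_parallel_configs(tp_size: int) -> list[tuple[int, int]]:
    return [(tp, tp_size // tp) for tp in range(1, tp_size + 1) if tp_size % tp == 0]

# For example, tp_size=8 allows (1, 8), (2, 4), (4, 2), and (8, 1).
print(valid_moe_parallel_configs(8))
```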

docs/source/advanced/speculative-decoding.md

Lines changed: 3 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -60,7 +60,8 @@ These tokens are then forwarded to the Target model for verification.
6060
Upon verification, the Target model may return up to `K+1` tokens.
6161
Subsequently, the prompt, now updated with the accepted tokens, is sent back to the Draft model to initiate the generation of new draft tokens.
6262
This iterative process continues until a predefined stop conditions are met.
63-
An example of this orchestration process can be found in the [TensorRT-LLM Triton backend](https://github.com/triton-inference-server/tensorrtllm_backend).
63+
An example orchestration script is available in the Triton backend repository’s
64+
[draft-target-model client example](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/client/python/draft_target_model_client.py).
6465

6566
We provide two styles of running Draft-Target-Model now: using TensorRT-LLM-BLS in Triton Inference Server, or using TensorRT-LLM directly. Detailed steps of running can be found in [examples/draft_target_model/README.md](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/draft_target_model/README.md) and the code can be found in [examples/ngram/run_dtm_ngram.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/ngram/run_dtm_ngram.py).
6667
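
For intuition, here is a minimal sketch of the draft-verify loop described above, assuming placeholder `draft_model` and `target_model` callables rather than the actual TensorRT-LLM or Triton APIs.

```python
# Illustrative Draft-Target-Model loop (placeholder callables, not real APIs).
def speculative_decode(prompt_tokens, draft_model, target_model, k, max_new_tokens, eos_id):
    tokens = list(prompt_tokens)
    while len(tokens) - len(prompt_tokens) < max_new_tokens:
        draft = draft_model(tokens, num_tokens=k)   # Draft model proposes up to K tokens.
        accepted = target_model(tokens, draft)      # Target verifies; returns up to K+1 tokens.
        tokens.extend(accepted)                     # Updated sequence goes back to the draft model.
        if eos_id in accepted:                      # One possible stop condition.
            break
    return tokens
```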

@@ -172,7 +173,7 @@ Similarly to ReDrafter, TensorRT-LLM implements the EAGLE model such that logits
 ### Disaggregated Serving
 
-[Disaggregated Serving](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/advanced/disaggregated-service.md) with EAGLE3 using the two model approach is supported in the Pytorch backend.
+[Disaggregated Serving](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/advanced/disaggregated-service.md) with EAGLE-3 using the two-model approach is supported in the PyTorch backend.
 
 ## Lookahead Decoding

docs/source/blogs/Falcon180B-H200.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -33,7 +33,7 @@ Often quantization can have adverse impacts on the accuracy of the model,
3333
however, TensorRT-LLM's AWQ decreases memory footprint of the model by **4x**
3434
while maintaining high accuracy.
3535

36-
<img src="https://github.com/NVIDIA/TensorRT-LLM/blob/5aec7af45fc0abd876fa68a9ae8c8cae084f3af3/docs/source/blogs/media/Falcon180B-H200_acc.png" alt="Falcon-180B accuracy comparison" width="600" height="auto">
36+
<img src="https://github.com/NVIDIA/TensorRT-LLM/blob/5aec7af45fc0abd876fa68a9ae8c8cae084f3af3/docs/source/blogs/media/Falcon180B-H200_acc.png?raw=true" alt="Falcon-180B accuracy comparison" width="600" height="auto">
3737

3838

3939
<sup>Preliminary measured accuracy, subject to change. </sup>

docs/source/release-notes.md

Lines changed: 1 addition & 1 deletion
@@ -1045,7 +1045,7 @@ Refer to the {ref}`support-matrix-software` section for a list of supported mode
 - System prompt caching
 - Enabled split-k for weight-only cutlass kernels
 - FP8 KV cache support for XQA kernel
-- New Python builder API and `trtllm-build` command and OPT
+- Added Python builder API, `trtllm-build` command, and OPT support
 - Support `StoppingCriteria` and `LogitsProcessor` in Python generate API
 - FHMA support for chunked attention and paged KV cache
 - Performance enhancements include:

tensorrt_llm/_torch/pyexecutor/llm_request.py

Lines changed: 14 additions & 0 deletions
@@ -477,3 +477,17 @@ def executor_request_to_llm_request(
         py_multimodal_data=getattr(executor_request, "py_multimodal_data",
                                    None))
     return llm_request
+
+
+def get_draft_token_length(request: LlmRequest) -> int:
+    """Get the length of draft tokens for a given request.
+
+    Args:
+        request: The LlmRequest to get draft token length for
+
+    Returns:
+        The number of draft tokens, or 0 if no draft tokens exist
+    """
+    if request.py_draft_tokens is not None:
+        return len(request.py_draft_tokens)
+    return 0
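
A hypothetical usage sketch for the new helper follows; it is illustrative only, since real `LlmRequest` objects are produced by the executor (for example via `executor_request_to_llm_request`) rather than constructed by hand.

```python
# Illustrative use of the new helper; the wrapper function below is hypothetical.
from tensorrt_llm._torch.pyexecutor.llm_request import (LlmRequest,
                                                        get_draft_token_length)

def has_draft_tokens(request: LlmRequest) -> bool:
    # get_draft_token_length returns 0 when request.py_draft_tokens is None.
    return get_draft_token_length(request) > 0
```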
