comm: Optimizations for TRTLLM MNNVL Allreduce (#1321)
<!-- .github/pull_request_template.md -->
## 📌 Description
This PR introduces a series of optimizations to
`trtllm_mnnvl_allreduce`. The same optimizations were also made in TensorRT-LLM by
[https://github.com/NVIDIA/TensorRT-LLM/pull/5934](https://github.com/NVIDIA/TensorRT-LLM/pull/5934)
and
[https://github.com/NVIDIA/TensorRT-LLM/pull/6237](https://github.com/NVIDIA/TensorRT-LLM/pull/6237).
- Use GPU array to pass the uc pointers in the mcast memory.
- Use L2 reduction to replace the expensive atomicAdd.
- Adjust the point of synchronization for buffer flag read.
- Optimize the Lamport polling performance.
- Clean up the code structure.
- Enhance the unit tests to cover more test cases.
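
For reviewers unfamiliar with the Lamport polling mentioned in the list above, the following is a minimal host-side Python sketch of the general idea, not the kernel code in this PR. It assumes the receive buffer is pre-filled with a sentinel value (negative zero here, which is an assumption for illustration) and that a reader treats a slot as ready once the sentinel has been overwritten, so no separate flag or atomic counter is needed. All names (`is_sentinel`, `make_buffer`, `poll_ready`, `lamport_reduce`) are hypothetical.

```python
import struct

# Bit pattern of float32 negative zero, used here as the "not yet written" sentinel.
# Assumption for illustration: real payloads never contain -0.0.
NEG_ZERO_BITS = 0x80000000


def is_sentinel(x: float) -> bool:
    """Return True if x has the float32 bit pattern of -0.0."""
    return struct.unpack("<I", struct.pack("<f", x))[0] == NEG_ZERO_BITS


def make_buffer(n: int) -> list:
    """Create a receive buffer with every slot set to the sentinel."""
    return [-0.0] * n


def poll_ready(buf: list) -> bool:
    """A buffer is ready once no slot still holds the sentinel."""
    return not any(is_sentinel(v) for v in buf)


def lamport_reduce(peer_buffers: list):
    """Sum the peers' buffers element-wise, or return None if data is still in flight."""
    if not all(poll_ready(b) for b in peer_buffers):
        return None
    return [sum(vals) for vals in zip(*peer_buffers)]


if __name__ == "__main__":
    buffers = [make_buffer(4), make_buffer(4)]
    print(lamport_reduce(buffers))       # nothing written yet
    buffers[0][:] = [1.0, 2.0, 3.0, 4.0]
    buffers[1][:] = [0.5, 0.5, 0.5, 0.5]
    print(lamport_reduce(buffers))       # both peers have written
```

In a real kernel the poll would be a spin loop over volatile loads on the GPU; the sketch only shows why readiness can be inferred from the data itself.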
## 🔍 Related Issues
<!-- Link any related issues here -->
## 🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.
### ✅ Pre-commit Checks
- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.
> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).
## 🧪 Tests
- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).
## Reviewer Notes
<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->