[AutoDeploy] dist_ops revisited #96
base: feat/ad-2025-07-07
Conversation
lucaslie commented Jul 18, 2025
- Added separate dist ops for torch and trtllm
- Added configurability for choosing the backend (torch, trtllm, or automatic = previous default) and the trtllm all-reduce strategy
Signed-off-by: Lucas Liebenwein <[email protected]>
Pull Request Overview
This PR refactors the distributed operations implementation, separating the torch and TensorRT-LLM backends and improving configurability. The changes rename the existing dist ops to make their backend explicit (dropping "dist" from the name) and add support for choosing between the torch, TensorRT-LLM, or automatic backend.
Key changes:
- Renamed distributed operations from `torch_dist_*` to `torch_*` and introduced separate `trtllm_*` ops
- Added configurable backend selection with a `DistBackend` enum (auto, torch, trtllm), sketched below
- Introduced TensorRT-LLM all-reduce strategy configuration support
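A minimal sketch of what that backend selection could look like, based only on the names in this summary; the actual `DistBackend` definition and resolution logic in the PR may differ:

```python
# Sketch only, assuming the enum values named above (auto, torch, trtllm);
# the real DistBackend in the PR may be defined and resolved differently.
from enum import Enum


class DistBackend(Enum):
    """Selects which distributed-op implementation the transforms emit."""

    AUTO = "auto"      # previous default: prefer trtllm ops when available
    TORCH = "torch"    # always emit torch_* collective ops
    TRTLLM = "trtllm"  # always emit trtllm_* collective ops


def resolve_backend(requested: DistBackend, trtllm_available: bool) -> DistBackend:
    """Resolve AUTO to a concrete backend at transformation time."""
    if requested is not DistBackend.AUTO:
        return requested
    return DistBackend.TRTLLM if trtllm_available else DistBackend.TORCH
```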
Reviewed Changes
Copilot reviewed 20 out of 21 changed files in this pull request and generated 3 comments.
File | Description
---|---
Multiple test files | Update test expectations to use new torch_* operation names
sharding.py | Add backend selection logic and configuration options
collectives.py | Rename function and update to use TensorRT-LLM ops for fusion
torch_dist.py | Rename ops and add fused linear all-reduce implementation
trtllm_dist.py | New file implementing TensorRT-LLM specific distributed operations
linear.py | Remove fused linear all-reduce (moved to torch_dist.py)
distributed/ | Restructure distributed module organization
Comments suppressed due to low confidence (1)
tensorrt_llm/_torch/auto_deploy/transformations/library/collectives.py:18
- [nitpick] The function name 'fuse_torch_allreduce' is inconsistent with the previous name 'fuse_collectives'. The new name is more specific but the docstring and TODO comment suggest this function may have broader applicability beyond just torch allreduce.
def fuse_torch_allreduce(gm: GraphModule) -> None:
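For context, a rough sketch of what a pass with this signature might do; the pattern matching below and the fused op name (`torch.ops.auto_deploy.fused_linear_all_reduce`) are assumptions for illustration, not the actual implementation in collectives.py:

```python
import torch
from torch.fx import GraphModule


def fuse_torch_allreduce(gm: GraphModule) -> None:
    """Fuse `linear -> all_reduce` pairs into a single fused custom op (illustrative only)."""
    for node in list(gm.graph.nodes):
        # Look for an all_reduce call whose input is produced by a linear call.
        if node.op != "call_function" or "all_reduce" not in str(node.target):
            continue
        producer = node.args[0]
        if getattr(producer, "op", None) != "call_function" or "linear" not in str(producer.target):
            continue
        with gm.graph.inserting_before(producer):
            fused = gm.graph.call_function(
                torch.ops.auto_deploy.fused_linear_all_reduce,  # placeholder op name
                args=producer.args,
                kwargs=producer.kwargs,
            )
        node.replace_all_uses_with(fused)
        gm.graph.erase_node(node)
        if not producer.users:  # only drop the linear if nothing else consumes it
            gm.graph.erase_node(producer)
    gm.graph.lint()
    gm.recompile()
```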
Two resolved (outdated) review comments on .../auto_deploy/unit/multigpu/transformations/library/test_allreduce_residual_rmsnorm_fusion.py
Signed-off-by: Lucas Liebenwein <[email protected]>
If it's not too difficult, could you run trtllm-bench for a model like llama-8B, tp2 before and after this change to make sure there are no regressions?
from torch._ops import OpOverloadPacket
from torch.fx import GraphModule, Node

from .....functional import AllReduceStrategy
Won't this introduce a strong coupling between trtllm code and the sharding transform (which to a large extent can be agnostic to the runtime choice)?
We can think about a more generic way to configure it that doesn't require importing that enum object, if you want?
For example, we can just configure the strategy as an int to keep it independent.
rank: int
world_size: int
dist_backend: DistBackend = DistBackend.AUTO
trtllm_allreduce_strategy: AllReduceStrategy = AllReduceStrategy.AUTO
Suggested change:
- trtllm_allreduce_strategy: AllReduceStrategy = AllReduceStrategy.AUTO
+ trtllm_allreduce_strategy: int = 0
@suyoggupta it would just be like this then, and we would convert it to the AllReduceStrategy enum inside the trtllm-specific custom op?
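A sketch of that proposal, assuming the config carries a plain int and the trtllm-specific custom op converts it back; the wrapper name and signature here are hypothetical:

```python
# Hypothetical wrapper: the transform/config side only carries an int, so it
# stays decoupled from trtllm; the conversion to the enum happens here.
import torch
from tensorrt_llm.functional import AllReduceStrategy


def trtllm_allreduce(tensor: torch.Tensor, strategy: int = AllReduceStrategy.AUTO.value) -> torch.Tensor:
    trtllm_strategy = AllReduceStrategy(strategy)  # int -> enum, only inside the trtllm op
    # ... dispatch to the TensorRT-LLM allreduce kernel with trtllm_strategy ...
    raise NotImplementedError("sketch only")
```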
1. Drop-in replacement for torch.distributed to ensure that any function in torch.distributed works
   out of the box.
2. Provide a simple interface to spawn multiple processes and communicate with them. We support
   three supports:
"We support three supports:" ==> "three modes"?
@suyoggupta
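For reference, the two points in the quoted docstring above can be illustrated with standard torch APIs; the helper name below (`spawn_multiprocess_job`) is an assumption, not necessarily what the module defines:

```python
import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def _worker(rank: int, world_size: int) -> None:
    # Point 1: any torch.distributed function works out of the box once the
    # process group is initialized.
    dist.init_process_group(
        backend="gloo", init_method="tcp://127.0.0.1:29501",
        rank=rank, world_size=world_size,
    )
    t = torch.ones(1) * rank
    dist.all_reduce(t)
    dist.destroy_process_group()


def spawn_multiprocess_job(job, world_size: int) -> None:
    # Point 2: a simple interface to spawn `world_size` processes running job(rank, world_size).
    mp.spawn(job, args=(world_size,), nprocs=world_size, join=True)


if __name__ == "__main__":
    spawn_multiprocess_job(_worker, world_size=2)
```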