Conversation

lucaslie (Collaborator)

  • added separate dist ops for torch and trtllm
  • added configurability for choosing the backend (torch, trtllm, or automatic = previous default) and the trtllm all-reduce strategy

lucaslie added 2 commits July 18, 2025 13:05
Signed-off-by: Lucas Liebenwein <[email protected]>
Signed-off-by: Lucas Liebenwein <[email protected]>

Copilot AI left a comment

Pull Request Overview

This PR refactors the distributed operations implementation by separating the torch and TensorRT-LLM backends and improving configurability. The changes rename existing dist ops to make their backend explicit (dropping the "dist" segment, e.g. torch_dist_* becomes torch_*) and add support for choosing between torch, TensorRT-LLM, or automatic backend selection.

Key changes:

  • Renamed distributed operations from torch_dist_* to torch_* and introduced separate trtllm_* ops
  • Added configurable backend selection with DistBackend enum (auto, torch, trtllm); see the sketch after this list
  • Introduced TensorRT-LLM all-reduce strategy configuration support
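
A minimal sketch of what that backend selection could look like (the enum members follow the PR description; the string values, availability check, and helper names are illustrative assumptions, not the PR's actual code):

from enum import Enum


class DistBackend(Enum):
    AUTO = "auto"      # previous default behavior (assumed: prefer trtllm when available)
    TORCH = "torch"
    TRTLLM = "trtllm"


def is_trtllm_available() -> bool:
    # Hypothetical availability check; the real check in the PR may differ.
    try:
        import tensorrt_llm  # noqa: F401
        return True
    except ImportError:
        return False


def resolve_backend(backend: DistBackend) -> DistBackend:
    # AUTO resolves to trtllm when it is importable, otherwise falls back to torch.
    if backend is DistBackend.AUTO:
        return DistBackend.TRTLLM if is_trtllm_available() else DistBackend.TORCH
    return backend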

Reviewed Changes

Copilot reviewed 20 out of 21 changed files in this pull request and generated 3 comments.

Summary per file:

  • Multiple test files: update test expectations to use new torch_* operation names
  • sharding.py: add backend selection logic and configuration options
  • collectives.py: rename function and update to use TensorRT-LLM ops for fusion
  • torch_dist.py: rename ops and add fused linear all-reduce implementation
  • trtllm_dist.py: new file implementing TensorRT-LLM-specific distributed operations
  • linear.py: remove fused linear all-reduce (moved to torch_dist.py)
  • distributed/: restructure distributed module organization

Comments suppressed due to low confidence (1)

tensorrt_llm/_torch/auto_deploy/transformations/library/collectives.py:18

  • [nitpick] The function name 'fuse_torch_allreduce' is inconsistent with the previous name 'fuse_collectives'. The new name is more specific but the docstring and TODO comment suggest this function may have broader applicability beyond just torch allreduce.
def fuse_torch_allreduce(gm: GraphModule) -> None:

lucaslie changed the title from "Ll/dist ops revisited" to "[AutoDeploy] dist_ops revisited" on Jul 18, 2025
Signed-off-by: Lucas Liebenwein <[email protected]>
suyoggupta requested a review from galagam on July 18, 2025 21:31
suyoggupta

If it's not too difficult, could you run trtllm-bench for a model like llama-8B, tp2 before and after this change to make sure there are no regressions?
@nzmora-nvidia / @galagam: we should think about a good perf testing strategy here; unit tests that pass/fail based on some expected throughput may not be practical...

from torch._ops import OpOverloadPacket
from torch.fx import GraphModule, Node

from .....functional import AllReduceStrategy

won't this introduce a strong coupling between trtllm code and the sharding transform (which, to a large extent, can be agnostic to the runtime choice)?

lucaslie (Collaborator, Author) Jul 18, 2025

If you want, we can think about a more generic way to configure it that doesn't require importing that enum object.

lucaslie (Collaborator, Author)

For example, we could just configure the strategy as an int to keep it independent:

rank: int
world_size: int
dist_backend: DistBackend = DistBackend.AUTO
trtllm_allreduce_strategy: AllReduceStrategy = AllReduceStrategy.AUTO
lucaslie (Collaborator, Author)

Suggested change:

- trtllm_allreduce_strategy: AllReduceStrategy = AllReduceStrategy.AUTO
+ trtllm_allreduce_strategy: int = 0

@suyoggupta it would then just be like this, and we would convert the int to the AllReduceStrategy enum inside the trtllm-specific custom op?
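
A minimal sketch of what that conversion could look like inside the trtllm-specific op (the function name, signature, and default are illustrative assumptions; only the AllReduceStrategy import reflects actual trtllm code):

import torch

from tensorrt_llm.functional import AllReduceStrategy


def trtllm_allreduce(tensor: torch.Tensor, strategy: int = 0) -> torch.Tensor:
    # Convert the plain int from the transform config back into the trtllm
    # enum here, so the sharding transform stays decoupled from trtllm.
    strategy_enum = AllReduceStrategy(strategy)  # int -> enum, trtllm side only
    # ... dispatch to the trtllm all-reduce implementation using strategy_enum ...
    return tensor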

1. Drop-in replacement for torch.distributed to ensure that any function in torch.distributed works
out of the box.
2. Provide a simple interface to spawn multiple processes and communicate with them. We support
three supports:

Typo: "We support three supports:" ==> "three modes"?

nzmora-nvidia commented Jul 19, 2025

@suyoggupta
Re #96 (comment)
Why do you think testing perf at the model level is not practical?
