Conversation

lucaslie (Collaborator)

  • added separate dist ops for torch and trtllm
  • added configurability for choosing the backend (torch, trtllm, or automatic = previous default) and the trtllm all-reduce strategy

lucaslie added 2 commits July 18, 2025 13:05
Signed-off-by: Lucas Liebenwein <[email protected]>
Signed-off-by: Lucas Liebenwein <[email protected]>

Copilot AI left a comment

Pull Request Overview

This PR refactors the distributed operations implementation by separating the torch and TensorRT-LLM backends and improving configurability. The changes rename existing dist ops to make their backend explicit (dropping the "dist" segment, e.g. torch_dist_* becomes torch_*) and add support for choosing between torch, TensorRT-LLM, or automatic backend selection.

Key changes:

  • Renamed distributed operations from torch_dist_* to torch_* and introduced separate trtllm_* ops
  • Added configurable backend selection with DistBackend enum (auto, torch, trtllm); see the sketch after this list
  • Introduced TensorRT-LLM all-reduce strategy configuration support
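
A minimal sketch of what that backend selection could look like (the enum members follow the PR description; the string values, availability check, and helper names are illustrative assumptions, not the PR's actual code):

from enum import Enum


class DistBackend(Enum):
    AUTO = "auto"      # previous default behavior (assumed: prefer trtllm when available)
    TORCH = "torch"
    TRTLLM = "trtllm"


def is_trtllm_available() -> bool:
    # Hypothetical availability check; the real check in the PR may differ.
    try:
        import tensorrt_llm  # noqa: F401
        return True
    except ImportError:
        return False


def resolve_backend(backend: DistBackend) -> DistBackend:
    # AUTO resolves to trtllm when it is importable, otherwise falls back to torch.
    if backend is DistBackend.AUTO:
        return DistBackend.TRTLLM if is_trtllm_available() else DistBackend.TORCH
    return backend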

Reviewed Changes

Copilot reviewed 20 out of 21 changed files in this pull request and generated 3 comments.

Summary per file:

  • Multiple test files: update test expectations to use new torch_* operation names
  • sharding.py: add backend selection logic and configuration options
  • collectives.py: rename function and update to use TensorRT-LLM ops for fusion
  • torch_dist.py: rename ops and add fused linear all-reduce implementation
  • trtllm_dist.py: new file implementing TensorRT-LLM-specific distributed operations
  • linear.py: remove fused linear all-reduce (moved to torch_dist.py)
  • distributed/: restructure distributed module organization

Comments suppressed due to low confidence (1)

tensorrt_llm/_torch/auto_deploy/transformations/library/collectives.py:18

  • [nitpick] The function name 'fuse_torch_allreduce' is inconsistent with the previous name 'fuse_collectives'. The new name is more specific but the docstring and TODO comment suggest this function may have broader applicability beyond just torch allreduce.
def fuse_torch_allreduce(gm: GraphModule) -> None:

lucaslie changed the title from "Ll/dist ops revisited" to "[AutoDeploy] dist_ops revisited" on Jul 18, 2025
Signed-off-by: Lucas Liebenwein <[email protected]>
suyoggupta requested a review from galagam on July 18, 2025 21:31
suyoggupta

If it's not too difficult, could you run trtllm-bench for a model like llama-8B, tp2 before and after this change to make sure there are no regressions?
@nzmora-nvidia / @galagam: we should think about a good perf testing strategy here; unit tests that pass/fail based on some expected throughput may not be practical...

from torch._ops import OpOverloadPacket
from torch.fx import GraphModule, Node

from .....functional import AllReduceStrategy

won't this introduce a strong coupling between trtllm code and the sharding transform (which, to a large extent, can be agnostic to the runtime choice)?

lucaslie (Collaborator, Author) Jul 18, 2025

If you want, we can think about a more generic way to configure it that doesn't require importing that enum object.

lucaslie (Collaborator, Author)

For example, we could just configure the strategy as an int to keep it independent:

rank: int
world_size: int
dist_backend: DistBackend = DistBackend.AUTO
trtllm_allreduce_strategy: AllReduceStrategy = AllReduceStrategy.AUTO
lucaslie (Collaborator, Author)

Suggested change:

- trtllm_allreduce_strategy: AllReduceStrategy = AllReduceStrategy.AUTO
+ trtllm_allreduce_strategy: int = 0

@suyoggupta it would then just be like this, and we would convert the int to the AllReduceStrategy enum inside the trtllm-specific custom op?
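
A minimal sketch of what that conversion could look like inside the trtllm-specific op (the function name, signature, and default are illustrative assumptions; only the AllReduceStrategy import reflects actual trtllm code):

import torch

from tensorrt_llm.functional import AllReduceStrategy


def trtllm_allreduce(tensor: torch.Tensor, strategy: int = 0) -> torch.Tensor:
    # Convert the plain int from the transform config back into the trtllm
    # enum here, so the sharding transform stays decoupled from trtllm.
    strategy_enum = AllReduceStrategy(strategy)  # int -> enum, trtllm side only
    # ... dispatch to the trtllm all-reduce implementation using strategy_enum ...
    return tensor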

1. Drop-in replacement for torch.distributed to ensure that any function in torch.distributed works
out of the box.
2. Provide a simple interface to spawn multiple processes and communicate with them. We support
three supports:

Typo: "We support three supports:" ==> "three modes"?

nzmora-nvidia commented Jul 19, 2025

@suyoggupta
Re #96 (comment)
Why do you think testing perf at the model level is not practical?
