[feat] TP Sharding read from the model config (fixes #6342) #117
base: feat/ad-2025-07-22
Conversation
tensorrt_llm/_torch/auto_deploy/transformations/library/sharding.py (outdated review threads, resolved)
"o_proj", | ||
] | ||
if any(attn_name in module_name for attn_name in attn_names): | ||
min_local_shape = head_dim |
I see why you want head_dim here, but I don't think that's a good reason to break the factory/config <> graph transform abstraction.
The name matching is also a fragile way to infer whether min_local_shape is necessary, and it doesn't scale.
This really seems like a corner case that isn't worth addressing or complicating the code for.
If we use the factory sharding config, we should just use it and not build in extra sanity checks. The config should be executed as instructed.
@lucaslie but then we risk the KV-head problem we had with, e.g., Qwen:
https://github.com/huggingface/transformers/blob/main/src/transformers/models/qwen3_moe/configuration_qwen3_moe.py
(...)
num_key_value_heads=4,
(...)
"layers.*.self_attn.k_proj": "colwise",
So we need to prevent "sub-head" sharding. We can either get it from the config, or deduce it from attn_node.meta['val'].shape, but the latter is arguably even more fragile. Which one do you prefer?
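For concreteness, a minimal sketch of the "get it from the config" option (the field names follow the Hugging Face config linked above; the helper itself is hypothetical and not part of this PR):

# Hypothetical helper, not code from this PR: derive the minimum local shard
# size from the Hugging Face config instead of matching module names.
def min_local_shape_from_config(config, world_size: int) -> int:
    # head_dim is either given explicitly or derived from hidden_size.
    head_dim = getattr(config, "head_dim", None) or (
        config.hidden_size // config.num_attention_heads
    )
    # A "colwise" shard of k_proj/v_proj holds num_key_value_heads * head_dim
    # output features in total; refusing shards smaller than head_dim keeps
    # every KV head on a single rank.
    if config.num_key_value_heads % world_size != 0:
        raise ValueError(
            f"num_key_value_heads={config.num_key_value_heads} not divisible "
            f"by world_size={world_size}; colwise sharding would split a KV head"
        )
    return head_dim

With the Qwen3-MoE values above (num_key_value_heads=4), a tp_size of 8 would trip this check instead of silently slicing a KV head across ranks.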
Signed-off-by: greg-kwasniewski1 <[email protected]>
Force-pushed from 355b46b to 872e572
as discussed offline, let's do the following:
ROW = 0  # Split along rows (first dimension)
COLUMN = 1  # Split along columns (second dimension)
# NOTE: The names COLUMN/ROW reflect the hugging face
# base_tp_plan sharding notation, but since we assume Y = W @ X^T,
nit: Y = W^T * X?
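To make the convention question concrete, a small self-contained demo (illustrative only, not code from this PR; it assumes torch.nn.Linear semantics, i.e. y = x @ W^T with W stored as (out_features, in_features)). HF's "colwise" plan shards the output features, i.e. dim 0 of the stored weight; "rowwise" shards the input features, i.e. dim 1:

import torch

W = torch.arange(24, dtype=torch.float32).reshape(6, 4)  # (out=6, in=4)
x = torch.randn(2, 4)

# "colwise": each rank keeps a slice of the output features (rows of W);
# the local outputs are concatenated along the feature dimension.
W_rank0, W_rank1 = W.chunk(2, dim=0)
y_colwise = torch.cat([x @ W_rank0.T, x @ W_rank1.T], dim=-1)

# "rowwise": each rank keeps a slice of the input features (columns of W);
# the partial outputs are summed (an all-reduce in a real TP setup).
W_r0, W_r1 = W.chunk(2, dim=1)
x0, x1 = x.chunk(2, dim=-1)
y_rowwise = x0 @ W_r0.T + x1 @ W_r1.T

assert torch.allclose(y_colwise, x @ W.T)
assert torch.allclose(y_rowwise, x @ W.T, atol=1e-6)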
[#6342][feat] Applying sharding transformations from model config
Description
If base_model_tp_plan is present in the model config and ad_config.use_sharding_from_config == True, skip sharding-pattern detection and instead apply the sharding from the config.

Test Coverage

tests/unittest/_torch/auto_deploy/unit/multigpu/transformations/library/test_tp_sharding.py has been updated to test the new sharding logic.
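As an illustration of the described behavior (not the PR's actual implementation; the matching helper below is hypothetical, while the plan entries mirror the Hugging Face base_model_tp_plan notation quoted earlier):

import fnmatch
from typing import Optional

# Example plan in Hugging Face base_model_tp_plan notation.
base_model_tp_plan = {
    "layers.*.self_attn.q_proj": "colwise",
    "layers.*.self_attn.k_proj": "colwise",
    "layers.*.self_attn.v_proj": "colwise",
    "layers.*.self_attn.o_proj": "rowwise",
}

def sharding_action_for(module_name: str, tp_plan: dict) -> Optional[str]:
    """Return the plan entry matching module_name, or None to leave it unsharded."""
    for pattern, action in tp_plan.items():
        if fnmatch.fnmatch(module_name, pattern):
            return action
    return None

print(sharding_action_for("layers.3.self_attn.k_proj", base_model_tp_plan))  # colwise
print(sharding_action_for("layers.3.mlp.up_proj", base_model_tp_plan))       # None

When ad_config.use_sharding_from_config is set and such a plan exists, the transform would apply these entries directly rather than running the graph-based pattern detection.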