Conversation

@crcrpar crcrpar (Collaborator) commented Dec 3, 2025

What does this PR do?

KV values seem to remain intact even when tensor parallel is enabled, so this PR specifies tp_size in StaticCache.
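
For illustration, a minimal sketch of the intended usage, not the PR's actual diff. It assumes transformers >= 4.55, where StaticCache accepts a `tp_size` argument; the model name, cache sizes, and TP degree are placeholders, and the other keyword names follow StaticCache's long-documented signature (they may vary slightly across versions):

```python
# Minimal sketch, assuming transformers >= 4.55 where StaticCache accepts
# `tp_size`; model name, sizes, and tp_size below are placeholders.
import torch
from transformers import AutoConfig, StaticCache

config = AutoConfig.from_pretrained("meta-llama/Llama-3.1-8B")  # placeholder model
tp_size = 2  # placeholder tensor-parallel degree

cache = StaticCache(
    config=config,
    max_batch_size=1,
    max_cache_len=4096,
    device="cuda",
    dtype=torch.bfloat16,
    tp_size=tp_size,  # size the cache for the per-rank (sharded) KV head count
)
```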

Signed-off-by: Masaki Kozuki <[email protected]>
@crcrpar crcrpar requested a review from Copilot December 3, 2025 08:50
@crcrpar crcrpar marked this pull request as ready for review December 3, 2025 08:50
Copilot finished reviewing on behalf of crcrpar December 3, 2025 08:52
Copilot AI (Contributor) left a comment

Pull request overview

This PR enhances tensor parallel support in the inference benchmark by specifying the tp_size parameter when initializing StaticCache for transformers >= 4.55. The changes also fix tensor parallel plan patterns to be more specific and add sanity checks to verify proper sharding.

Key Changes

  • Fixed tensor parallel plan patterns from *.layers.* to model.layers.* for more precise module matching
  • Added tp_size parameter to StaticCache initialization to properly handle sharded KV heads in tensor parallel configurations
  • Added DTensor verification assertions for attention projection weights to ensure proper sharding (see the sketch after this list)
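
A rough sketch of the first and third changes, as placeholder code rather than the benchmark's actual diff: the module paths assume a Llama-style architecture, and `check_attention_sharding` and its `model` argument are hypothetical names introduced here for illustration.

```python
# Hypothetical sketch; module paths assume a Llama-style model.
from torch.distributed.tensor import DTensor  # torch.distributed._tensor on older PyTorch

# Narrower TP plan patterns: anchoring on "model.layers." avoids accidentally
# matching unrelated modules whose qualified names merely contain ".layers.".
tp_plan = {
    "model.layers.*.self_attn.q_proj": "colwise",
    "model.layers.*.self_attn.k_proj": "colwise",
    "model.layers.*.self_attn.v_proj": "colwise",
    "model.layers.*.self_attn.o_proj": "rowwise",
}


def check_attention_sharding(model) -> None:
    """Assert that every layer's attention projection weights were sharded to DTensor."""
    for layer in model.model.layers:
        for proj in ("q_proj", "k_proj", "v_proj", "o_proj"):
            weight = getattr(layer.self_attn, proj).weight
            assert isinstance(weight, DTensor), f"{proj} weight is not a DTensor"
```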

@shino16 shino16 (Collaborator) left a comment

Thanks! I remember facing the same issue on some older commits, and I was wondering how it could be reproduced.
