Add data parallel support #80
Conversation
Force-pushed from d4a4c41 to 5ad7ff8
Force-pushed from a355755 to e9bc231
Please resolve conflicts.
/run-gaudi-tests
Force-pushed from 7fdd7dd to c056e11
@adobrzyn Removed unused code and rebased. Please review again.
Force-pushed from c056e11 to 86c8e41
/run-gaudi-tests
Only codeowners can request to run Gaudi tests. Contact list: kzawora-intel, xuechendi, mswiniarsk, adobrzyn
Force-pushed from eae75cd to 1142665
@adobrzyn @xuechendi @mswiniarsk @kzawora-intel Please help review this. Thanks.
/run-gaudi-tests
It seems upstream vLLM changes broke the Gaudi plugin.
Force-pushed from 7ae55fd to afe1a72
/run-gaudi-tests
Pull Request Overview
This PR adds data parallel support for the V1 Gaudi plugin by implementing DP-aware padding mechanisms and collective operations.
- Implements DP-aware padding for prefill and decode batches to ensure consistent tensor shapes across data parallel ranks
- Adds collective communication operations (all_gather, reduce_scatter) for expert parallelism support
- Includes a comprehensive data parallel example script with multi-node support
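The DP-aware padding idea above can be sketched in a few lines. This is an illustrative, single-process sketch, not the plugin's actual API: the function name `dp_aware_pad` and the `pad_id` parameter are hypothetical, and in the real plugin the maximum batch size would be obtained via a collective (e.g. an all-reduce with MAX) rather than passed in as a list.

```python
def dp_aware_pad(local_token_ids: list[int],
                 dp_batch_sizes: list[int],
                 pad_id: int = 0) -> list[int]:
    """Pad this rank's batch up to the max batch size across DP ranks,
    so collective operations see matching tensor shapes on every rank."""
    target = max(dp_batch_sizes)
    padding = target - len(local_token_ids)
    return local_token_ids + [pad_id] * padding

# Rank 0 has 3 requests, rank 1 has 5: rank 0 pads with 2 dummy entries.
padded = dp_aware_pad([11, 12, 13], dp_batch_sizes=[3, 5])
# padded -> [11, 12, 13, 0, 0]
```

The same principle extends to the dummy-batch path: a rank with no work still executes a padded dummy batch so it participates in every collective.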
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
File | Description
---|---
vllm_gaudi/v1/worker/hpu_worker.py | Updates distributed initialization to handle data parallel configuration and adds dummy batch execution
vllm_gaudi/v1/worker/hpu_model_runner.py | Implements DP-aware padding logic and dummy batch creation for consistent execution across ranks
vllm_gaudi/platform.py | Adds simple compile backend configuration
vllm_gaudi/distributed/device_communicators/hpu_communicator.py | Implements dispatch/combine methods for expert parallelism with collective operations
tests/full_tests/ci_tests.sh | Adds CI test for data parallel functionality
examples/data_parallel.py | Provides a complete example demonstrating data parallel usage
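The dispatch/combine pattern the communicator implements can be illustrated with a conceptual, single-process simulation of the two collectives (the `_sim` function names are hypothetical; the real code issues device collectives, not Python loops). Dispatch all-gathers the per-rank token batches so every rank sees the full batch for expert computation; combine reduce-scatters the results so each rank receives the summed contributions for its own slice.

```python
def all_gather_sim(per_rank_chunks: list[list[int]]) -> list[list[int]]:
    """Every rank receives the concatenation of all ranks' chunks."""
    full = [x for chunk in per_rank_chunks for x in chunk]
    return [list(full) for _ in per_rank_chunks]

def reduce_scatter_sim(per_rank_inputs: list[list[int]],
                       world_size: int) -> list[list[int]]:
    """Element-wise sum across ranks, then split; rank i keeps slice i."""
    summed = [sum(vals) for vals in zip(*per_rank_inputs)]
    chunk = len(summed) // world_size
    return [summed[i * chunk:(i + 1) * chunk] for i in range(world_size)]

# Two DP ranks, two tokens each:
gathered = all_gather_sim([[1, 2], [3, 4]])   # both ranks see [1, 2, 3, 4]
outputs = reduce_scatter_sim([[1, 1, 1, 1], [2, 2, 2, 2]], world_size=2)
# rank 0 -> [3, 3], rank 1 -> [3, 3]
```

Note how reduce_scatter is the shape-inverse of all_gather, which is why the DP-aware padding above matters: both collectives require identical chunk sizes on every rank.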
- enable profile run
Signed-off-by: Wuxun Zhang <[email protected]>
Signed-off-by: Wuxun Zhang <[email protected]>
/run-gaudi-tests
Signed-off-by: Wuxun Zhang <[email protected]>
/run-gaudi-tests
Looks good overall, a couple of nitpicks here and there.
Signed-off-by: Wuxun Zhang <[email protected]>
looks good, thanks!
/run-gaudi-tests
Signed-off-by: Wuxun Zhang <[email protected]>
Force-pushed from 9508117 to fa5da77
/run-gaudi-tests
Signed-off-by: Wuxun Zhang <[email protected]>
Head branch was pushed to by a user without write access
Force-pushed from 3fd5295 to a7ca264
/run-gaudi-tests
The CI is unstable and 8 reruns were unsuccessful, all due to infra issues. I'm merging this as is; CI passed multiple times previously.
This is to add data parallel support for the V1 Gaudi plugin.
- [x] add DP-aware padding
- [x] use all_gather and reduce_scatter
- [x] add data parallel example

Signed-off-by: Wuxun Zhang <[email protected]>
Co-authored-by: Konrad Zawora <[email protected]>
Co-authored-by: Agata Dobrzyniewicz <[email protected]>
Signed-off-by: Katarzyna Fojcik <[email protected]>
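A data parallel example typically shards the input prompts so that each DP rank runs its own engine on an interleaved slice. The sketch below shows only that sharding idiom; the function name `shard_prompts` is hypothetical, and engine setup, environment configuration, and process launch are omitted, so this is a minimal sketch rather than the example script itself.

```python
def shard_prompts(prompts: list[str], dp_rank: int, dp_size: int) -> list[str]:
    """Return the interleaved slice of prompts owned by this DP rank
    (the common prompts[rank::dp_size] idiom)."""
    return prompts[dp_rank::dp_size]

prompts = ["p0", "p1", "p2", "p3", "p4"]
# With 2 ranks: rank 0 -> ["p0", "p2", "p4"], rank 1 -> ["p1", "p3"]
rank0 = shard_prompts(prompts, dp_rank=0, dp_size=2)
rank1 = shard_prompts(prompts, dp_rank=1, dp_size=2)
```

Interleaved slicing keeps per-rank load roughly balanced even when the prompt count is not divisible by the DP size.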